DAX LLM Benchmark
Which LLM is best at DAX?
DAXBench tests how models understand, write, and reason about DAX and Power BI.
Methodology designed by Maxim Anatsko.
Last updated: Mar 6, 2026
91 models · 30 tasks · Initial Release
Model Leaderboard
Ranked by score
| # | Model | Provider | Score |
|---|---|---|---|
| 1 | Gemini 3.1 Flash Lite Preview (HIGH) | Google | 97.4% |
| 2 | GPT-5.3 Chat | OpenAI | 96.2% |
| 3 | GLM 5 | Z.AI | 96.2% |
| 4 | Gemini 3.1 Pro Preview (HIGH) | Google | 93.8% |
| 5 | Qwen3.5-Flash (MED) | Qwen | 93.2% |
| 6 | Qwen3.5 397B A17B | Qwen | 90.3% |
| 7 | Qwen3.5 Plus 2026-02-15 (MED) | Qwen | 89.7% |
| 8 | GPT-5.3-Codex (HIGH) | OpenAI | 88.6% |
| 9 | gpt-oss-120b | OpenAI | 85.6% |
| 10 | GPT-5.1-Codex-Max | OpenAI | 85.0% |
| 11 | Claude Sonnet 4.6 (MED) | Anthropic | 84.5% |
| 12 | Gemini 3 Flash Preview | Google | 83.8% |
| 13 | Gemini 2.5 Flash Preview 09-2025 | Google | 83.8% |
| 14 | GPT-5.4 (HIGH) | OpenAI | 83.2% |
| 15 | Claude Opus 4.5 | Anthropic | 82.7% |
| 16 | Claude Opus 4.6 | Anthropic | 82.0% |
| 17 | R1 | DeepSeek | 81.3% |
| 18 | Claude Sonnet 4 | Anthropic | 81.3% |
| 19 | Gemini 3 Pro Preview | Google | 81.3% |
| 20 | o3 | OpenAI | 80.2% |
| 21 | GPT-5.2 Chat | OpenAI | 80.1% |
| 22 | DeepSeek V3.2 | DeepSeek | 79.9% |
| 23 | GPT-5.2 | OpenAI | 78.4% |
| 24 | Kimi K2 Thinking | Moonshot AI | 78.4% |
| 25 | Aurora Alpha | Openrouter | 78.2% |
| 26 | Claude Sonnet 4.5 | Anthropic | 77.9% |
| 27 | Grok 4 | xAI | 76.7% |
| 28 | Gemini 2.0 Flash | Google | 76.6% |
| 29 | Gemini 2.0 Flash Experimental (free) | Google | 76.6% |
| 30 | o4 Mini | OpenAI | 76.5% |
| 31 | Gemini 2.5 Flash | Google | 76.0% |
| 32 | DeepSeek V3.1 | DeepSeek | 75.9% |
| 33 | DeepSeek V3.2 Speciale | DeepSeek | 74.7% |
| 34 | Llama 4 Maverick | Meta | 74.4% |
| 35 | GPT-4o-mini (2024-07-18) | OpenAI | 74.1% |
| 36 | R1 0528 | DeepSeek | 73.7% |
| 37 | Grok Code Fast 1 | xAI | 73.7% |
| 38 | Nova Premier 1.0 | Amazon | 73.5% |
| 39 | DeepSeek V3 0324 | DeepSeek | 73.5% |
| 40 | KAT-Coder-Pro V1 (free) | Kwaipilot | 72.8% |
| 41 | Mercury 2 (HIGH) | Inception | 71.9% |
| 42 | DeepSeek R1T2 Chimera (free) | TNG | 71.8% |
| 43 | DeepSeek V3.1 Nex N1 (free) | Nex AGI | 71.8% |
| 44 | Qwen3 Coder 480B A35B (free) | Qwen | 71.8% |
| 45 | GPT-4o-mini | OpenAI | 71.7% |
| 46 | GPT-5.1 | OpenAI | 71.3% |
| 47 | GPT-4o (2024-11-20) | OpenAI | 70.0% |
| 48 | GPT-5.1-Codex-Mini | OpenAI | 69.9% |
| 49 | Qwen3.5-122B-A10B (MED) | Qwen | 69.4% |
| 50 | Palmyra X5 | Writer | 69.4% |
About This Benchmark
Evaluation Method
Models are tested against DAX tasks of varying complexity using the Contoso sample dataset. Responses are evaluated for syntax correctness and output accuracy.
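To illustrate the kind of task and response being scored, here is a hypothetical prompt with a DAX measure of the sort a model might return. The table and column names (Sales[Quantity], Sales[Net Price], 'Product'[Color]) follow common Contoso-style conventions and are assumptions, not the benchmark's actual task set or schema.

```dax
-- Hypothetical task: "Write a measure for total sales of red products."
-- Table and column names are assumed Contoso-style names, not the real task schema.
Red Sales :=
CALCULATE (
    SUMX ( Sales, Sales[Quantity] * Sales[Net Price] ),  -- row-by-row amount, then summed
    'Product'[Color] = "Red"                              -- filter argument restricts product color
)
```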
Scoring System
Harder tasks are worth more points. Correct solutions also earn bonus points for following DAX best practices, writing efficient code, and producing clear, readable output.
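As a sketch of what "best practices" and readability can mean in DAX (one common reading, not the benchmark's published rubric), a measure can use variables for intermediate results and DIVIDE instead of the division operator, avoiding repeated expressions and division-by-zero errors:

```dax
-- Illustrative only: variables plus DIVIDE as a readability/robustness pattern.
-- Measure and column names are assumptions.
Profit Margin :=
VAR TotalSales = SUM ( Sales[SalesAmount] )
VAR TotalCost  = SUM ( Sales[TotalCost] )
RETURN
    DIVIDE ( TotalSales - TotalCost, TotalSales )  -- returns BLANK when TotalSales is 0
```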
Task Categories
Tasks cover aggregation, time intelligence, filtering, calculations, iterators, and context transitions across basic, intermediate, and advanced levels.
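For readers unfamiliar with these category names, the sketches below show a time-intelligence measure and an iterator that relies on context transition. Table and column names ('Date'[Date], Customer, Sales[SalesAmount]) are assumed Contoso-style names; these are generic examples, not actual benchmark tasks.

```dax
-- Time intelligence: year-to-date sales over an assumed 'Date' table.
Sales YTD :=
CALCULATE (
    SUM ( Sales[SalesAmount] ),
    DATESYTD ( 'Date'[Date] )
)

-- Iterator + context transition: CALCULATE turns each Customer row into
-- a filter context, giving per-customer sales before averaging.
Avg Sales per Customer :=
AVERAGEX (
    Customer,
    CALCULATE ( SUM ( Sales[SalesAmount] ) )
)
```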