DAX LLM Benchmark
Which LLM is best at DAX?
DAXBench tests how well models understand, write, and reason about DAX and Power BI.
Methodology designed by Maxim Anatsko.
Last updated: Jan 22, 2026
70 models · 30 tasks · Initial Release
Model Leaderboard
Ranked by score
| Rank | Model | Provider | Score |
|---|---|---|---|
| 1 | gpt-oss-120b | OpenAI | 85.3% |
| 2 | GPT-5.1-Codex-Max | OpenAI | 84.7% |
| 3 | Gemini 2.5 Flash Preview 09-2025 | Google | 83.5% |
| 4 | Claude Opus 4.5 | Anthropic | 82.5% |
| 5 | Gemini 3 Flash Preview | Google | 80.6% |
| 6 | o3 | OpenAI | 80.0% |
| 7 | GPT-5.2 Chat | OpenAI | 79.9% |
| 8 | DeepSeek V3.2 | DeepSeek | 79.4% |
| 9 | GPT-5.2 | OpenAI | 78.2% |
| 10 | R1 | DeepSeek | 78.2% |
| 11 | Gemini 3 Pro Preview | Google | 77.9% |
| 12 | Claude Sonnet 4 | Anthropic | 77.6% |
| 13 | Claude Sonnet 4.5 | Anthropic | 77.5% |
| 14 | Grok 4 | xAI | 76.4% |
| 15 | Gemini 2.0 Flash Experimental (free) | Google | 76.3% |
| 16 | o4 Mini | OpenAI | 76.3% |
| 17 | Gemini 2.0 Flash | Google | 76.2% |
| 18 | Gemini 2.5 Flash | Google | 75.8% |
| 19 | DeepSeek V3.1 | DeepSeek | 75.6% |
| 20 | DeepSeek V3.2 Speciale | DeepSeek | 74.5% |
| 21 | Kimi K2 Thinking | MoonshotAI | 74.3% |
| 22 | Llama 4 Maverick | Meta | 74.1% |
| 23 | GPT-4o-mini (2024-07-18) | OpenAI | 73.6% |
| 24 | R1 0528 | DeepSeek | 73.5% |
| 25 | DeepSeek V3 0324 | DeepSeek | 73.3% |
| 26 | Nova Premier 1.0 | Amazon | 73.2% |
| 27 | KAT-Coder-Pro V1 (free) | Kwaipilot | 72.6% |
| 28 | DeepSeek V3.1 Nex N1 (free) | Nex AGI | 71.7% |
| 29 | DeepSeek R1T2 Chimera (free) | TNG | 71.6% |
| 30 | Qwen3 Coder 480B A35B (free) | Qwen | 71.6% |
| 31 | GPT-4o-mini | OpenAI | 71.4% |
| 32 | GPT-5.1 | OpenAI | 71.1% |
| 33 | GPT-4o (2024-11-20) | OpenAI | 69.8% |
| 34 | Grok Code Fast 1 | xAI | 69.7% |
| 35 | GPT-5.1-Codex-Mini | OpenAI | 69.6% |
| 36 | Sonar Reasoning Pro | Perplexity | 69.2% |
| 37 | Palmyra X5 | Writer | 69.1% |
| 38 | Gemini 2.5 Pro | Google | 68.2% |
| 39 | MiniMax M2 | MiniMax | 68.1% |
| 40 | GLM 4.7 | Z.AI | 67.0% |
| 41 | Gemini 2.5 Flash Lite Preview 09-2025 | Google | 66.6% |
| 42 | Mistral Large 3 2512 | Mistral | 65.7% |
| 43 | Nemotron 3 Nano 30b A3b (free) | Nvidia | 65.6% |
| 44 | GPT-5 Nano | OpenAI | 62.1% |
| 45 | GPT-3.5 Turbo | OpenAI | 61.3% |
| 46 | GPT-4.1 Mini | OpenAI | 60.7% |
| 47 | Devstral 2 2512 (free) | Mistral | 60.7% |
| 48 | Gemini 2.0 Flash Lite | Google | 59.0% |
| 49 | GLM 4.7 Flash | Z.AI | 57.9% |
| 50 | GPT-5 Mini | OpenAI | 57.1% |
About This Benchmark
Evaluation Method
Models are tested against DAX tasks of varying complexity using the Contoso sample dataset. Responses are evaluated for syntax correctness and output accuracy.
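For illustration, a task at the basic end of the scale might look like the sketch below. This is not one of the benchmark's actual prompts, and the table and column names (Sales[Quantity], Sales[Net Price]) are assumptions about the Contoso schema.

```dax
-- Illustrative only: a basic aggregation task in the style described above.
-- Assumes a Contoso-like Sales table with Quantity and Net Price columns.
DEFINE
    MEASURE Sales[Sales Amount] =
        SUMX ( Sales, Sales[Quantity] * Sales[Net Price] )
EVALUATE
    ROW ( "Sales Amount", [Sales Amount] )
```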
Scoring System
Harder tasks are worth more points. Correct solutions also earn bonus points for following DAX best practices, writing efficient code, and producing clear, readable output.
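As a sketch of what the best-practice bonus rewards (not taken from the benchmark's rubric), the two measures below return the same margin, but the second would score higher: it uses variables for readability and DIVIDE to guard against a zero denominator. The Sales[Sales Amount] and Sales[Total Cost] columns are assumed.

```dax
-- Both measures compute the same margin; column names are assumed.
Margin % (raw) :=
    ( SUM ( Sales[Sales Amount] ) - SUM ( Sales[Total Cost] ) )
        / SUM ( Sales[Sales Amount] )

-- Variables avoid repeating the same aggregation, and DIVIDE handles a
-- zero denominator: the kind of pattern a best-practice bonus rewards.
Margin % (best practice) :=
    VAR SalesAmount = SUM ( Sales[Sales Amount] )
    VAR TotalCost   = SUM ( Sales[Total Cost] )
    RETURN DIVIDE ( SalesAmount - TotalCost, SalesAmount )
```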
Task Categories
Tasks cover aggregation, time intelligence, filtering, calculations, iterators, and context transitions across basic, intermediate, and advanced levels.
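Several of these categories often combine in a single task. The hedged sketch below, assuming a 'Date' table marked as a date table and an existing [Sales Amount] measure, shows an iterator, a context transition, and time intelligence working together in one intermediate-level measure.

```dax
-- Illustrative only: assumes a marked 'Date' table and a [Sales Amount]
-- measure like the one sketched earlier.
Avg Daily Sales YTD :=
    CALCULATE (
        -- AVERAGEX iterates the visible dates; referencing the
        -- [Sales Amount] measure triggers a context transition per date.
        AVERAGEX ( VALUES ( 'Date'[Date] ), [Sales Amount] ),
        -- DATESYTD applies the time-intelligence filter.
        DATESYTD ( 'Date'[Date] )
    )
```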