DAX LLM Benchmark
Which LLM is best at DAX?
DAXBench tests how models understand, write, and reason about DAX and Power BI.
Methodology designed by Maxim Anatsko.
Last updated: Dec 6, 2025
33 models · 30 tasks · Initial Release
Model Leaderboard
Ranked by score
| # | Model | Provider | Score | Accuracy | Syntax | Tasks |
|---|---|---|---|---|---|---|
| 1 | Gemini 2.5 Flash Preview 09-2025 | Google | 79.7% | 76.7% | 100.0% | 23/30 |
| 2 | DeepSeek V3.2 | DeepSeek | 79.4% | 80.0% | 100.0% | 24/30 |
| 3 | Claude Opus 4.5 | Anthropic | 78.7% | 76.7% | 100.0% | 23/30 |
| 4 | gpt-oss-120b | OpenAI | 78.3% | 76.7% | 96.7% | 23/30 |
| 5 | Claude Sonnet 4.5 | Anthropic | 77.5% | 76.7% | 100.0% | 23/30 |
| 6 | Gemini 2.0 Flash | Google | 76.2% | 73.3% | 100.0% | 22/30 |
| 7 | GPT-4o-mini (2024-07-18) | OpenAI | 73.6% | 73.3% | 100.0% | 22/30 |
| 8 | Gemini 2.5 Flash | Google | 72.0% | 70.0% | 100.0% | 21/30 |
| 9 | DeepSeek V3 0324 | DeepSeek | 70.2% | 70.0% | 100.0% | 21/30 |
| 10 | Grok Code Fast 1 | xAI | 69.7% | 66.7% | 100.0% | 20/30 |
| 11 | Mistral Large 3 2512 | Mistral | 69.0% | 66.7% | 100.0% | 20/30 |
| 12 | Gemini 2.5 Flash Lite Preview 09-2025 | Google | 66.6% | 63.3% | 100.0% | 19/30 |
| 13 | GPT-5.1-Codex-Mini | OpenAI | 65.3% | 66.7% | 93.3% | 20/30 |
| 14 | Gemini 2.5 Pro | Google | 64.4% | 63.3% | 80.0% | 19/30 |
| 15 | GPT-5.1 | OpenAI | 64.2% | 60.0% | 96.7% | 18/30 |
| 16 | DeepSeek R1T2 Chimera (free) | TNG | 63.3% | 63.3% | 80.0% | 19/30 |
| 17 | MiniMax M2 | MiniMax | 61.4% | 63.3% | 73.3% | 19/30 |
| 18 | GPT-3.5 Turbo | OpenAI | 61.3% | 60.0% | 100.0% | 18/30 |
| 19 | GPT-4.1 Mini | OpenAI | 60.7% | 60.0% | 100.0% | 18/30 |
| 20 | Kimi K2 Thinking | MoonshotAI | 60.7% | 63.3% | 73.3% | 19/30 |
| 21 | Qwen3 Coder 480B A35B (free) | Qwen | 59.7% | 60.0% | 90.0% | 18/30 |
| 22 | Gemini 2.0 Flash Lite | Google | 59.0% | 53.3% | 100.0% | 16/30 |
| 23 | Gemini 3 Pro Preview | Google | 58.5% | 60.0% | 70.0% | 18/30 |
| 24 | GPT-5 Mini | OpenAI | 57.1% | 56.7% | 100.0% | 17/30 |
| 25 | Claude Haiku 4.5 | Anthropic | 57.0% | 56.7% | 100.0% | 17/30 |
| 26 | GPT-5 Nano | OpenAI | 53.7% | 53.3% | 83.3% | 16/30 |
| 27 | gpt-oss-20b (free) | OpenAI | 53.5% | 53.3% | 90.0% | 16/30 |
| 28 | Gemini 2.0 Flash Experimental (free) | Google | 51.3% | 46.7% | 63.3% | 14/30 |
| 29 | Gemma 3 27B (free) | Google | 46.6% | 43.3% | 100.0% | 13/30 |
| 30 | Nova 2 Lite (free) | Amazon | 46.2% | 46.7% | 93.3% | 14/30 |
| 31 | GLM 4.5 Air (free) | Z.AI | 45.4% | 46.7% | 66.7% | 14/30 |
| 32 | DeepSeek V3.2 Speciale | DeepSeek | 38.7% | 43.3% | 53.3% | 13/30 |
| 33 | Phi 4 | Microsoft | 20.2% | 16.7% | 56.7% | 5/30 |
About This Benchmark
Evaluation Method
Models are tested against DAX tasks of varying complexity over the Contoso sample dataset. Each response is evaluated on two axes: syntax correctness (does the DAX parse and execute?) and output accuracy (does it return the expected result?).
Scoring System
Harder tasks are worth more points. Correct solutions also earn bonus points for following DAX best practices, writing efficient code, and producing clear, readable output.
Task Categories
Tasks cover aggregation, time intelligence, filtering, calculations, iterators, and context transitions across basic, intermediate, and advanced levels.
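To give a sense of what these categories look like in practice, here is an illustrative sketch of an intermediate-level time-intelligence measure of the kind such a benchmark might pose. This is not an actual benchmark task, and the table and column names (`Sales[Amount]`, `'Date'[Date]`) are assumed for illustration:

```dax
-- Illustrative sketch only: prior-year sales via time intelligence.
-- Table/column names are assumed, not taken from the benchmark itself.
Sales PY =
CALCULATE (
    SUM ( Sales[Amount] ),              -- base aggregation
    SAMEPERIODLASTYEAR ( 'Date'[Date] ) -- shift the filter context back one year
)
```

A task like this exercises several categories at once: aggregation (`SUM`), time intelligence (`SAMEPERIODLASTYEAR`), and the context transition that `CALCULATE` performs on the date filter.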