DAX LLM Benchmark

Which LLM is best at DAX?
DAXBench tests how models understand, write, and reason about DAX and Power BI.
Methodology designed by Maxim Anatsko.

Last updated: Jan 22, 2026

70 models · 30 tasks · Initial Release

Model Leaderboard

Ranked by score (top 50 shown)

Rank · Model · Provider · Score
1 · gpt-oss-120b · OpenAI · 85.3%
2 · GPT-5.1-Codex-Max · OpenAI · 84.7%
3 · Gemini 2.5 Flash Preview 09-2025 · Google · 83.5%
4 · Claude Opus 4.5 · Anthropic · 82.5%
5 · Gemini 3 Flash Preview · Google · 80.6%
6 · o3 · OpenAI · 80.0%
7 · GPT-5.2 Chat · OpenAI · 79.9%
8 · DeepSeek V3.2 · DeepSeek · 79.4%
9 · GPT-5.2 · OpenAI · 78.2%
10 · R1 · DeepSeek · 78.2%
11 · Gemini 3 Pro Preview · Google · 77.9%
12 · Claude Sonnet 4 · Anthropic · 77.6%
13 · Claude Sonnet 4.5 · Anthropic · 77.5%
14 · Grok 4 · xAI · 76.4%
15 · Gemini 2.0 Flash Experimental (free) · Google · 76.3%
16 · o4 Mini · OpenAI · 76.3%
17 · Gemini 2.0 Flash · Google · 76.2%
18 · Gemini 2.5 Flash · Google · 75.8%
19 · DeepSeek V3.1 · DeepSeek · 75.6%
20 · DeepSeek V3.2 Speciale · DeepSeek · 74.5%
21 · Kimi K2 Thinking · MoonshotAI · 74.3%
22 · Llama 4 Maverick · Meta · 74.1%
23 · GPT-4o-mini (2024-07-18) · OpenAI · 73.6%
24 · R1 0528 · DeepSeek · 73.5%
25 · DeepSeek V3 0324 · DeepSeek · 73.3%
26 · Nova Premier 1.0 · Amazon · 73.2%
27 · KAT-Coder-Pro V1 (free) · Kwaipilot · 72.6%
28 · DeepSeek V3.1 Nex N1 (free) · Nex AGI · 71.7%
29 · DeepSeek R1T2 Chimera (free) · TNG · 71.6%
30 · Qwen3 Coder 480B A35B (free) · Qwen · 71.6%
31 · GPT-4o-mini · OpenAI · 71.4%
32 · GPT-5.1 · OpenAI · 71.1%
33 · GPT-4o (2024-11-20) · OpenAI · 69.8%
34 · Grok Code Fast 1 · xAI · 69.7%
35 · GPT-5.1-Codex-Mini · OpenAI · 69.6%
36 · Sonar Reasoning Pro · Perplexity · 69.2%
37 · Palmyra X5 · Writer · 69.1%
38 · Gemini 2.5 Pro · Google · 68.2%
39 · MiniMax M2 · MiniMax · 68.1%
40 · GLM 4.7 · Z.AI · 67.0%
41 · Gemini 2.5 Flash Lite Preview 09-2025 · Google · 66.6%
42 · Mistral Large 3 2512 · Mistral · 65.7%
43 · Nemotron 3 Nano 30b A3b (free) · Nvidia · 65.6%
44 · GPT-5 Nano · OpenAI · 62.1%
45 · GPT-3.5 Turbo · OpenAI · 61.3%
46 · GPT-4.1 Mini · OpenAI · 60.7%
47 · Devstral 2 2512 (free) · Mistral · 60.7%
48 · Gemini 2.0 Flash Lite · Google · 59.0%
49 · GLM 4.7 Flash · Z.AI · 57.9%
50 · GPT-5 Mini · OpenAI · 57.1%

About This Benchmark

Evaluation Method

Models are tested against DAX tasks of varying complexity using the Contoso sample dataset. Responses are evaluated for syntax correctness and output accuracy.
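The two checks described above can be sketched in Python. This is a minimal, hypothetical harness, not the actual DAXBench code: `check_syntax` is a toy stand-in (the real benchmark presumably validates queries against the Contoso model itself), and all names are illustrative.

```python
from dataclasses import dataclass

@dataclass
class TaskResult:
    syntax_ok: bool   # did the DAX expression parse/validate?
    output_ok: bool   # did the query return the expected result?

def check_syntax(dax: str) -> bool:
    # Toy validity check: non-empty expression with balanced parentheses.
    # A real harness would hand the query to a DAX engine instead.
    depth = 0
    for ch in dax:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0 and bool(dax.strip())

def evaluate(dax: str, actual_rows: list, expected_rows: list) -> TaskResult:
    # Output accuracy: returned rows must match the expected rows,
    # ignoring row order.
    syntax_ok = check_syntax(dax)
    output_ok = syntax_ok and sorted(actual_rows) == sorted(expected_rows)
    return TaskResult(syntax_ok, output_ok)
```

For example, `evaluate("SUM(Sales[Amount])", [(100,)], [(100,)])` passes both checks, while an expression with an unclosed parenthesis fails the syntax check outright.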

Scoring System

Harder tasks are worth more points. Correct solutions also earn bonus points for following DAX best practices, writing efficient code, and producing clear, readable output.
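A scoring rule of this shape can be sketched as follows. The point values and bonus weights here are invented for illustration; only the structure (difficulty-weighted base points, bonuses gated on correctness) reflects the description above.

```python
def task_score(difficulty_points: int, correct: bool,
               best_practices: bool, efficient: bool, readable: bool) -> int:
    # Incorrect solutions earn nothing, including bonuses.
    if not correct:
        return 0
    # Harder tasks carry a larger base value.
    score = difficulty_points
    # Bonuses for quality of a correct solution (weights are illustrative).
    if best_practices:
        score += 2
    if efficient:
        score += 2
    if readable:
        score += 1
    return score
```

Under these made-up weights, a correct solution to a 10-point task that follows best practices and is readable would score 13, while an incorrect attempt at the same task scores 0 regardless of its style.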

Task Categories

Tasks cover aggregation, time intelligence, filtering, calculations, iterators, and context transitions across basic, intermediate, and advanced levels.
