DAX LLM Benchmark

Which LLM is best at DAX?
DAXBench tests how models understand, write, and reason about DAX and Power BI.
Methodology designed by Maxim Anatsko.

Last updated: Mar 6, 2026

91 models · 30 tasks · Initial Release

Model Leaderboard

Ranked by score

Rank  Model                                 Provider     Score
1     Gemini 3.1 Flash Lite Preview [HIGH]  Google       97.4%
2     GPT-5.3 Chat                          OpenAI       96.2%
3     GLM 5                                 Z.AI         96.2%
4     Gemini 3.1 Pro Preview [HIGH]         Google       93.8%
5     Qwen3.5-Flash [MED]                   Qwen         93.2%
6     Qwen3.5 397B A17B                     Qwen         90.3%
7     Qwen3.5 Plus 2026-02-15 [MED]         Qwen         89.7%
8     GPT-5.3-Codex [HIGH]                  OpenAI       88.6%
9     gpt-oss-120b                          OpenAI       85.6%
10    GPT-5.1-Codex-Max                     OpenAI       85.0%
11    Claude Sonnet 4.6 [MED]               Anthropic    84.5%
12    Gemini 3 Flash Preview                Google       83.8%
13    Gemini 2.5 Flash Preview 09-2025      Google       83.8%
14    GPT-5.4 [HIGH]                        OpenAI       83.2%
15    Claude Opus 4.5                       Anthropic    82.7%
16    Claude Opus 4.6                       Anthropic    82.0%
17    R1                                    DeepSeek     81.3%
18    Claude Sonnet 4                       Anthropic    81.3%
19    Gemini 3 Pro Preview                  Google       81.3%
20    o3                                    OpenAI       80.2%
21    GPT-5.2 Chat                          OpenAI       80.1%
22    DeepSeek V3.2                         DeepSeek     79.9%
23    GPT-5.2                               OpenAI       78.4%
24    Kimi K2 Thinking                      Moonshot AI  78.4%
25    Aurora Alpha                          Openrouter   78.2%
26    Claude Sonnet 4.5                     Anthropic    77.9%
27    Grok 4                                xAI          76.7%
28    Gemini 2.0 Flash                      Google       76.6%
29    Gemini 2.0 Flash Experimental (free)  Google       76.6%
30    o4 Mini                               OpenAI       76.5%
31    Gemini 2.5 Flash                      Google       76.0%
32    DeepSeek V3.1                         DeepSeek     75.9%
33    DeepSeek V3.2 Speciale                DeepSeek     74.7%
34    Llama 4 Maverick                      Meta         74.4%
35    GPT-4o-mini (2024-07-18)              OpenAI       74.1%
36    R1 0528                               DeepSeek     73.7%
37    Grok Code Fast 1                      xAI          73.7%
38    Nova Premier 1.0                      Amazon       73.5%
39    DeepSeek V3 0324                      DeepSeek     73.5%
40    KAT-Coder-Pro V1 (free)               Kwaipilot    72.8%
41    Mercury 2 [HIGH]                      Inception    71.9%
42    DeepSeek R1T2 Chimera (free)          TNG          71.8%
43    DeepSeek V3.1 Nex N1 (free)           Nex AGI      71.8%
44    Qwen3 Coder 480B A35B (free)          Qwen         71.8%
45    GPT-4o-mini                           OpenAI       71.7%
46    GPT-5.1                               OpenAI       71.3%
47    GPT-4o (2024-11-20)                   OpenAI       70.0%
48    GPT-5.1-Codex-Mini                    OpenAI       69.9%
49    Qwen3.5-122B-A10B [MED]               Qwen         69.4%
50    Palmyra X5                            Writer       69.4%

About This Benchmark

Evaluation Method

Models are tested against DAX tasks of varying complexity using the Contoso sample dataset. Responses are evaluated for syntax correctness and output accuracy.
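
As a rough illustration of what an output-accuracy check could look like, the sketch below compares a model's query result against an expected result with a small numeric tolerance. The function name `outputs_match`, the dictionary shape (group label mapped to a numeric value), and the tolerance are all assumptions made for this example; the benchmark's actual comparison logic is not described here.

```python
import math


def outputs_match(expected, actual, rel_tol=1e-6):
    """Return True if both results have the same group labels and
    numerically close values. Hypothetical helper, not DAXBench's code."""
    # The set of group labels (e.g. years, categories) must be identical.
    if expected.keys() != actual.keys():
        return False
    # Values are compared with a relative tolerance to absorb float noise.
    return all(
        math.isclose(expected[label], actual[label], rel_tol=rel_tol)
        for label in expected
    )


# Example: sales by year returned by a reference measure vs. a model's measure.
reference = {"2023": 1250.0, "2024": 1810.5}
candidate = {"2023": 1250.0, "2024": 1810.5000001}
print(outputs_match(reference, candidate))
```

A tolerance-based comparison of this kind avoids penalizing a correct measure for floating-point rounding, while still rejecting results with missing or extra groups.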

Scoring System

Harder tasks are worth more points. Correct solutions also earn bonus points for following DAX best practices, writing efficient code, and producing clear, readable output.
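
The scoring idea above can be sketched as a weighted percentage. The point values per difficulty level, the size of each bonus, and the names used here (`DIFFICULTY_POINTS`, `BONUS_PER_CRITERION`, `TaskResult`, `score_percent`) are hypothetical, chosen only to show how difficulty weighting and correctness-gated bonuses might combine; the benchmark's real weights are not published in this section.

```python
from dataclasses import dataclass

# Hypothetical point values: harder tasks are worth more.
DIFFICULTY_POINTS = {"basic": 1.0, "intermediate": 2.0, "advanced": 3.0}
# Hypothetical bonus awarded per quality criterion met.
BONUS_PER_CRITERION = 0.25
BONUS_CRITERIA = ("best_practices", "efficiency", "readability")


@dataclass
class TaskResult:
    difficulty: str                  # "basic", "intermediate", or "advanced"
    correct: bool                    # did the solution produce the right output?
    bonuses: frozenset = frozenset() # quality criteria the solution met


def max_points(difficulty):
    """Most points a task can yield: base points plus every bonus."""
    return DIFFICULTY_POINTS[difficulty] + BONUS_PER_CRITERION * len(BONUS_CRITERIA)


def earned_points(result):
    """Bonuses count only when the solution is correct."""
    if not result.correct:
        return 0.0
    met = [c for c in BONUS_CRITERIA if c in result.bonuses]
    return DIFFICULTY_POINTS[result.difficulty] + BONUS_PER_CRITERION * len(met)


def score_percent(results):
    """Earned points as a percentage of the maximum available."""
    total = sum(max_points(r.difficulty) for r in results)
    earned = sum(earned_points(r) for r in results)
    return 100.0 * earned / total if total else 0.0


results = [
    TaskResult("basic", True, frozenset(BONUS_CRITERIA)),
    TaskResult("advanced", False),
]
print(round(score_percent(results), 1))
```

Under these assumed weights, missing a single advanced task costs more than missing a basic one, which is the behavior the scoring description implies.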

Task Categories

Tasks cover aggregation, time intelligence, filtering, calculations, iterators, and context transitions across basic, intermediate, and advanced levels.
