DAX LLM Benchmark

Which LLM is best at DAX?
DAXBench tests how models understand, write, and reason about DAX and Power BI.
Methodology designed by Maxim Anatsko.

Last updated: Dec 6, 2025

33 models · 30 tasks · Initial Release

Model Leaderboard

Ranked by score

| Rank | Model | Provider | Score | Accuracy | Syntax | Tasks |
|---|---|---|---|---|---|---|
| 1 | Gemini 2.5 Flash Preview 09-2025 | Google | 79.7% | 76.7% | 100.0% | 23/30 |
| 2 | DeepSeek V3.2 | DeepSeek | 79.4% | 80.0% | 100.0% | 24/30 |
| 3 | Claude Opus 4.5 | Anthropic | 78.7% | 76.7% | 100.0% | 23/30 |
| 4 | gpt-oss-120b | OpenAI | 78.3% | 76.7% | 96.7% | 23/30 |
| 5 | Claude Sonnet 4.5 | Anthropic | 77.5% | 76.7% | 100.0% | 23/30 |
| 6 | Gemini 2.0 Flash | Google | 76.2% | 73.3% | 100.0% | 22/30 |
| 7 | GPT-4o-mini (2024-07-18) | OpenAI | 73.6% | 73.3% | 100.0% | 22/30 |
| 8 | Gemini 2.5 Flash | Google | 72.0% | 70.0% | 100.0% | 21/30 |
| 9 | DeepSeek V3 0324 | DeepSeek | 70.2% | 70.0% | 100.0% | 21/30 |
| 10 | Grok Code Fast 1 | xAI | 69.7% | 66.7% | 100.0% | 20/30 |
| 11 | Mistral Large 3 2512 | Mistral | 69.0% | 66.7% | 100.0% | 20/30 |
| 12 | Gemini 2.5 Flash Lite Preview 09-2025 | Google | 66.6% | 63.3% | 100.0% | 19/30 |
| 13 | GPT-5.1-Codex-Mini | OpenAI | 65.3% | 66.7% | 93.3% | 20/30 |
| 14 | Gemini 2.5 Pro | Google | 64.4% | 63.3% | 80.0% | 19/30 |
| 15 | GPT-5.1 | OpenAI | 64.2% | 60.0% | 96.7% | 18/30 |
| 16 | DeepSeek R1T2 Chimera (free) | TNG | 63.3% | 63.3% | 80.0% | 19/30 |
| 17 | MiniMax M2 | MiniMax | 61.4% | 63.3% | 73.3% | 19/30 |
| 18 | GPT-3.5 Turbo | OpenAI | 61.3% | 60.0% | 100.0% | 18/30 |
| 19 | GPT-4.1 Mini | OpenAI | 60.7% | 60.0% | 100.0% | 18/30 |
| 20 | Kimi K2 Thinking | MoonshotAI | 60.7% | 63.3% | 73.3% | 19/30 |
| 21 | Qwen3 Coder 480B A35B (free) | Qwen | 59.7% | 60.0% | 90.0% | 18/30 |
| 22 | Gemini 2.0 Flash Lite | Google | 59.0% | 53.3% | 100.0% | 16/30 |
| 23 | Gemini 3 Pro Preview | Google | 58.5% | 60.0% | 70.0% | 18/30 |
| 24 | GPT-5 Mini | OpenAI | 57.1% | 56.7% | 100.0% | 17/30 |
| 25 | Claude Haiku 4.5 | Anthropic | 57.0% | 56.7% | 100.0% | 17/30 |
| 26 | GPT-5 Nano | OpenAI | 53.7% | 53.3% | 83.3% | 16/30 |
| 27 | gpt-oss-20b (free) | OpenAI | 53.5% | 53.3% | 90.0% | 16/30 |
| 28 | Gemini 2.0 Flash Experimental (free) | Google | 51.3% | 46.7% | 63.3% | 14/30 |
| 29 | Gemma 3 27B (free) | Google | 46.6% | 43.3% | 100.0% | 13/30 |
| 30 | Nova 2 Lite (free) | Amazon | 46.2% | 46.7% | 93.3% | 14/30 |
| 31 | GLM 4.5 Air (free) | Z.AI | 45.4% | 46.7% | 66.7% | 14/30 |
| 32 | DeepSeek V3.2 Speciale | DeepSeek | 38.7% | 43.3% | 53.3% | 13/30 |
| 33 | Phi 4 | Microsoft | 20.2% | 16.7% | 56.7% | 5/30 |

About This Benchmark

Evaluation Method

Models are tested against DAX tasks of varying complexity using the Contoso sample dataset. Responses are evaluated for syntax correctness and output accuracy.
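For illustration only, a basic-level task in this style might ask a model to write a total sales measure over a Contoso-style model. The sketch below is hypothetical: the Sales table and its Quantity and Net Price columns are assumptions based on common Contoso schemas, not the benchmark's actual task set.

```dax
-- Hypothetical basic-level task: total sales amount.
-- Sales[Quantity] and Sales[Net Price] are assumed Contoso-style columns,
-- not columns taken from the benchmark's actual tasks.
Total Sales Amount =
SUMX ( Sales, Sales[Quantity] * Sales[Net Price] )
```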

Scoring System

Harder tasks are worth more points. Correct solutions also earn bonus points for following DAX best practices, writing efficient code, and producing clear, readable output.

Task Categories

Tasks cover aggregation, time intelligence, filtering, calculations, iterators, and context transitions across basic, intermediate, and advanced levels.
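As a rough sketch of what the time-intelligence and context-transition categories involve, the hypothetical measures below build on the assumed Contoso-style model above; the 'Date'[Date] column and Customer table are likewise assumptions, not tasks taken from the benchmark.

```dax
-- Hypothetical time-intelligence task: sales for the same period last year.
Sales PY =
CALCULATE ( [Total Sales Amount], SAMEPERIODLASTYEAR ( 'Date'[Date] ) )

-- Hypothetical context-transition task: average sales per customer.
-- CALCULATE turns each Customer row iterated by AVERAGEX into an
-- equivalent filter context, so SUMX only sees that customer's sales.
Avg Sales per Customer =
AVERAGEX (
    Customer,
    CALCULATE ( SUMX ( Sales, Sales[Quantity] * Sales[Net Price] ) )
)
```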