DAX LLM Benchmark

Which LLM is best at DAX?
DAXBench tests how models understand, write, and reason about DAX and Power BI.
Methodology designed by Maxim Anatsko.

Last updated: Dec 6, 2025

33 models · 30 tasks · Initial Release

Model Leaderboard

Ranked by score

| Rank | Model | Provider | Score | Accuracy | Syntax | Tasks |
|---|---|---|---|---|---|---|
| 1 | Gemini 2.5 Flash Preview 09-2025 | Google | 79.7% | 76.7% | 100.0% | 23/30 |
| 2 | DeepSeek V3.2 | DeepSeek | 79.4% | 80.0% | 100.0% | 24/30 |
| 3 | Claude Opus 4.5 | Anthropic | 78.7% | 76.7% | 100.0% | 23/30 |
| 4 | gpt-oss-120b | OpenAI | 78.3% | 76.7% | 96.7% | 23/30 |
| 5 | Claude Sonnet 4.5 | Anthropic | 77.5% | 76.7% | 100.0% | 23/30 |
| 6 | Gemini 2.0 Flash | Google | 76.2% | 73.3% | 100.0% | 22/30 |
| 7 | GPT-4o-mini (2024-07-18) | OpenAI | 73.6% | 73.3% | 100.0% | 22/30 |
| 8 | Gemini 2.5 Flash | Google | 72.0% | 70.0% | 100.0% | 21/30 |
| 9 | DeepSeek V3 0324 | DeepSeek | 70.2% | 70.0% | 100.0% | 21/30 |
| 10 | Grok Code Fast 1 | xAI | 69.7% | 66.7% | 100.0% | 20/30 |
| 11 | Mistral Large 3 2512 | Mistral | 69.0% | 66.7% | 100.0% | 20/30 |
| 12 | Gemini 2.5 Flash Lite Preview 09-2025 | Google | 66.6% | 63.3% | 100.0% | 19/30 |
| 13 | GPT-5.1-Codex-Mini | OpenAI | 65.3% | 66.7% | 93.3% | 20/30 |
| 14 | Gemini 2.5 Pro | Google | 64.4% | 63.3% | 80.0% | 19/30 |
| 15 | GPT-5.1 | OpenAI | 64.2% | 60.0% | 96.7% | 18/30 |
| 16 | DeepSeek R1T2 Chimera (free) | TNG | 63.3% | 63.3% | 80.0% | 19/30 |
| 17 | MiniMax M2 | MiniMax | 61.4% | 63.3% | 73.3% | 19/30 |
| 18 | GPT-3.5 Turbo | OpenAI | 61.3% | 60.0% | 100.0% | 18/30 |
| 19 | GPT-4.1 Mini | OpenAI | 60.7% | 60.0% | 100.0% | 18/30 |
| 20 | Kimi K2 Thinking | MoonshotAI | 60.7% | 63.3% | 73.3% | 19/30 |
| 21 | Qwen3 Coder 480B A35B (free) | Qwen | 59.7% | 60.0% | 90.0% | 18/30 |
| 22 | Gemini 2.0 Flash Lite | Google | 59.0% | 53.3% | 100.0% | 16/30 |
| 23 | Gemini 3 Pro Preview | Google | 58.5% | 60.0% | 70.0% | 18/30 |
| 24 | GPT-5 Mini | OpenAI | 57.1% | 56.7% | 100.0% | 17/30 |
| 25 | Claude Haiku 4.5 | Anthropic | 57.0% | 56.7% | 100.0% | 17/30 |
| 26 | GPT-5 Nano | OpenAI | 53.7% | 53.3% | 83.3% | 16/30 |
| 27 | gpt-oss-20b (free) | OpenAI | 53.5% | 53.3% | 90.0% | 16/30 |
| 28 | Gemini 2.0 Flash Experimental (free) | Google | 51.3% | 46.7% | 63.3% | 14/30 |
| 29 | Gemma 3 27B (free) | Google | 46.6% | 43.3% | 100.0% | 13/30 |
| 30 | Nova 2 Lite (free) | Amazon | 46.2% | 46.7% | 93.3% | 14/30 |
| 31 | GLM 4.5 Air (free) | Z.AI | 45.4% | 46.7% | 66.7% | 14/30 |
| 32 | DeepSeek V3.2 Speciale | DeepSeek | 38.7% | 43.3% | 53.3% | 13/30 |
| 33 | Phi 4 | Microsoft | 20.2% | 16.7% | 56.7% | 5/30 |

About This Benchmark

Evaluation Method

Models are tested against DAX tasks of varying complexity using the Contoso sample dataset. Responses are evaluated for syntax correctness and output accuracy.
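For illustration only, a basic-level task in this style might ask a model to write a total sales measure over a Contoso-style model. The sketch below is hypothetical: the Sales table and its Quantity and Net Price columns are assumptions based on common Contoso schemas, not the benchmark's actual task set.

```dax
-- Hypothetical basic-level task: total sales amount.
-- Sales[Quantity] and Sales[Net Price] are assumed Contoso-style columns,
-- not columns taken from the benchmark's actual tasks.
Total Sales Amount =
SUMX ( Sales, Sales[Quantity] * Sales[Net Price] )
```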

Scoring System

Harder tasks are worth more points. Correct solutions also earn bonus points for following DAX best practices, writing efficient code, and producing clear, readable output.

Task Categories

Tasks cover aggregation, time intelligence, filtering, calculations, iterators, and context transitions across basic, intermediate, and advanced levels.
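As a rough sketch of what the time-intelligence and context-transition categories involve, the hypothetical measures below build on the assumed Contoso-style model above; the 'Date'[Date] column and Customer table are likewise assumptions, not tasks taken from the benchmark.

```dax
-- Hypothetical time-intelligence task: sales for the same period last year.
Sales PY =
CALCULATE ( [Total Sales Amount], SAMEPERIODLASTYEAR ( 'Date'[Date] ) )

-- Hypothetical context-transition task: average sales per customer.
-- CALCULATE turns each Customer row iterated by AVERAGEX into an
-- equivalent filter context, so SUMX only sees that customer's sales.
Avg Sales per Customer =
AVERAGEX (
    Customer,
    CALCULATE ( SUMX ( Sales, Sales[Quantity] * Sales[Net Price] ) )
)
```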