DAX LLM Benchmark
Which LLM is best at DAX?
DAXBench tests how models understand, write, and reason about DAX and Power BI.
Methodology designed by Maxim Anatsko.
Last updated: Mar 6, 2026
91 models · 30 tasks · Initial Release
Model Leaderboard
Ranked by score
| # | Model | Provider | Score |
|---|---|---|---|
| 1 | Gemini 3.1 Flash Lite Preview (HIGH) | Google | 97.4% |
| 2 | GPT-5.3 Chat | OpenAI | 96.2% |
| 3 | GLM 5 | Z.AI | 96.2% |
| 4 | Gemini 3.1 Pro Preview (HIGH) | Google | 93.8% |
| 5 | Qwen3.5-Flash (MED) | Qwen | 93.2% |
| 6 | Qwen3.5 397B A17B | Qwen | 90.3% |
| 7 | Qwen3.5 Plus 2026-02-15 (MED) | Qwen | 89.7% |
| 8 | GPT-5.3-Codex (HIGH) | OpenAI | 88.6% |
| 9 | gpt-oss-120b | OpenAI | 85.6% |
| 10 | GPT-5.1-Codex-Max | OpenAI | 85.0% |
| 11 | Claude Sonnet 4.6 (MED) | Anthropic | 84.5% |
| 12 | Gemini 3 Flash Preview | Google | 83.8% |
| 13 | Gemini 2.5 Flash Preview 09-2025 | Google | 83.8% |
| 14 | GPT-5.4 (HIGH) | OpenAI | 83.2% |
| 15 | Claude Opus 4.5 | Anthropic | 82.7% |
| 16 | Claude Opus 4.6 | Anthropic | 82.0% |
| 17 | R1 | DeepSeek | 81.3% |
| 18 | Claude Sonnet 4 | Anthropic | 81.3% |
| 19 | Gemini 3 Pro Preview | Google | 81.3% |
| 20 | o3 | OpenAI | 80.2% |
| 21 | GPT-5.2 Chat | OpenAI | 80.1% |
| 22 | DeepSeek V3.2 | DeepSeek | 79.9% |
| 23 | GPT-5.2 | OpenAI | 78.4% |
| 24 | Kimi K2 Thinking | Moonshot AI | 78.4% |
| 25 | Aurora Alpha | Openrouter | 78.2% |
| 26 | Claude Sonnet 4.5 | Anthropic | 77.9% |
| 27 | Grok 4 | xAI | 76.7% |
| 28 | Gemini 2.0 Flash | Google | 76.6% |
| 29 | Gemini 2.0 Flash Experimental (free) | Google | 76.6% |
| 30 | o4 Mini | OpenAI | 76.5% |
| 31 | Gemini 2.5 Flash | Google | 76.0% |
| 32 | DeepSeek V3.1 | DeepSeek | 75.9% |
| 33 | DeepSeek V3.2 Speciale | DeepSeek | 74.7% |
| 34 | Llama 4 Maverick | Meta | 74.4% |
| 35 | GPT-4o-mini (2024-07-18) | OpenAI | 74.1% |
| 36 | R1 0528 | DeepSeek | 73.7% |
| 37 | Grok Code Fast 1 | xAI | 73.7% |
| 38 | Nova Premier 1.0 | Amazon | 73.5% |
| 39 | DeepSeek V3 0324 | DeepSeek | 73.5% |
| 40 | KAT-Coder-Pro V1 (free) | Kwaipilot | 72.8% |
| 41 | Mercury 2 (HIGH) | Inception | 71.9% |
| 42 | DeepSeek R1T2 Chimera (free) | TNG | 71.8% |
| 43 | DeepSeek V3.1 Nex N1 (free) | Nex AGI | 71.8% |
| 44 | Qwen3 Coder 480B A35B (free) | Qwen | 71.8% |
| 45 | GPT-4o-mini | OpenAI | 71.7% |
| 46 | GPT-5.1 | OpenAI | 71.3% |
| 47 | GPT-4o (2024-11-20) | OpenAI | 70.0% |
| 48 | GPT-5.1-Codex-Mini | OpenAI | 69.9% |
| 49 | Qwen3.5-122B-A10B (MED) | Qwen | 69.4% |
| 50 | Palmyra X5 | Writer | 69.4% |
About This Benchmark
Evaluation Method
Models are tested against DAX tasks of varying complexity using the Contoso sample dataset. Responses are evaluated for syntax correctness and output accuracy.
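To illustrate the kind of task and response being scored, here is a hypothetical prompt with a DAX measure of the sort a model might return. The table and column names (Sales[Quantity], Sales[Net Price], 'Product'[Color]) follow common Contoso-style conventions and are assumptions, not the benchmark's actual task set or schema.

```dax
-- Hypothetical task: "Write a measure for total sales of red products."
-- Table and column names are assumed Contoso-style names, not the real task schema.
Red Sales :=
CALCULATE (
    SUMX ( Sales, Sales[Quantity] * Sales[Net Price] ),  -- row-by-row amount, then summed
    'Product'[Color] = "Red"                              -- filter argument restricts product color
)
```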
Scoring System
Harder tasks are worth more points. Correct solutions also earn bonus points for following DAX best practices, writing efficient code, and producing clear, readable output.
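As a sketch of what "best practices" and readability can mean in DAX (one common reading, not the benchmark's published rubric), a measure can use variables for intermediate results and DIVIDE instead of the division operator, avoiding repeated expressions and division-by-zero errors:

```dax
-- Illustrative only: variables plus DIVIDE as a readability/robustness pattern.
-- Measure and column names are assumptions.
Profit Margin :=
VAR TotalSales = SUM ( Sales[SalesAmount] )
VAR TotalCost  = SUM ( Sales[TotalCost] )
RETURN
    DIVIDE ( TotalSales - TotalCost, TotalSales )  -- returns BLANK when TotalSales is 0
```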
Task Categories
Tasks cover aggregation, time intelligence, filtering, calculations, iterators, and context transitions across basic, intermediate, and advanced levels.
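For readers unfamiliar with these category names, the sketches below show a time-intelligence measure and an iterator that relies on context transition. Table and column names ('Date'[Date], Customer, Sales[SalesAmount]) are assumed Contoso-style names; these are generic examples, not actual benchmark tasks.

```dax
-- Time intelligence: year-to-date sales over an assumed 'Date' table.
Sales YTD :=
CALCULATE (
    SUM ( Sales[SalesAmount] ),
    DATESYTD ( 'Date'[Date] )
)

-- Iterator + context transition: CALCULATE turns each Customer row into
-- a filter context, giving per-customer sales before averaging.
Avg Sales per Customer :=
AVERAGEX (
    Customer,
    CALCULATE ( SUM ( Sales[SalesAmount] ) )
)
```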