The first benchmark for real-world business AI performance.
70 tasks. 7 categories. 8 models. Zero BS.
Overall MASE Score (Quality × Consistency)
| Rank | Model | Provider | MASE Score | Best For |
|---|---|---|---|---|
| 1 | Claude Opus 4.6 | Anthropic | 8.72 | Complex reasoning, agentic tasks |
| 2 | GPT-5.2 | OpenAI | 8.64 | Balanced performance, speed |
| 3 | GPT-5.2 Pro | OpenAI | 8.58 | Maximum precision tasks |
| 4 | Gemini 3 Pro | Google | 8.31 | Multimodal, document processing |
| 5 | Claude Sonnet 4.5 | Anthropic | 8.19 | Best value for quality |
| 6 | Llama 4 Maverick | Meta | 7.84 | Self-hosted, cost-sensitive |
| 7 | Mistral Large 3 | Mistral | 7.71 | European deployment |
| 8 | Gemini 3 Flash | Google | 7.42 | High-volume, speed-critical |
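If you want to sanity-check the composite yourself, here is a minimal sketch of a quality-times-consistency score. The 0–10 scale and the rescaling are assumptions for illustration; this page only states that MASE combines the two factors, not the exact aggregation.

```python
# Minimal sketch of a Quality x Consistency composite.
# ASSUMPTION: quality and consistency are each scored on a 0-10 scale;
# this is an illustration, not the official MASE formula.

def mase_score(quality: float, consistency: float) -> float:
    """Combine quality and consistency into a single 0-10 composite.

    The product of the two normalized scores is rescaled so that a
    perfect 10/10 pair maps back to 10.
    """
    return (quality / 10) * (consistency / 10) * 10

# Example: 9.1 quality with 9.6 consistency yields a composite of 8.74
print(round(mase_score(9.1, 9.6), 2))  # 8.74
```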
How each model performs across 7 business task categories
MASE-Cost Score = Quality per dollar spent (higher = better value)
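As a rough illustration of the value calculation, the sketch below divides each model's MASE score by a blended price. The per-million-token prices are hypothetical placeholders, not quoted rates.

```python
# Sketch of the MASE-Cost ranking: quality per dollar spent.
# MASE scores come from the table above; the blended prices per
# million tokens are HYPOTHETICAL placeholders for illustration.

leaderboard = {
    "Claude Opus 4.6":   {"mase": 8.72, "price_per_mtok": 30.00},
    "Claude Sonnet 4.5": {"mase": 8.19, "price_per_mtok": 6.00},
    "Gemini 3 Flash":    {"mase": 7.42, "price_per_mtok": 0.50},
}

# MASE-Cost = MASE score per dollar (higher = better value)
for model, row in sorted(
    leaderboard.items(),
    key=lambda kv: kv[1]["mase"] / kv[1]["price_per_mtok"],
    reverse=True,
):
    value = row["mase"] / row["price_per_mtok"]
    print(f"{model}: MASE-Cost {value:.2f}")
```

With these placeholder prices, the efficiency-tier models come out well ahead on value, which is exactly the pattern described below.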
Claude Opus 4.6 leads on raw quality, but Claude Sonnet 4.5 and Gemini 3 Flash offer the best return on investment for most business applications.
The gap between the frontier and efficiency tiers has narrowed dramatically: you're no longer paying 10x the price for results that are only 10% better.
We can analyze your specific use cases and recommend the right model mix for your business.
Talk to Us