MMLU-Pro
MMLU-Pro is a general reasoning benchmark published in 2024 as a successor to MMLU. It uses 10-option multiple-choice questions (up from 4), adds more reasoning-focused tasks, and removes leaky or ambiguous items. Contamination risk: medium.
What this benchmark measures
MMLU-Pro measures general reasoning through 10-option multiple-choice questions (up from 4 in MMLU), with more reasoning-focused tasks and with leaky or ambiguous items removed.
It is less saturated than MMLU: frontier models score roughly 70-80%.
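Moving from 4 to 10 options lowers the score a model gets by guessing, which is one reason raw MMLU and MMLU-Pro accuracies are not directly comparable. The sketch below is illustrative only: it assumes every item has the same number of options and that guessing is uniform, and the helper names `chance_accuracy` and `chance_corrected` are hypothetical, not part of any official MMLU-Pro tooling.

```python
def chance_accuracy(n_options: int) -> float:
    """Expected accuracy of uniform random guessing on n-option items."""
    if n_options < 2:
        raise ValueError("a multiple-choice item needs at least 2 options")
    return 1.0 / n_options


def chance_corrected(accuracy: float, n_options: int) -> float:
    """Rescale accuracy so that random guessing maps to 0 and perfect to 1.

    Assumption: all items share the same option count (a simplification).
    """
    baseline = chance_accuracy(n_options)
    return (accuracy - baseline) / (1.0 - baseline)


if __name__ == "__main__":
    # Guessing baseline: 4-option format vs. 10-option format.
    print(f"4 options:  {chance_accuracy(4):.2f}")
    print(f"10 options: {chance_accuracy(10):.2f}")
    # The same raw accuracy means more above-chance signal with 10 options.
    print(f"75% raw, 4 options:  {chance_corrected(0.75, 4):.3f}")
    print(f"75% raw, 10 options: {chance_corrected(0.75, 10):.3f}")
```

Under these assumptions, the guessing floor drops from 25% to 10%, so a given raw score on the 10-option format reflects more signal above chance than the same score on the 4-option format.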
Claimed scores
| Model | Score | Claim type | Reported | Citation |
|---|---|---|---|---|
| gemini-2.5-pro | 86.7% accuracy | press release | 2025-05-20 | Google DeepMind announcement |
Interpretation guidance
Contamination risk: medium
Some test items may leak into training corpora; treat headline scores with mild skepticism and prefer evaluation runs with held-out subsets.
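One way to act on this guidance is to compare accuracy on publicly available items against accuracy on a held-out subset. The following is a minimal sketch under stated assumptions: the `ItemResult` record, its `held_out` flag, and the `contamination_gap` helper are hypothetical, and how a held-out subset is actually constructed is outside the scope of this page.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class ItemResult:
    item_id: str    # hypothetical identifier for a benchmark question
    correct: bool   # whether the model answered this item correctly
    held_out: bool  # True if the item is assumed absent from training data


def accuracy(results: Iterable[ItemResult]) -> float:
    """Fraction of items answered correctly."""
    items: List[ItemResult] = list(results)
    if not items:
        raise ValueError("cannot compute accuracy on an empty set")
    return sum(r.correct for r in items) / len(items)


def contamination_gap(results: Iterable[ItemResult]) -> float:
    """Accuracy on public items minus accuracy on held-out items.

    A large positive gap is a hint (not proof) that public items leaked
    into training data.
    """
    items = list(results)
    public = [r for r in items if not r.held_out]
    held = [r for r in items if r.held_out]
    return accuracy(public) - accuracy(held)


if __name__ == "__main__":
    # Toy data for illustration only; not real evaluation results.
    toy = [
        ItemResult("q1", True, False),
        ItemResult("q2", True, False),
        ItemResult("q3", True, False),
        ItemResult("q4", False, False),
        ItemResult("q5", True, True),
        ItemResult("q6", False, True),
    ]
    print(f"gap: {contamination_gap(toy):.2f}")
```

A gap near zero is consistent with little contamination, but small held-out subsets give noisy estimates, so the gap is best read alongside the subset sizes.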
How to cite this benchmark
Use the primary methodology source for academic citations; reference the Policy Window article for the cross-model leaderboard.
- Primary methodology: https://arxiv.org/abs/2406.01574
- Wiki article: https://policywindow.org/wiki/mmlu-pro
Related benchmarks (general reasoning)
- MMLU · 2020 · high contamination
- GPQA Diamond · 2023 · low contamination
- ARC-AGI v2 · 2024 · low contamination