MMLU
MMLU (Massive Multitask Language Understanding) is a general reasoning benchmark published in 2020. It consists of multiple-choice questions across 57 subjects spanning the humanities, STEM, social sciences, and professional fields such as law. Contamination risk: high.
What this benchmark measures
MMLU tests knowledge and reasoning with multiple-choice questions drawn from 57 subjects: humanities, STEM, social sciences, and professional fields such as law.
The benchmark is saturating: top models score around 92%. Test-set leakage into training corpora is widely documented. MMLU-Pro is the harder successor.
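As a rough illustration of how a multi-subject multiple-choice score can be computed, the sketch below tallies accuracy per subject and then reports both a micro average (over all questions) and a macro average (over subjects). The record layout, the function name `score_multiple_choice`, and the sample predictions are assumptions made for this example; they are not the official evaluation harness or real model output.

```python
from collections import defaultdict


def score_multiple_choice(records):
    """Return (micro_accuracy, macro_accuracy, per_subject_accuracy).

    Each record is a hypothetical (subject, predicted_letter, gold_letter)
    tuple; this layout is an assumption for illustration only.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for subject, predicted, gold in records:
        total[subject] += 1
        # Compare answer letters case-insensitively, ignoring stray spaces.
        if predicted.strip().upper() == gold.strip().upper():
            correct[subject] += 1

    if not total:
        raise ValueError("no records to score")

    per_subject = {s: correct[s] / total[s] for s in total}
    # Micro average: every question counts equally.
    micro = sum(correct.values()) / sum(total.values())
    # Macro average: every subject counts equally.
    macro = sum(per_subject.values()) / len(per_subject)
    return micro, macro, per_subject


# Made-up predictions for two subjects (not real model output).
records = [
    ("college_physics", "A", "A"),
    ("college_physics", "b", "B"),
    ("college_physics", "C", "D"),
    ("college_physics", "D", "D"),
    ("professional_law", "A", "C"),
    ("professional_law", "B", "B"),
]
micro, macro, per_subject = score_multiple_choice(records)
print(f"micro={micro:.3f} macro={macro:.3f}")
```

The two averages diverge when subjects differ in size, so a reported headline number is only comparable across models if the same averaging convention is used.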
Claimed scores
No claims have been recorded yet for this benchmark in the Policy Window catalog.
Interpretation guidance
Contamination risk: high
Test-set leakage is widely documented; scores near saturation should not be treated as evidence of generalization. Prefer harder successor benchmarks.
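One common heuristic for spotting leakage is to measure how many word n-grams of a test item also appear verbatim in a training corpus. The sketch below is a minimal version of that idea, not the method used to assign the risk rating on this page. The function names, the default window of 8 words, and the sample question (invented, not a real MMLU item) are all assumptions for illustration.

```python
def word_ngrams(text, n):
    """Return the set of lowercase word n-grams in text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_fraction(test_item, corpus_docs, n=8):
    """Fraction of the test item's n-grams found verbatim in the corpus.

    Returns 0.0 when the item is shorter than n words.
    """
    item_grams = word_ngrams(test_item, n)
    if not item_grams:
        return 0.0
    corpus_grams = set()
    for doc in corpus_docs:
        corpus_grams |= word_ngrams(doc, n)
    return len(item_grams & corpus_grams) / len(item_grams)


# Invented question text, used only to exercise the function.
item = ("which of the following best describes the role of the "
        "judiciary in a federal system")

leaked_corpus = ["Practice quiz. " + item + " Answer: B"]
clean_corpus = ["An unrelated page about growing tomatoes in cold climates."]

leaked = overlap_fraction(item, leaked_corpus)
clean = overlap_fraction(item, clean_corpus)
print(leaked, clean)
```

Exact n-gram matching misses paraphrased or reformatted copies, so a low overlap score is weak evidence of a clean training set rather than proof.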
How to cite this benchmark
Use the primary methodology source for academic citations; reference the Policy Window article for the cross-model leaderboard.
- Primary methodology: https://arxiv.org/abs/2009.03300
- Wiki article: https://policywindow.org/wiki/mmlu
Related benchmarks (general reasoning)
- MMLU-Pro · 2024 · medium contamination
- GPQA Diamond · 2023 · low contamination
- ARC-AGI v2 · 2024 · low contamination