MMLU

Policy Window Editorial Board

Saturated · 2020

Last verified 2026-06-21

Explainer

MMLU (Massive Multitask Language Understanding) is a knowledge-and-reasoning benchmark that measures a model's accuracy on multiple-choice questions drawn from 57 subjects spanning the humanities, STEM, the social sciences, and professional and legal domains. Each item presents a question with four answer options, and a model's score is reported as percent accuracy on a 0–100 scale. In Policy Window's catalog, MMLU sits in the general-reasoning domain, reflecting its design intent: rather than probing a single skill, it samples broadly across fields a knowledgeable generalist might be expected to handle, from elementary mathematics and US history to law and clinical medicine.

The benchmark originated with Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt in "Measuring Massive Multitask Language Understanding," released in 2020 (arXiv:2009.03300, https://arxiv.org/abs/2009.03300) and presented at ICLR 2021. The authors framed it as a test of the extensive "world knowledge and problem solving ability" that models acquire largely during pretraining, evaluated in zero- and few-shot settings rather than after task-specific fine-tuning. Over the following years MMLU became the de facto default for reporting general capability: it appeared in nearly every frontier model release and system card, and because so many labs reported it, it served as a convenient common yardstick.

That ubiquity is now also its central limitation. MMLU is, in the catalog's terms, saturated: leading frontier models cluster in roughly the high-80s to low-90s percent accuracy, and small gaps near the ceiling no longer reliably separate frontier systems from competent mid-tier ones. Two distinct problems compound the saturation. First, the benchmark has a high contamination risk: because MMLU questions are widely published on the open web, they can leak into the pretraining corpora of the very models later evaluated on them, which inflates measured scores in ways that are difficult to disentangle from genuine capability. This concern is well documented in the evaluation literature: Sainz et al., "NLP Evaluation in trouble: On the Need to Measure LLM Data Contamination for each Benchmark" (Findings of EMNLP 2023, https://aclanthology.org/2023.findings-emnlp.722/), argue that test-set contamination can undermine the validity of benchmark results and call for routine contamination auditing. Second, even setting contamination aside, a four-option multiple-choice format near its ceiling has limited headroom to discriminate among the strongest models. The format also sets a 25 percent chance baseline and measures selection from given options rather than open-ended generation, so a model can identify the right answer without being able to produce it.

These pressures motivated a harder successor. MMLU-Pro (Wang, Ma, Zhang, and colleagues, "MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark," 2024, arXiv:2406.01574, https://arxiv.org/abs/2406.01574), published in the NeurIPS 2024 Datasets and Benchmarks Track, expands the answer set from four to ten options, adds more reasoning-intensive items, and removes trivial and erroneous questions. The authors report that these changes drop accuracy by roughly 16 to 33 percent relative to MMLU while improving stability across prompt variations, restoring discriminative headroom. Policy Window records MMLU-Pro (slug mmlu-pro) as MMLU's catalog successor.
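The effect of widening the option set on the guessing floor can be sketched in a few lines. This is a minimal illustration rather than a procedure from either paper: `chance_baseline` and `chance_corrected` are hypothetical helper names, the 40% raw score is invented, and the linear rescaling is a common convention assumed here only to show how the same raw number sits differently above a 25% floor than above a 10% one.

```python
def chance_baseline(num_options: int) -> float:
    """Expected accuracy (in percent) of uniform random guessing."""
    if num_options < 2:
        raise ValueError("a multiple-choice item needs at least two options")
    return 100.0 / num_options


def chance_corrected(accuracy_pct: float, num_options: int) -> float:
    """Rescale accuracy so random guessing maps to 0 and a perfect score to 100.

    Assumption: this rescaling is an illustrative convention, not a metric
    reported by the MMLU or MMLU-Pro authors.
    """
    floor = chance_baseline(num_options)
    return 100.0 * (accuracy_pct - floor) / (100.0 - floor)


if __name__ == "__main__":
    for k in (4, 10):  # MMLU uses four options, MMLU-Pro ten
        print(f"{k} options: chance baseline = {chance_baseline(k):.1f}%")
    # The same hypothetical raw score of 40% under each format.
    print(f"4 options:  {chance_corrected(40.0, 4):.1f}")
    print(f"10 options: {chance_corrected(40.0, 10):.1f}")
```

Moving from four to ten options lowers the guessing floor from 25% to 10%, so a larger share of any given raw score has to come from something other than chance.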

What an MMLU score does and does not establish is worth stating plainly. A high score evidences broad recall of factual and academic knowledge under a constrained multiple-choice format; it is not, on its own, evidence of open-ended reasoning, reliable tool use, or agentic competence in realistic tasks, and contamination can inflate the headline number relative to true held-out ability. For governance and procurement readers, MMLU is therefore best read as one coarse, increasingly saturated signal among many rather than a summary verdict on capability. Policy Window holds this editorial read provisionally: as contamination-auditing methods and successor benchmarks mature, the appropriate weight to place on MMLU may shift.

This benchmark is saturated — for frontier evaluation, consult MMLU-Pro.

What it measures

Massive Multitask Language Understanding: a 57-subject multiple-choice benchmark covering the humanities, STEM, the social sciences, and professional and legal domains.

Saturated: top models score around 92%. Test-set leakage into training corpora is widely documented. MMLU-Pro is the harder successor. Currency (2026-06-21): verified current. MMLU remains saturated, with top scores around 90 to 92 percent (GLM 5 at about 91.7), consistent with the article's 92 percent band and with its "saturated" status and "high" contamination-risk classifications. The Gema et al. label-error figures and the framing of MMLU-Pro and MMLU-CF as successors are confirmed. Only minor, non-material additions exist (2026 contamination dose-response work; the multilingual MMLU-ProX and IndicMMLU-Pro variants).

Saturation and score trajectory

MMLU's score history traces an unusually steep climb from near-random to near-ceiling in roughly four years, which is itself the strongest evidence for the catalog's "saturated" classification. At release, the largest model the authors evaluated, the 175-billion-parameter GPT-3, improved over the four-option random-chance baseline of 25% "by almost 20 percentage points on average" (that is, into the low-40s), while most other models they tested sat at "near random-chance accuracy" ¹. Within two and a half years the frontier had advanced to 86.4% (5-shot), the figure OpenAI reported for GPT-4 ². Such jumps partly reflect that some capabilities surface only above a scale threshold and "would not have been directly predicted by extrapolating" smaller models ³, which makes a fixed test's discriminative lifespan hard to forecast. Subsequent frontier releases compressed into the high-80s to low-90s, the band the lede already notes.

The governance implication of this trajectory is discrimination loss, not merely high scores. Once a leaderboard's leading entries are separated by a point or two against a fixed test of 14,000-odd items, ordinary sampling noise and the dataset's own item errors (see below) can exceed the gaps being reported, so headline differences stop carrying reliable signal about relative capability. This is the mechanism behind the recommendation, recorded on this page, to prefer harder successors: MMLU-Pro restored headroom by cutting accuracy 16–33% relative to MMLU ⁴. The dated figures cited in this section are drawn from the cited primary reports rather than from aggregated leaderboards.
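The sampling-noise part of this argument can be made concrete with a standard binomial approximation. The sketch below is illustrative only: the function names are hypothetical, the item count of 14,000 is a rounding of the "14,000-odd" figure above, the two accuracies are invented, and the models' errors are assumed independent, which tends to overstate the noise when both are scored on the same items.

```python
import math


def accuracy_standard_error(accuracy: float, num_items: int) -> float:
    """Binomial standard error of an accuracy estimate (accuracy as a fraction)."""
    return math.sqrt(accuracy * (1.0 - accuracy) / num_items)


def gap_margin_95(acc_a: float, acc_b: float, num_items: int) -> float:
    """Approximate 95% margin on the difference between two accuracies.

    Assumption: the two models' errors are independent; a paired analysis on
    shared items would usually give a narrower margin.
    """
    se_a = accuracy_standard_error(acc_a, num_items)
    se_b = accuracy_standard_error(acc_b, num_items)
    return 1.96 * math.sqrt(se_a ** 2 + se_b ** 2)


if __name__ == "__main__":
    # Hypothetical near-ceiling models one point apart.
    for n in (14_000, 250):  # roughly the full test vs. a single subject
        margin_pp = 100.0 * gap_margin_95(0.90, 0.91, n)
        print(f"{n} items: 95% margin on the gap is about {margin_pp:.2f} points")
```

With these illustrative inputs the margin is about 0.7 percentage points on the full test, already comparable to a one-point gap before any item errors are counted, and it widens to several points for a single subject of a few hundred items, since the margin scales with the inverse square root of the item count.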

Label errors and item-quality critiques

A distinct line of criticism targets MMLU's internal quality rather than its saturation or leakage: a non-trivial share of its items are simply defective. Gema et al., "Are We Done with MMLU?" ⁵, had 14 expert annotators re-examine MMLU and estimate that 6.49% of questions contain errors, with the defect rate varying enormously by subject — the authors single out the Virology subset, where 57% of analysed questions contained errors. They organise the defects with an error taxonomy spanning two families: question assessment (e.g. Bad Question Clarity, Bad Options Clarity) and ground-truth verification (No Correct Answer, Multiple Correct Answers, Wrong Ground Truth). To support re-evaluation they release MMLU-Redux, a manually re-annotated subset of 5,700 questions across all 57 subjects ⁵.

The consequence is that some of the residual gap between near-ceiling models is being scored against unanswerable or mis-keyed items, so a "wrong" response may be the defensible one. Re-evaluating contemporary models on the corrected subset produced "significant discrepancies" from originally reported metrics and shifts in model rankings ⁵. For a procurement or governance reader, this compounds the saturation problem in a specific way: when the spread between candidate systems is a few points, an irreducible several-percent error floor in the reference itself means small leaderboard differences cannot be attributed confidently to capability. It also matters because a defective reference does not stay contained: where one benchmark anchors many downstream deployment decisions, its flaws propagate — the foundation-model literature warns that "the defects of the foundation model are inherited by all the adapted models downstream" ⁶, and the analogous risk for a shared evaluation standard is that mis-keyed items quietly distort the comparisons built on top of it. Benchmark authority therefore rests on annotation quality, not only on the construct's 57-subject coverage breadth.
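How a defective answer key moves a measured score can be sketched with a toy re-scoring. Everything here is hypothetical: the `ScoredItem` record, the ten example items, and the choice to simply drop flagged items, which is a simplification and not how MMLU-Redux was built (that work re-annotated items rather than only removing them).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScoredItem:
    """One graded benchmark item (all values below are invented example data)."""
    predicted: str      # option letter the model chose
    keyed: str          # option letter the answer key marks as correct
    key_is_valid: bool  # False if a reviewer flagged the item as defective


def accuracy(items: list[ScoredItem]) -> float:
    """Percent of items where the prediction matches the answer key."""
    if not items:
        raise ValueError("no items to score")
    hits = sum(item.predicted == item.keyed for item in items)
    return 100.0 * hits / len(items)


def accuracy_on_valid(items: list[ScoredItem]) -> float:
    """Accuracy after dropping items flagged as defective."""
    return accuracy([item for item in items if item.key_is_valid])


if __name__ == "__main__":
    items = (
        [ScoredItem("A", "A", True)] * 8    # correct on sound items
        + [ScoredItem("B", "C", True)]      # a genuine miss
        + [ScoredItem("B", "D", False)]     # "miss" against a mis-keyed item
    )
    print(f"raw accuracy:          {accuracy(items):.1f}%")
    print(f"accuracy on valid set: {accuracy_on_valid(items):.1f}%")
```

In this toy case a single mis-keyed item shifts the score by several points, which is the kind of movement that can reorder systems separated by a narrow margin.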

Contamination, gaming, and contamination-resistant variants

Because MMLU items are openly published, the same questions can enter the web-scraped corpora used to pretrain the models later graded on them — so a high score can reflect memorisation rather than the generalisation the benchmark is taken to measure. The page already cites Sainz et al. for the general problem; the concern that large models "memorize and leak pieces of training data" is itself well documented in the foundation-model literature ⁷, and the quantitative case against MMLU specifically has since sharpened. Microsoft's MMLU-CF ⁸ rebuilds a comparable multitask test under three explicit decontamination rules and, critically, keeps the test split closed-source while releasing only a validation split, precisely so that future training runs cannot ingest the answers. On this contamination-controlled set GPT-4o scores 73.4% (5-shot) and 71.9% (0-shot) — well below the high-80s/low-90s the same model reports on the original public MMLU, a gap the authors attribute to the original's exposure to leakage ⁸.
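One generic way to look for this kind of leakage is to check whether long word sequences from a benchmark item also appear in a training document. The sketch below is a minimal illustration of that idea and is not the decontamination procedure used by MMLU-CF or by Sainz et al.; the function names, the 8-word window, and the example question and corpus text are all assumptions made for the example.

```python
import re


def word_ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """Lower-case the text, split it into words, and return its word n-grams."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def overlap_fraction(item_text: str, corpus_text: str, n: int = 8) -> float:
    """Fraction of the item's n-grams that also occur in the corpus text."""
    item_grams = word_ngrams(item_text, n)
    if not item_grams:
        return 0.0  # item shorter than n words: nothing to compare
    return len(item_grams & word_ngrams(corpus_text, n)) / len(item_grams)


if __name__ == "__main__":
    # Invented question and documents, for illustration only.
    question = ("Which of the following best describes the function "
                "of the mitochondria in a cell")
    leaked_page = f"Study guide. {question}? Answer: B."
    unrelated_page = "A recipe for sourdough bread with a long, slow overnight rise."
    print(overlap_fraction(question, leaked_page))
    print(overlap_fraction(question, unrelated_page))
```

A high overlap fraction flags an item as a candidate for verbatim exposure; a low one does not rule out paraphrased or translated leakage, which exact matching cannot see.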

This closed-test design is the structural reason such variants exist: a benchmark whose items are public has, by construction, a finite shelf life as a contamination-resistant measure. The same pressure motivates the harder, larger-option MMLU-Pro, which additionally reduced prompt-format sensitivity from 4–5% on MMLU to about 2% ⁴ — relevant because format gaming (answer-position bias, prompt-template tuning) is another way a public four-option set can be optimised without underlying capability gains. For governance use, the operational takeaway is that an MMLU figure cited in a system card cannot be assumed contamination-free, and a held-out or closed-test variant is the appropriate cross-check before treating the number as evidence of generalisation.
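Answer-position bias can be probed by presenting the same item under every ordering of its options and checking whether the selected content stays the same. This is a sketch under stated assumptions: `AnswerFn` is a hypothetical model interface, the two toy "models" and the example question are invented stand-ins, and exhaustive permutation is practical only for small option sets (24 orderings for four options), so a ten-option item would need sampled orderings instead.

```python
import itertools
from typing import Callable, Sequence

# Hypothetical model interface: given a question and an ordered list of option
# texts, return the index of the option the model selects.
AnswerFn = Callable[[str, Sequence[str]], int]


def order_consistency(answer_fn: AnswerFn, question: str,
                      options: Sequence[str]) -> float:
    """Share of option orderings on which the model picks its most frequent choice.

    Assumes the option texts are distinct.
    """
    counts: dict[str, int] = {}
    total = 0
    for perm in itertools.permutations(options):
        chosen_text = perm[answer_fn(question, perm)]
        counts[chosen_text] = counts.get(chosen_text, 0) + 1
        total += 1
    return max(counts.values()) / total


def always_first(question: str, options: Sequence[str]) -> int:
    """Toy stand-in with pure position bias: always selects the first slot."""
    return 0


def picks_paris(question: str, options: Sequence[str]) -> int:
    """Toy stand-in that tracks content regardless of position."""
    return list(options).index("Paris")


if __name__ == "__main__":
    q = "What is the capital of France?"
    opts = ["Paris", "Rome", "Madrid", "Berlin"]
    print(order_consistency(always_first, q, opts))
    print(order_consistency(picks_paris, q, opts))
```

A position-driven responder scores at the chance level on this check while a content-driven one scores 1.0, so low consistency on real items suggests the measured accuracy depends on layout rather than knowledge.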

Results & interpretation

Claimed scores

No claims have been recorded yet for this benchmark in the Policy Window catalog.

How to read this number

Contamination risk: high

Test-set leakage is widely documented; scores near saturation should not be treated as evidence of generalization. Prefer harder successor benchmarks.

What a high score does and does not establish. A score evidences performance on this benchmark’s specific construct under its specific format; it is not, on its own, evidence of general capability, reliable real-world task performance, or safety. This benchmark is saturated, so small differences near the ceiling no longer reliably separate frontier from mid-tier systems.

The second silence (evidence: thin). The evidence that a benchmark score predicts real-world deployment outcomes (construct-to-deployment validity) is sparse; benchmark performance and deployed performance are not established to be the same thing, and contamination can inflate the headline figure above true held-out ability.

Governance relevance

A benchmark measures a capability; governance attaches to the topics that capability bears on. These topic articles carry the instrument×dimension coverage matrix and the social-science so-what for this domain.

Foundation Models / GPAI: coverage matrix + does-governance-work evidence

References

Sources cited inline in the analysis, numbered in order of appearance.

1. Hendrycks et al. (2020) Measuring Massive Multitask Language Understanding. arXiv:2009.03300
2. OpenAI (2023) GPT-4 Technical Report. arXiv:2303.08774
3. Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, et al. (2022) Emergent Abilities of Large Language Models, arXiv (cs.CL) / TMLR. arXiv:2206.07682. Documents 'emergent abilities' that appear only above a scale threshold and 'would not have been directly predicted by extrapolating' smaller models, a core governance unpredictability problem.
4. Wang et al. (2024) MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark. arXiv:2406.01574
5. Gema et al. (2024) Are We Done with MMLU? arXiv:2406.04127
6. Bommasani et al. (2021) On the Opportunities and Risks of Foundation Models, arXiv. arXiv:2108.07258. Defines foundation models and warns homogenization "demands caution, as the defects of the foundation model are inherited by all the adapted models downstream".
7. Hannah Ruschemeier (2025) Generative AI and data protection, Cambridge Forum on AI: Law and Governance. 10.1017/cfl.2024.2. Examines friction between foundation-model training and the GDPR, noting models that 'memorize and leak pieces of training data' cannot be treated as anonymous.
8. Microsoft (2024) MMLU-CF. arXiv:2412.15194

How to cite this benchmark

Use the primary methodology source for academic citations; reference the Policy Window article for the cross-model leaderboard.

@misc{policywindow-mmlu,
  title  = {MMLU},
  author = {Policy Window},
  year   = {2020},
  howpublished = {MMLU (2020)},
  url    = {https://policywindow.org/wiki/mmlu},
  note   = {Primary source: https://arxiv.org/abs/2009.03300}
}

Persistent identifier: https://policywindow.org/wiki/mmlu — committed-stable URL with content-versioning via ?asOf= (rollout pending per methodology §7). DOIs via Zenodo are on the roadmap.
