MMLU-Pro

Policy Window Editorial Board

MMLU-PRO · General reasoning

Live · 2024

Last verified 2026-06-21

What it measures

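Successor to MMLU with ten-option multiple-choice items (up from four), more reasoning-focused tasks, and leaky or ambiguous items removed.

Less saturated than MMLU. Frontier models scored roughly 70-80% around release; 2025-era systems are reported near 90% (see Saturation & score trajectory).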

Construct & what it actually measures

MMLU-Pro is presented as a measure of broad, reasoning-intensive subject mastery, but its construct differs from its predecessor in ways that shape how scores should be read. The benchmark comprises 12,032 questions across 14 disciplines, and the headline change is the expansion of the answer set from four to ten options (average 9.47 options per item; 83% carry the full ten), with additional distractors generated by GPT-4-Turbo and then filtered by a panel of more than ten domain experts ¹. Mechanically, a ten-option item carries up to nine plausible distractors, which lowers the random-guessing floor from 25% to roughly 10% and reduces the headroom that elimination heuristics provide, so a given accuracy reflects more discrimination than the same number on MMLU.
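The effect of the lower guessing floor can be made concrete with a generic correction-for-guessing rescaling. The sketch below is illustrative only: the function name `chance_corrected` is hypothetical, it assumes every item has the same number of options (MMLU-Pro averages 9.47, so ten is an approximation), and the rescaling is a textbook convention rather than a metric defined in the MMLU-Pro paper.

```python
def chance_corrected(accuracy: float, n_options: int) -> float:
    """Rescale accuracy so uniform random guessing maps to 0 and a perfect score to 1."""
    floor = 1.0 / n_options  # expected accuracy of uniform random guessing
    return (accuracy - floor) / (1.0 - floor)


# The same raw accuracy of 70% on a 4-option and a 10-option test.
raw = 0.70
four_option = chance_corrected(raw, 4)   # guessing floor 25%
ten_option = chance_corrected(raw, 10)   # guessing floor 10% (assumed uniform option count)
print(f"4 options:  {four_option:.3f}")
print(f"10 options: {ten_option:.3f}")
```

Under these assumptions, an identical raw accuracy sits further above chance on the ten-option format, which is one way to read the claim that the same number reflects more discrimination.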

The sharpest construct signal is that chain-of-thought (CoT) prompting *raises* MMLU-Pro accuracy relative to direct answering, reversing the pattern observed on the original MMLU, where CoT often did not help ¹. The authors read this as evidence that MMLU-Pro items demand multi-step reasoning rather than fact recall, the kind of broad, adaptable competence that the foundation-model framing treats as the object of measurement ². The corollary, important for governance readers, is that a reported MMLU-Pro number measures the *scaffolding* (CoT, self-consistency, reasoning-mode toggles) as well as the underlying model: a score is a model-plus-protocol artifact, not a pure capability constant. This is a composite editorial reading of the paper's own ablations, not a claim in the paper.
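To show how the protocol can change the recorded answer for the same model outputs, here is a minimal sketch of answer extraction followed by a self-consistency majority vote. Everything in it is an assumption made for illustration: the answer pattern ("the answer is (X)"), the helper names `extract_choice` and `self_consistency`, and the sample completions are hypothetical and are not taken from the paper or any specific harness.

```python
import re
from collections import Counter
from typing import Optional

# Assumed answer format: completions commit with "... the answer is (X)".
ANSWER_RE = re.compile(r"answer is \(?([A-J])\)?")


def extract_choice(completion: str) -> Optional[str]:
    """Return the last option letter the completion commits to, if any."""
    matches = ANSWER_RE.findall(completion)
    return matches[-1] if matches else None


def self_consistency(completions: list[str]) -> Optional[str]:
    """Majority vote over the letters extracted from several sampled completions."""
    votes = Counter(c for c in map(extract_choice, completions) if c is not None)
    return votes.most_common(1)[0][0] if votes else None


# Hypothetical sampled outputs for one item whose gold label is "C".
samples = [
    "Step 1 ... so the answer is (C).",
    "Working through the units, the answer is (B).",
    "Therefore the answer is C",
]
single_sample_pick = extract_choice(samples[1])  # protocol A: score one sample
voted_pick = self_consistency(samples)           # protocol B: majority vote
print(single_sample_pick, voted_pick)
```

In this toy case the two protocols record different answers for the same item from the same set of generations, which is the sense in which a score depends on the protocol as well as the model. The extraction rule is itself a protocol choice: a stricter or looser pattern would count different completions as answered.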

Saturation & score trajectory

MMLU-Pro was introduced explicitly to restore headroom that the original MMLU had lost, and its early scores reflect that: the strongest model in the introducing paper, GPT-4o, reached 72.6% overall, with a stated 16–33 percentage-point accuracy drop relative to MMLU ¹. That headroom has since narrowed substantially. Public aggregator leaderboards place 2025-era frontier systems near 90%, for example Gemini 3 Pro Preview at 89.8% and Claude Opus 4.5 (reasoning mode) at 89.5% (Artificial Analysis, accessed June 2026). The trajectory from a low-70s ceiling at release toward the high-80s within roughly eighteen months is consistent with the empirical scaling relation that test loss falls as a power law in model size, data, and compute ³, and indicates that MMLU-Pro, like MMLU before it, is approaching saturation for the top tier.

Two cautions attach to any such comparison. First, the same model is reported at materially different MMLU-Pro scores across sources (GPT-4o appears variously near 72–77% depending on harness and prompt), so cross-source point comparisons carry several points of slack; trajectory dates should be read as the model's public-availability period, not the leaderboard-entry date ¹. Second, as headroom shrinks, the marginal information in a one- or two-point gain falls. This is the standard saturation signal: score differences stop tracking meaningful capability differences, a problem compounded because some capability gains appear only above scale thresholds rather than smoothly ⁴. The dating and figures are compiled from the cited primary paper and the named leaderboard.
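A rough sense of how much of a score difference could be sampling noise alone comes from the binomial standard error. The sketch below is a back-of-envelope illustration, not an analysis from the cited sources: it assumes the full 12,032-item set is scored, that items are independent, and that a normal approximation is adequate; the function name `accuracy_standard_error` is hypothetical.

```python
import math


def accuracy_standard_error(p: float, n: int) -> float:
    """Binomial standard error of an accuracy p measured on n independent items."""
    return math.sqrt(p * (1.0 - p) / n)


N_ITEMS = 12_032  # benchmark size stated in the article

for p in (0.726, 0.898):
    se = accuracy_standard_error(p, N_ITEMS)
    half_width = 1.96 * se  # normal-approximation 95% interval
    print(f"accuracy {p:.1%}: 95% interval about +/- {half_width:.2%}")
```

Under these assumptions the sampling interval on the full set is under one percentage point at both accuracy levels, so the several-point spread between sources is more plausibly a harness and prompt effect than item-sampling noise. Runs on subsets of the benchmark would have wider intervals.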

Critiques & limitations

MMLU-Pro is materially more robust than its predecessor: across 24 prompt templates, score sensitivity to prompt phrasing fell from 4–5% on MMLU to about 2% on MMLU-Pro, and trivial or mislabeled items flagged in expert review were removed ¹. Those are genuine improvements, but several documented limitations remain. First, the multiple-choice paradigm itself leaves room for position and shortcut exploitation: dedicated work on the original MMLU found that shuffling answer order alone dropped accuracy by 6.2 to 27.2 percentage points across ten models, evidence that systems can lean on answer-position regularities rather than reasoning ⁵. That study targets MMLU, not MMLU-Pro, so it bounds an inherited risk class rather than measuring MMLU-Pro directly, a distinction worth preserving.
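The answer-order perturbation described above can be sketched in a few lines. This is a generic illustration, not the cited study's code: the helper name `shuffle_options` and the example item are hypothetical, and a real evaluation would also need to re-render the option letters in the prompt.

```python
import random


def shuffle_options(
    options: list[str], gold_index: int, rng: random.Random
) -> tuple[list[str], int]:
    """Permute the options and return them with the gold answer's new index."""
    order = list(range(len(options)))
    rng.shuffle(order)
    shuffled = [options[i] for i in order]
    new_gold = order.index(gold_index)  # position where the original gold option landed
    return shuffled, new_gold


# Hypothetical item; content and labels are invented for illustration.
options = ["4", "5", "6", "7"]
gold = 1  # index of the correct option, "5"
rng = random.Random(0)  # fixed seed so the perturbation is reproducible
shuffled, new_gold = shuffle_options(options, gold, rng)
print(shuffled, new_gold)
```

Scoring a model on both the original and the shuffled items, with the gold index remapped as above, gives a direct estimate of how much accuracy depends on answer position rather than content.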

Second, the MMLU-Pro+ extension showed that even on the harder set, models exhibit measurable anchoring bias: by introducing items with more than one correct option and a 'shortcut selection ratio', the authors exposed varying degrees of shortcut learning across six frontier models, indicating that high MMLU-Pro scores can coexist with brittle higher-order reasoning ⁶. Third, the GPT-4-Turbo-generated distractors introduce a model-in-the-loop construction dependency — the homogenization concern that defects of a foundation model are inherited by what is built on it ² — and residual label noise from the underlying MMLU source is reduced but not eliminated by expert review. The net editorial reading: MMLU-Pro is more discriminating and more prompt-stable than MMLU, yet remains a multiple-choice instrument whose headline numbers can overstate robustness of reasoning.

Results & interpretation

Claimed scores

Model	Score	Claim type	Reported	Citation
gemini-2.5-pro	86.7 % accuracy	press release	2025-05-20	Google DeepMind announcement

How to read this number

Contamination risk: medium

Some test items may leak into training corpora; treat headline scores with mild skepticism and prefer evaluation runs with held-out subsets.

What a high score does and does not establish. A score evidences performance on this benchmark’s specific construct under its specific format; it is not, on its own, evidence of general capability, reliable real-world task performance, or safety.

The second silence (evidence: thin). The evidence that a benchmark score predicts real-world deployment outcomes (construct-to-deployment validity) is sparse; benchmark performance and deployed performance are not established to be the same thing, and contamination can inflate the headline figure above true held-out ability.

Governance relevance

A benchmark measures a capability; governance attaches to the topics that capability bears on. These topic articles carry the instrument×dimension coverage matrix and the social-science so-what for this domain.

Foundation Models / GPAI: coverage matrix + does-governance-work evidence

References

Sources cited inline in the analysis (linked from the superscript markers), then the primary instrument sources behind the classifications.

[ref-1] arXiv:2406.01574
[ref-2] Bommasani et al. (2021) On the Opportunities and Risks of Foundation Models, arXiv. arXiv:2108.07258. Defines foundation models and warns homogenization "demands caution, as the defects of the foundation model are inherited by all the adapted models downstream".
[ref-3] Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, Dario Amodei (2020) Scaling Laws for Neural Language Models, arXiv (cs.LG). arXiv:2001.08361. Establishes that model 'loss scales as a power-law with model size, dataset size, and the amount of compute', the empirical basis for compute-threshold regulation of foundation models.
[ref-4] Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, et al. (2022) Emergent Abilities of Large Language Models, arXiv (cs.CL) / TMLR. arXiv:2206.07682. Documents 'emergent abilities' that appear only above a scale threshold and 'would not have been directly predicted by extrapolating' smaller models, a core governance unpredictability problem.
[ref-5] arXiv:2406.19470
[ref-6] arXiv:2409.02257
[ref-7] gemini-2.5-pro: 86.7% accuracy (Google DeepMind announcement, 2025-05-20)

How to cite this benchmark

Use the primary methodology source for academic citations; reference the Policy Window article for the cross-model leaderboard.

@misc{policywindow-mmlu-pro,
  title  = {MMLU-Pro},
  author = {Policy Window},
  year   = {2024},
  howpublished = {MMLU-PRO (2024)},
  url    = {https://policywindow.org/wiki/mmlu-pro},
  note   = {Primary source: https://arxiv.org/abs/2406.01574}
}

Verify the year before use and refine the pasted entry as needed. The primary source is linked in the BibTeX/RIS note.

Persistent identifier: https://policywindow.org/wiki/mmlu-pro — committed-stable URL with content-versioning via ?asOf= (rollout pending per methodology §7). DOIs via Zenodo are on the roadmap.
