MMLU


Live · 2020

MMLU (Massive Multitask Language Understanding) is a general reasoning benchmark published in 2020. It consists of multiple-choice questions across 57 subjects spanning the humanities, STEM, the social sciences, and professional fields such as law. Contamination risk: high.

What this benchmark measures

MMLU measures multitask language understanding: a model answers multiple-choice questions drawn from 57 subjects across the humanities, STEM, the social sciences, and professional fields such as law.

The benchmark is saturating: top models score around 92%. Test-set leakage into training corpora is widely documented. MMLU-Pro is the harder successor.
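To make a headline figure such as "around 92%" concrete, here is a minimal sketch of how accuracy on a multi-subject multiple-choice benchmark can be computed. The record layout (subject, gold answer letter, predicted letter) and the function name `score_multiple_choice` are assumptions for illustration, not the official MMLU evaluation harness. The sketch reports both a pooled (micro) average and a per-subject (macro) average, because the two can differ when subjects have different numbers of questions.

```python
from collections import defaultdict


def score_multiple_choice(records):
    """Score (subject, gold, predicted) tuples whose answers are letters.

    Hypothetical helper for illustration; not the official MMLU harness.
    Returns (per_subject_accuracy, micro_average, macro_average).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for subject, gold, predicted in records:
        total[subject] += 1
        # Compare answer letters case-insensitively, ignoring stray spaces.
        if predicted.strip().upper() == gold.strip().upper():
            correct[subject] += 1
    if not total:
        raise ValueError("no records to score")
    per_subject = {s: correct[s] / total[s] for s in total}
    # Micro average: every question counts equally.
    micro = sum(correct.values()) / sum(total.values())
    # Macro average: every subject counts equally.
    macro = sum(per_subject.values()) / len(per_subject)
    return per_subject, micro, macro


if __name__ == "__main__":
    demo = [
        ("law", "A", "A"),
        ("law", "B", "C"),
        ("physics", "D", "d"),
        ("physics", "A", "A"),
    ]
    print(score_multiple_choice(demo))
```

When comparing reported scores, it helps to check which of the two averages a source uses, since they are not interchangeable.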

Claimed scores

No claims have been recorded yet for this benchmark in the Policy Window catalog.

Interpretation guidance

Contamination risk: high

Test-set leakage is widely documented, so scores near saturation should not be treated as evidence of generalization. Prefer harder successor benchmarks such as MMLU-Pro.
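One common heuristic for spotting this kind of leakage is to look for long word n-grams that a test question shares with a training document. The sketch below is illustrative only: the function names, the whitespace tokenization, and the default window of 8 words are assumptions, and it is not the method used to assign the risk rating in this article.

```python
def ngrams(text, n):
    """Return the set of lowercase word n-grams in text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_fraction(question, corpus_doc, n=8):
    """Fraction of the question's n-grams that also appear in corpus_doc.

    Hypothetical helper: a high value suggests the question text may have
    been present in the training data. Returns 0.0 when the question is
    shorter than n words.
    """
    question_grams = ngrams(question, n)
    if not question_grams:
        return 0.0
    shared = question_grams & ngrams(corpus_doc, n)
    return len(shared) / len(question_grams)


if __name__ == "__main__":
    q = "which of the following is a prime number"
    doc = "quiz: which of the following is a prime number? answer below"
    print(overlap_fraction(q, doc, n=4))
```

A check like this only catches near-verbatim copies. Paraphrased or translated leakage would pass unnoticed, which is one reason a low overlap score does not by itself show that a benchmark is clean.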

How to cite this benchmark

Use the primary methodology source for academic citations; reference the Policy Window article for the cross-model leaderboard.

References

  1. MMLU methodology

Generated from the Policy Window catalog. Each claim cites the originating primary source.