MMLU-Pro

MMLU-PRO · General reasoning

Live · 2024

MMLU-Pro is a general reasoning benchmark published in 2024. It is a successor to MMLU that uses 10-option multiple-choice questions (up from 4), adds more reasoning-focused tasks, and removes leaky or ambiguous items. Contamination risk: medium.

What this benchmark measures

MMLU-Pro is the successor to MMLU. It expands multiple-choice questions from 4 options to 10, shifts the task mix toward reasoning, and drops items that were leaky or ambiguous in the original benchmark.

It is less saturated than MMLU: frontier models score roughly 70-80%.
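One consequence of moving from 4 to 10 options is a lower random-guess baseline, which changes how a raw accuracy figure should be read. The sketch below is illustrative only: the function names are hypothetical, and it assumes every item has exactly the stated number of options, which is a simplification.

```python
def chance_accuracy(num_options: int) -> float:
    """Expected accuracy of uniform random guessing on k-option items."""
    if num_options < 2:
        raise ValueError("a multiple-choice item needs at least 2 options")
    return 1.0 / num_options


def chance_corrected(accuracy: float, num_options: int) -> float:
    """Rescale accuracy so that random guessing maps to 0 and a perfect score to 1.

    Assumption: all items share the same number of options.
    """
    chance = chance_accuracy(num_options)
    return (accuracy - chance) / (1.0 - chance)


if __name__ == "__main__":
    for k in (4, 10):
        print(f"{k} options: chance accuracy = {chance_accuracy(k):.2f}")
    # The same raw accuracy sits further above chance with 10 options than with 4.
    print(f"0.80 raw, 4 options:  {chance_corrected(0.80, 4):.3f} chance-corrected")
    print(f"0.80 raw, 10 options: {chance_corrected(0.80, 10):.3f} chance-corrected")
```

Under these assumptions, guessing yields 25% on a 4-option benchmark but only 10% on a 10-option one, so identical headline percentages on the two benchmarks are not directly comparable.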

Claimed scores

Model | Score | Claim type | Reported | Citation
gemini-2.5-pro | 86.7% accuracy | press release | 2025-05-20 | Google DeepMind announcement

Interpretation guidance

Contamination risk: medium

Some test items may leak into training corpora; treat headline scores with mild skepticism and prefer evaluation runs with held-out subsets.
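As a rough illustration of the held-out comparison suggested above, the sketch below contrasts accuracy on publicly circulated items with accuracy on a held-out subset. Everything here is an assumption for illustration: the function names, the dictionary-based data layout, and the premise that a set of held-out item IDs is available at all. A large positive gap is only a hint of contamination, not proof.

```python
from typing import Dict, Iterable, Set


def accuracy(predictions: Dict[str, str], answers: Dict[str, str],
             item_ids: Iterable[str]) -> float:
    """Fraction of the given items whose predicted option matches the answer key."""
    ids = list(item_ids)
    if not ids:
        raise ValueError("cannot compute accuracy over an empty item set")
    correct = sum(1 for item_id in ids if predictions.get(item_id) == answers[item_id])
    return correct / len(ids)


def contamination_gap(predictions: Dict[str, str], answers: Dict[str, str],
                      held_out_ids: Set[str]) -> float:
    """Accuracy on public items minus accuracy on held-out items.

    held_out_ids is a hypothetical set of item IDs believed to be
    absent from training corpora.
    """
    public_ids = [item_id for item_id in answers if item_id not in held_out_ids]
    held_ids = [item_id for item_id in answers if item_id in held_out_ids]
    return accuracy(predictions, answers, public_ids) - accuracy(predictions, answers, held_ids)


if __name__ == "__main__":
    # Toy data, invented purely to exercise the functions.
    answers = {"q1": "A", "q2": "B", "q3": "C", "q4": "D"}
    predictions = {"q1": "A", "q2": "B", "q3": "C", "q4": "A"}
    gap = contamination_gap(predictions, answers, held_out_ids={"q3", "q4"})
    print(f"public minus held-out accuracy: {gap:.2f}")
```

On realistic data the two subsets should also be matched for subject and difficulty; otherwise the gap may reflect item mix rather than leakage.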

How to cite this benchmark

Use the primary methodology source for academic citations; reference the Policy Window article for the cross-model leaderboard.

References

  1. MMLU-Pro methodology
  2. gemini-2.5-pro: 86.7% accuracy (Google DeepMind announcement, 2025-05-20)

Generated from the Policy Window catalog at . Each claim cites the originating primary source.