MATH (Hendrycks)

MATH · Mathematical reasoning

Live · 2021

MATH (Hendrycks) is a mathematical reasoning benchmark published in 2021. It comprises 12,500 competition-math problems drawn from contests such as AMC and AIME, and it evaluates step-by-step reasoning and final-answer accuracy. Contamination risk: medium.

What this benchmark measures

The benchmark consists of 12,500 competition-math problems from AMC, AIME, and similar contests. It evaluates step-by-step reasoning and final-answer accuracy.
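As a rough illustration of final-answer scoring, the sketch below pulls the last \boxed{...} expression out of a worked solution and compares it with the reference by exact match. It assumes answers are marked with \boxed{...}, as in the MATH reference solutions; the helper names (extract_boxed, normalize, final_answer_accuracy) and the very light normalization are illustrative assumptions, not the benchmark's official grader.

```python
from typing import Optional, Sequence


def extract_boxed(solution: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...} in a solution, or None."""
    marker = "\\boxed{"
    start = solution.rfind(marker)
    if start == -1:
        return None
    i = start + len(marker)
    depth = 1  # we are already inside the opening brace of \boxed{
    chars = []
    while i < len(solution):
        ch = solution[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars)
        chars.append(ch)
        i += 1
    return None  # braces never balanced


def normalize(answer: str) -> str:
    """Hypothetical minimal normalization: drop spaces and a trailing period."""
    return answer.replace(" ", "").rstrip(".")


def final_answer_accuracy(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Fraction of predictions whose boxed answer exactly matches the reference."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    correct = 0
    for pred, ref in zip(predictions, references):
        p, r = extract_boxed(pred), extract_boxed(ref)
        if p is not None and r is not None and normalize(p) == normalize(r):
            correct += 1
    return correct / len(references)
```

Exact match on a normalized string is strict: mathematically equivalent answers written differently (for example 0.5 versus \frac{1}{2}) count as wrong here, so real graders usually apply more extensive normalization.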

Frontier reasoning models now score above 90%, so the benchmark is close to saturated at the top end. AIME-2024 is the harder successor for math evaluation that is not yet saturated.

Claimed scores

No claims have been recorded yet for this benchmark in the Policy Window catalog.

Interpretation guidance

Contamination risk: medium

Some test items may leak into training corpora; treat headline scores with mild skepticism and prefer evaluation runs with held-out subsets.
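One way to approximate a held-out subset is to drop test problems that share long word n-grams with a sample of the training corpus. The sketch below is a crude heuristic under stated assumptions: the function names, the 8-word window, and the zero-overlap threshold are all hypothetical choices, and it does not reflect how the Policy Window catalog or any model developer actually screens for contamination.

```python
from typing import Iterable, List, Set, Tuple

NGram = Tuple[str, ...]


def word_ngrams(text: str, n: int = 8) -> Set[NGram]:
    """Set of lowercase word n-grams in a text (empty if the text is shorter than n words)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_fraction(problem: str, corpus_ngrams: Set[NGram], n: int = 8) -> float:
    """Share of the problem's n-grams that also occur in the corpus sample."""
    grams = word_ngrams(problem, n)
    if not grams:
        return 0.0
    return len(grams & corpus_ngrams) / len(grams)


def held_out_subset(
    problems: Iterable[str],
    corpus_docs: Iterable[str],
    n: int = 8,
    threshold: float = 0.0,
) -> List[str]:
    """Keep only problems whose n-gram overlap with the corpus is at or below the threshold."""
    corpus_ngrams: Set[NGram] = set()
    for doc in corpus_docs:
        corpus_ngrams |= word_ngrams(doc, n)
    return [p for p in problems if overlap_fraction(p, corpus_ngrams, n) <= threshold]
```

A filter like this only catches near-verbatim leaks in the corpus sample that is checked; paraphrased or translated copies of a problem pass through, so a clean result lowers contamination risk without ruling it out.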

How to cite this benchmark

Use the primary methodology source for academic citations; reference the Policy Window article for the cross-model leaderboard.

References

  1. MATH (Hendrycks) methodology

Generated from the Policy Window catalog. Each claim cites the originating primary source.

Wiki articles regenerate when the underlying catalog updates. Tracked revisions arrive in a future iteration.