MATH (Hendrycks)
MATH (Hendrycks) is a mathematical reasoning benchmark published in 2021. It comprises 12,500 competition-math problems drawn from sources such as AMC and AIME, and evaluates step-by-step reasoning and final-answer accuracy. Contamination risk: medium.
What this benchmark measures
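The benchmark contains 12,500 competition-math problems from AMC, AIME, and similar contests. It evaluates step-by-step reasoning and final-answer accuracy.
Frontier reasoning models now score 90% or higher, so the benchmark is close to saturated. AIME 2024 is the harder successor for math evaluation that is not yet saturated.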
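Final-answer accuracy can be illustrated with a minimal sketch. It assumes each solution marks its final answer with `\boxed{...}`, as the reference solutions in the dataset do, and compares answers by exact string match after light normalization. The function names and the normalization rule are illustrative assumptions, not the official grading code.

```python
from typing import Optional, Sequence


def extract_boxed(solution: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...} in a solution, or None."""
    marker = "\\boxed{"
    start = solution.rfind(marker)
    if start == -1:
        return None
    i = start + len(marker)
    depth = 1  # we are inside the opening brace of \boxed{
    out = []
    while i < len(solution):
        ch = solution[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(out)
        out.append(ch)
        i += 1
    return None  # braces never balanced


def normalize(answer: str) -> str:
    # Assumption: only whitespace and a trailing period are ignored.
    return answer.replace(" ", "").rstrip(".")


def final_answer_accuracy(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Fraction of predictions whose boxed answer matches the reference's."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have equal length")
    if not references:
        return 0.0
    correct = 0
    for pred, ref in zip(predictions, references):
        p, r = extract_boxed(pred), extract_boxed(ref)
        if p is not None and r is not None and normalize(p) == normalize(r):
            correct += 1
    return correct / len(references)
```

Exact string matching is strict: mathematically equivalent forms such as `0.5` and `\frac{1}{2}` count as different answers here, so a real evaluation would need a stronger equivalence check.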
Claimed scores
No claims have been recorded yet for this benchmark in the Policy Window catalog.
Interpretation guidance
Contamination risk: medium
Some test items may leak into training corpora; treat headline scores with mild skepticism and prefer evaluation runs with held-out subsets.
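One way to build a held-out subset is to drop test items that share long word n-grams with a sample of the training corpus. The sketch below is a rough heuristic under stated assumptions: the helper names, the whitespace tokenization, and the default n-gram length of 8 are all hypothetical choices, not a method prescribed by the benchmark.

```python
from typing import Iterable, List, Set, Tuple


def ngrams(text: str, n: int = 8) -> Set[Tuple[str, ...]]:
    """Lowercased whitespace-token n-grams of a text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def split_by_overlap(
    test_items: Iterable[str],
    corpus_docs: Iterable[str],
    n: int = 8,
) -> Tuple[List[str], List[str]]:
    """Split test items into (clean, flagged) by n-gram overlap with the corpus."""
    corpus_grams: Set[Tuple[str, ...]] = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)

    clean, flagged = [], []
    for item in test_items:
        # Any shared n-gram marks the item as possibly leaked.
        if ngrams(item, n) & corpus_grams:
            flagged.append(item)
        else:
            clean.append(item)
    return clean, flagged
```

Scores reported on the `clean` list alone give a more conservative estimate. An item with no overlap in the sampled corpus may still have leaked through paraphrase or through documents outside the sample.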
How to cite this benchmark
Use the primary methodology source for academic citations; reference the Policy Window article for the cross-model leaderboard.
- Primary methodology: https://arxiv.org/abs/2103.03874
- Wiki article: https://policywindow.org/wiki/math-benchmark
Related benchmarks (mathematical reasoning)
- AIME 2024 · 2024 · low contamination
- FrontierMath · 2024 · low contamination