MATH (Hendrycks)

Policy Window Editorial Board

MATH · Mathematical reasoning

Saturated · 2021

Last verified 2026-06-21

This benchmark is saturated — for frontier evaluation, consult AIME 2024.

What it measures

12,500 competition-mathematics problems drawn from AMC, AIME, and similar contests. The benchmark targets step-by-step reasoning and is scored on final-answer accuracy.

Frontier reasoning models now score above 90%, and AIME 2024 is the harder successor for unsaturated math evaluation. As of 2026-06-21, MATH and its MATH-500 subset are more thoroughly saturated than the latest data point cited in this article (OpenAI o1, 94.8%, 2024): current frontier models cluster at about 99% on MATH-500 (for example GPT-5 at 99.4%, o3 at 99.2%, and LongCat-Flash-Thinking at 99.2%, per the Artificial Analysis and llm-stats leaderboards).

Saturation and score trajectory

When MATH was released, it was a deliberately hard target: across the large language models tested in 2021, accuracy ranged only from 3.0% to 6.9%, and the authors observed that "accuracy remains relatively low, even with enormous Transformer models," warning that "simply increasing budgets and model parameter counts will be impractical for achieving strong mathematical reasoning if scaling trends continue" ¹. That forecast was overtaken within roughly eighteen months. Minerva, a PaLM model further trained on mathematical and scientific text, reached 33.6% on MATH with greedy decoding and 50.3% using majority voting over many samples, against a quoted prior published result of 6.9% ². Process-supervised reward modeling pushed accuracy on a representative 500-problem MATH subset to 78% ³, and OpenAI's o1 reported 94.8% on MATH under 0-shot chain-of-thought prompting (OpenAI 2024, "Learning to reason with LLMs").

The benchmark is now widely treated as saturated for frontier systems: scores cluster at or above the roughly 90% human reference set by a three-time IMO gold medalist ¹, so small differences no longer reliably separate frontier from mid-tier models. This is why the Policy Window catalog routes frontier mathematical evaluation to AIME 2024 and FrontierMath instead. The progression below is a composite drawn from the cited primary reports; figures use differing prompting and sampling protocols and are not strictly like-for-like.

Reported MATH (Hendrycks) accuracy over time. Protocols differ across rows (prompting, sampling, and in one case a 500-problem subset) and are not strictly comparable; entries are drawn from the cited primary sources.
Model / system	MATH accuracy	Year	Source
Large LMs (incl. GPT-2/GPT-3 class)	3.0%–6.9%	2021	Hendrycks et al., arXiv:2103.03874
Minerva 540B (greedy)	33.6%	2022	Lewkowycz et al., arXiv:2206.14858
Minerva 540B (majority vote)	50.3%	2022	Lewkowycz et al., arXiv:2206.14858
Process-reward model (MATH-500 subset)	78%	2023	Lightman et al., arXiv:2305.20050
OpenAI o1 (0-shot CoT)	94.8%	2024	OpenAI, "Learning to reason with LLMs"
Human reference (IMO gold medalist)	~90%	2021	Hendrycks et al., arXiv:2103.03874
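
The gap between the two Minerva rows comes from majority voting: many solutions are sampled for the same problem and the most frequent final answer is the one that gets scored. The sketch below illustrates only that aggregation step. The function name and the sample answers are hypothetical, and the sampling and answer-extraction details of the cited system are not reproduced here.

```python
from collections import Counter


def majority_vote(final_answers):
    """Return the most frequent final answer among sampled solutions."""
    if not final_answers:
        raise ValueError("need at least one sampled answer")
    counts = Counter(final_answers)
    # most_common(1) yields a one-element list of (answer, count) pairs.
    return counts.most_common(1)[0][0]


# Hypothetical final answers extracted from five sampled solutions to one problem.
samples = ["12", "15", "12", "7", "12"]
voted = majority_vote(samples)
print(voted)
```

Because the vote is taken over extracted final answers, it inherits whatever answer normalization the harness applies: two samples that write the same value differently count as different votes unless they are canonicalized first.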

Contamination and gaming

MATH carries a documented contamination exposure because its 12,500 problems are drawn from public competition sources (AMC, AIME, and similar) whose problems and worked solutions circulate widely on the open web that pretraining corpora ingest ¹. The risk is documented at the survey level: contamination of math reasoning benchmarks by web-scale pretraining corpora is a recognised and recurring problem that complicates treating headline figures as held-out generalization ⁴. Standard string- and n-gram-based decontamination is moreover insufficient: Yang et al. ⁵ show that paraphrased or translated test items evade conventional filters, letting a 13B model "easily overfit a test benchmark and achieve drastically high performance, on par with GPT-4," and propose an LLM-based detector in response. Quantifying the inflation, inference-time decontamination reduced measured GSM8K accuracy by 22.9% and MMLU by 19.0% once leaked items were rewritten ⁶. These pressures are the explicit rationale for held-out and curated variants: OpenAI's MATH-500, a 500-problem held-out subset used for process-supervision evaluation, exists precisely so that scoring is not done over items whose training status is uncertain ³. For governance use, this means a high MATH number should be read as an upper bound that may embed memorization rather than a clean measure of reasoning.
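
To make the filter-evasion point concrete, here is a minimal sketch of a word-level n-gram overlap check of the kind conventional decontamination relies on. The function names, the window size of 8, and both example strings are assumptions chosen for illustration; they are not taken from any of the cited papers or from a specific decontamination tool.

```python
def word_ngrams(text, n):
    """Set of lowercase word n-grams in a string."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def ngram_overlap(test_item, corpus_doc, n=8):
    """Fraction of the test item's n-grams that also occur in the corpus document."""
    item_grams = word_ngrams(test_item, n)
    if not item_grams:
        return 0.0
    return len(item_grams & word_ngrams(corpus_doc, n)) / len(item_grams)


# Hypothetical test item, a verbatim leak of it, and a paraphrase of it.
test_item = "what is the sum of all positive integers less than 100 that are divisible by 7"
verbatim_leak = "problem 3: " + test_item + " answer: 735"
paraphrase = "add up every positive whole number below 100 which is a multiple of 7"

print(ngram_overlap(test_item, verbatim_leak))  # every n-gram of the item is found
print(ngram_overlap(test_item, paraphrase))     # no n-gram of the item is found
```

The verbatim copy is flagged with full overlap, while the paraphrase poses the same problem and shares no 8-gram with it, so a filter of this shape would let it through.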

Critiques and limitations

Beyond contamination, MATH has structural measurement limits. Its scoring checks only the final extracted answer, not the validity of the intermediate reasoning, so a model can reach the right number through flawed or lucky steps and a correct chain can be marked wrong on a formatting mismatch ¹ ³. Answer-only grading also introduces extraction and format sensitivity: equivalent forms (a fraction versus a decimal, an unsimplified versus simplified radical, ordering of a solution set) can be scored as failures unless the harness normalizes them, a source of grading noise that later math benchmarks explicitly redesigned away from ⁷. Because competition problems were repurposed for short-answer evaluation, items whose original form is proof-based or admits multiple valid answers fit awkwardly into a single-answer key, and natural-language proof correctness cannot be mechanically checked the way a final answer can. These are editorial observations synthesizing the cited methodological literature rather than a claim of a specific catalogued label-error count in MATH. Taken together with saturation, they support the article's existing caution: near the ceiling, and under answer-only scoring on partly public items, MATH no longer cleanly discriminates genuine mathematical reasoning among frontier systems.
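
The format-sensitivity problem can be illustrated with a small sketch that compares raw string matching against a comparison that first normalizes simple numeric answers. It assumes answers are plain integers, decimals, or a/b fractions and falls back to string equality for everything else (radicals, solution sets, LaTeX). The function names are hypothetical, and this is not the grading code of any particular harness.

```python
from fractions import Fraction


def parse_number(answer):
    """Parse an integer, decimal, or a/b fraction into an exact Fraction.

    Returns None for anything this sketch does not handle.
    """
    text = answer.strip().replace(" ", "")
    try:
        return Fraction(text)
    except (ValueError, ZeroDivisionError):
        return None


def exact_match(pred, gold):
    """String equality after trimming surrounding whitespace."""
    return pred.strip() == gold.strip()


def normalized_match(pred, gold):
    """Compare numerically when both answers parse, otherwise as strings."""
    p, g = parse_number(pred), parse_number(gold)
    if p is None or g is None:
        return exact_match(pred, gold)
    return p == g


print(exact_match("0.5", "1/2"))       # False: equivalent forms, different strings
print(normalized_match("0.5", "1/2"))  # True: both parse to the same Fraction
```

A real harness needs far broader normalization (units, radicals, interval and set notation), which is why grading noise of this kind is hard to eliminate entirely.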

Results & interpretation

Claimed scores

No claims have been recorded yet for this benchmark in the Policy Window catalog.

How to read this number

Contamination risk: medium

Some test items may leak into training corpora; treat headline scores with mild skepticism and prefer evaluation runs with held-out subsets.

What a high score does and does not establish. A score evidences performance on this benchmark’s specific construct under its specific format; it is not, on its own, evidence of general capability, reliable real-world task performance, or safety. This benchmark is saturated, so small differences near the ceiling no longer reliably separate frontier from mid-tier systems.
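
A rough calculation shows why differences near the ceiling carry little information. The sketch below applies the simple binomial standard error to two of the MATH-500 figures quoted earlier in this article (99.4% and 99.2% on 500 problems). It treats the two runs as independent samples, which is an assumption: both models are scored on the same items, and the normal approximation is itself crude this close to 100%.

```python
import math


def accuracy_standard_error(accuracy, n_items):
    """Binomial standard error of an accuracy estimated on n_items problems."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_items)


n = 500                # number of problems in the MATH-500 subset
a, b = 0.994, 0.992    # two leaderboard scores quoted in this article

se_a = accuracy_standard_error(a, n)
se_b = accuracy_standard_error(b, n)
# Standard error of the difference, assuming independent runs (see lead-in).
se_diff = math.sqrt(se_a ** 2 + se_b ** 2)
gap_in_problems = round((a - b) * n)

print(f"gap = {a - b:.3f} ({gap_in_problems} problem), se of gap = {se_diff:.4f}")
```

Under these assumptions the 0.2-point gap corresponds to a single problem and is smaller than its own standard error, so it cannot by itself rank the two systems.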

The second silence (evidence: thin). The evidence that a benchmark score predicts real-world deployment outcomes (construct-to-deployment validity) is sparse; benchmark performance and deployed performance are not established to be the same thing, and contamination can inflate the headline figure above true held-out ability.

Governance relevance

A benchmark measures a capability; governance attaches to the topics that capability bears on. These topic articles carry the instrument×dimension coverage matrix and the social-science so-what for this domain.

Foundation Models / GPAI: coverage matrix and does-governance-work evidence

References

Sources cited inline in the analysis, numbered in order of appearance.

How to cite this benchmark

Use the primary methodology source for academic citations; reference the Policy Window article for the cross-model leaderboard.

@misc{policywindow-math-benchmark,
  title  = {MATH (Hendrycks)},
  author = {Policy Window},
  year   = {2021},
  howpublished = {MATH (2021)},
  url    = {https://policywindow.org/wiki/math-benchmark},
  note   = {Primary source: https://arxiv.org/abs/2103.03874}
}

Persistent identifier: https://policywindow.org/wiki/math-benchmark — committed-stable URL with content-versioning via ?asOf= (rollout pending per methodology §7). DOIs via Zenodo are on the roadmap.

[1] arXiv:2103.03874

[2] arXiv:2206.14858

[3] arXiv:2305.20050

[4] arXiv:2310.18018

[5] arXiv:2311.04850

[6] arXiv:2406.13990

[7] arXiv:2410.07985
