FrontierMath

Policy Window Editorial Board


FRONTIER-MATH · Mathematical reasoning

Live · 2024

Tools

Last verified 2026-06-21


What it measures

Hundreds of original math problems, curated by research mathematicians, that require deep reasoning. Held-out evaluation only.

Epoch AI evaluation. Top reasoning models scored 2-5% at launch; OpenAI's o3-preview was reported at 25% under a custom harness. Currency note (2026-06-21): the article's data stops at April 2025 (o4-mini 17%, o3 10%). Since then, state-of-the-art accuracy on Tiers 1-3 has risen above 40% (GPT-5.2 / Claude Opus 4.6, per IEEE Spectrum), and Epoch's own record on Tier 4 reached 31% (GPT-5.2 Pro, 15/48, January 2026, up from a prior maximum of 19%). Epoch also shipped FrontierMath v2 on 2026-06-12, after an AI-assisted audit found critical errors in about 42% of problems (135 corrected, 12 removed, leaving 338 in total), a notable validity finding.
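The audit figures above can be cross-checked with a few lines of arithmetic. The sketch below assumes a pre-audit corpus of 350 problems (300 in Tiers 1-3 plus 50 in Tier 4, as described later in this article); that baseline is an inference, not a figure Epoch states in the summary itself.

```python
# Assumption: pre-audit corpus = 300 (Tiers 1-3) + 50 (Tier 4) problems.
PRE_AUDIT_TOTAL = 300 + 50

corrected = 135  # problems corrected in the v2 audit (from the text)
removed = 12     # problems removed in the v2 audit (from the text)

flagged = corrected + removed                  # problems with critical errors
flagged_share = flagged / PRE_AUDIT_TOTAL      # share of the corpus affected
post_audit_total = PRE_AUDIT_TOTAL - removed   # size of the v2 corpus

tier4_rate = 15 / 48  # GPT-5.2 Pro on Tier 4, as reported in the text

print(f"flagged: {flagged} of {PRE_AUDIT_TOTAL} ({flagged_share:.0%})")
print(f"post-audit total: {post_audit_total}")
print(f"Tier 4 solve rate: {tier4_rate:.0%}")
```

Under that assumed baseline the reported numbers are mutually consistent: 147 flagged problems is 42% of 350, removing 12 leaves 338, and 15 of 48 is about 31%.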

Construct: what it actually measures

FrontierMath is widely read as a proxy for frontier mathematical *reasoning*, but its authors and reviewers caution that the construct is narrower than that framing implies. The benchmark is auto-graded, so every problem is engineered to have a single closed-form answer that is "either numerical values or SymPy-verifiable symbolic expressions" ¹, typically a large integer or specific constant chosen so that random guessing has under a 1% success rate. This design buys objective, manual-grading-free scoring at the cost of measuring answer-finding rather than proof-construction.
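To make the construct concrete, here is a minimal sketch of exact-match grading for numeric answers. It is an illustration only, not Epoch's harness: the function names `parse_exact` and `grade` are hypothetical, and it uses the Python standard library's `fractions` module rather than the SymPy-based verification the paper describes, so it covers integers and rationals but not symbolic expressions.

```python
from fractions import Fraction


def parse_exact(text: str) -> Fraction:
    """Parse an integer, decimal, or 'p/q' string into an exact Fraction."""
    return Fraction(text.strip().replace(" ", ""))


def grade(submitted: str, reference: str) -> bool:
    """Return True only when the submitted answer equals the reference exactly."""
    try:
        return parse_exact(submitted) == parse_exact(reference)
    except (ValueError, ZeroDivisionError):
        # Unparseable or malformed answers are simply wrong; there is no partial credit.
        return False


print(grade("6/8", "3/4"))   # equivalent rationals match
print(grade("41", "42"))     # a near miss is still a miss
```

Note what the grader never sees: the reasoning that produced the answer. A correct final value passes whether it came from a sound derivation or a lucky heuristic, which is the sense in which the benchmark measures answer-finding rather than proof-construction.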

The distinction is load-bearing for what a score licenses. Fields medalist Richard Borcherds observed that the benchmark problems "aren't quite the same as coming up with original proofs," and FrontierMath author Greg Burnham notes that "a significant chunk of FrontierMath problems can be solved by applying advanced mathematical techniques in relatively straightforward ways," suspecting this accounts for much of leading models' performance (Burnham 2025). On this reading the benchmark indexes breadth of advanced mathematical *background* and reliable technical execution more than creative insight — the Fields-medalist panel suggested it is most valuable for gauging "routine technical work" (Epoch AI 2024). The interpretive caution generalises: capability measures can shift non-linearly with scale, since emergent abilities "cannot be predicted simply by extrapolating the performance of smaller models" ², so a single number is a fragile basis for inference. A high FrontierMath score is therefore evidence of competent terminal-answer derivation on research-flavoured problems, not of autonomous theorem-proving — a gap any governance inference drawn from the number must respect (composite editorial judgment).

Saturation and score trajectory

FrontierMath was published in November 2024 with frontier models at single-digit accuracy: "current state-of-the-art AI models solve under 2% of problems" ¹, a figure spanning GPT-4o, Claude 3.5 Sonnet, o1-preview and Gemini 1.5 Pro. The trajectory since has been steep. On 20 December 2024 OpenAI reported o3-preview at 25.2% — a >10x jump announced the same day the partnership behind the benchmark surfaced (OpenAI 2024; TechCrunch 2025-01-19). Epoch AI's own independent evaluation in April 2025 placed o4-mini (high reasoning) at 17% (±2%) and o3 at 10% (±2%) on the full set (Epoch AI 2025), illustrating that vendor-harness headline numbers and held-out re-evaluations can diverge materially.
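The ±2% margins quoted for the Epoch evaluation are roughly what sampling error alone would produce on a set of a few hundred problems. The sketch below computes a binomial standard error under two assumptions that the text does not confirm: that the evaluated set held about 300 problems, and that the quoted margin is one standard error.

```python
import math


def binomial_se(p: float, n: int) -> float:
    """Standard error of an observed accuracy p over n independent problems."""
    return math.sqrt(p * (1.0 - p) / n)


N_PROBLEMS = 300  # assumption: size of the evaluated set, not stated in the text

for model, acc in [("o4-mini (high)", 0.17), ("o3", 0.10)]:
    se = binomial_se(acc, N_PROBLEMS)
    print(f"{model}: {acc:.0%} +/- {se:.1%}")
```

With these inputs the standard errors come out near 2.2 and 1.7 percentage points, so the 7-point gap between o4-mini and o3 is several standard errors wide, while differences of a point or two between runs would not be distinguishable from noise.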

The rapid climb has practical consequences for the benchmark's shelf life. To preserve discriminating power, Epoch separated the corpus into Tiers 1–3 (300 problems, undergraduate-to-graduate) and added a Tier 4 expansion of 50 research-level problems "designed to vastly exceed the difficulty of even the Tier 3 problems," completed in June 2025, plus an Open Problems collection of unsolved questions (Epoch AI 2025). When the original FrontierMath set was unveiled, Terence Tao had described its problems as "extremely challenging" and predicted they would "resist AIs for several years at least" (VentureBeat, Nov 8 2024). Because such scores are increasingly read as gating signals, structured dangerous-capability evaluations on frontier models report "early warning signs" rather than decisive thresholds ³, and proposals to operationalise them stress that "government intervention will be needed" beyond voluntary reporting ⁴. Saturation on Tiers 1–3 thus measures progress against a moving, deliberately re-segmented target, and a single FrontierMath percentage is only interpretable alongside the tier and harness it was produced under (composite editorial judgment).

Contamination, access asymmetry, and gaming

FrontierMath's headline contamination defense is that its problems are original and unpublished, so they are unlikely to sit in training corpora — the basis for the "low contamination risk" framing, which echoes wider cautions that a model's defects are "inherited by all the adapted models downstream" ⁵. That safeguard was complicated by the disclosure that the benchmark's funder, OpenAI, also held privileged access to it. Epoch AI revealed OpenAI's funding only on 20 December 2024, alongside OpenAI's 25.2% o3 result, and many problem contributors were not told beforehand (TechCrunch 2025-01-19). OpenAI had "access to a large fraction of the problems and solutions," governed by a "verbal agreement" not to train on them, and Epoch's Tamay Besiroglu conceded the organisation "made a mistake" in not negotiating to disclose the relationship earlier (TechCrunch 2025-01-19).

Two mitigations followed. First, Epoch retained a held-out set the funder had not seen, enabling independent re-evaluation; lead mathematician Elliot Glazer noted Epoch "can't vouch for" the vendor figure "until our independent evaluation is complete" (TechCrunch 2025-01-19). The subsequent Epoch numbers (17% for o4-mini, 10% for o3) came in below the 25.2% headline. The access asymmetry this episode exposed is exactly the kind that motivates controlled "structured access" to evaluation artefacts rather than open release ⁶. Second, the auto-verifiable large-integer answer format is itself an anti-gaming device: with guessing success below 1%, brute-force and lucky shortcuts are largely foreclosed by construction ¹. A residual, harder-to-audit pathway remains, however: Burnham notes a funder could ensure relevant mathematics papers entered a model's training data without violating a no-train-on-problems pledge (Burnham 2025). The episode is now a standard case study in benchmark-governance transparency norms (composite editorial judgment).

Results & interpretation

Claimed scores

No claims have been recorded yet for this benchmark in the Policy Window catalog.

How to read this number

Contamination risk: low

Benchmark items are unlikely to appear in training corpora — scores are credible reflections of underlying capability.

What a high score does and does not establish. A score evidences performance on this benchmark’s specific construct under its specific format; it is not, on its own, evidence of general capability, reliable real-world task performance, or safety.

The second silence (evidence: thin). The evidence that a benchmark score predicts real-world deployment outcomes (construct-to-deployment validity) is sparse; benchmark performance and deployed performance are not established to be the same thing, and contamination can inflate the headline figure above true held-out ability.

Governance relevance

A benchmark measures a capability; governance attaches to the topics that capability bears on. These topic articles carry the instrument×dimension coverage matrix and the social-science so-what for this domain.

Foundation Models / GPAI: coverage matrix + does-governance-work evidence

References

Sources cited inline in the analysis (linked from the superscript markers), then the primary instrument sources behind the classifications.

1. arXiv:2411.04872
2. Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, et al. (2022) Emergent Abilities of Large Language Models, arXiv (cs.CL) / TMLR. arXiv:2206.07682. Documents 'emergent abilities' that appear only above a scale threshold and 'would not have been directly predicted by extrapolating' smaller models, a core governance unpredictability problem.
3. Mary Phuong, Matthew Aitchison, Elliot Catt, et al. (Google DeepMind) (2024) Evaluating Frontier Models for Dangerous Capabilities, arXiv (cs.LG). arXiv:2403.13793. Pilots dangerous-capability evaluations (persuasion, cyber, self-proliferation) on frontier models, finding 'early warning signs' but no strong present danger, which grounds evaluation-based gating.
4. Anderljung, Barnhart, Korinek, et al. (2023) Frontier AI Regulation: Managing Emerging Risks to Public Safety, arXiv. arXiv:2307.03718. Argues "industry self-regulation is an important first step" but "government intervention will be needed", proposing safety standards, registration and reporting, and compliance mechanisms.
5. Bommasani et al. (2021) On the Opportunities and Risks of Foundation Models, arXiv. arXiv:2108.07258. Defines foundation models and warns homogenization "demands caution, as the defects of the foundation model are inherited by all the adapted models downstream".
6. Toby Shevlane (2022) Structured access: an emerging paradigm for safe AI deployment, arXiv (cs.CY); The Oxford Handbook of AI Governance. arXiv:2201.05159. Proposes controlled, cloud-mediated 'structured access' to 'prevent dangerous AI capabilities from being widely accessible, whilst preserving access to AI capabilities that can be used safely'.
7. FrontierMath methodology

How to cite this benchmark

Use the primary methodology source for academic citations; reference the Policy Window article for the cross-model leaderboard.


@misc{policywindow-frontiermath,
  title  = {FrontierMath},
  author = {Policy Window},
  year   = {2024},
  howpublished = {FRONTIER-MATH (2024)},
  url    = {https://policywindow.org/wiki/frontiermath},
  note   = {Primary source: https://epochai.org/frontiermath}
}

Verify the year + paste-and-refine. Primary source linked in BibTeX/RIS note.


Persistent identifier: https://policywindow.org/wiki/frontiermath — committed-stable URL with content-versioning via ?asOf= (rollout pending per methodology §7). DOIs via Zenodo are on the roadmap.


