AIME 2024

Policy Window Editorial Board


AIME-2024 · Mathematical reasoning

Live · 2024


Last verified 2026-06-21


What it measures

30 problems from the 2024 American Invitational Mathematics Examination — high-school competition math.

Released after most then-current models' training cutoffs. Top reasoning models score 75-90%; non-reasoning models score 10-30%. Currency note (2026-06-21): the frontier has reached or passed the 96.7% o3 figure that tops this article. GPT-5 (~95.7%), Grok 4 (~94.3%), and Gemini 3 Deep Think (98-99%) now lead AIME leaderboards, reinforcing the near-saturation thesis. The table could add a row for post-2024 models, with the caution from the existing iter-449f audit note that many headline figures (e.g. OpenAI 94.6%, Grok 4 100%) are for AIME 2025, not AIME 2024.

Construct & what it actually measures

AIME 2024 is widely read as a measure of multi-step mathematical reasoning, but its scoring construct is narrower than that framing implies. Each of the 30 problems is graded on a single integer answer in [0, 999], with no inspection of the intervening derivation. Final-answer matching therefore conflates sound reasoning with two confounds: arriving at the correct number through flawed or incomplete logic, and the non-trivial base rate of guessing within a bounded integer range. The gap is large where it has been measured directly: when frontier models' full proofs on the 2025 USA Mathematical Olympiad were graded by expert humans rather than by final answer, only Gemini 2.5 Pro reached a non-trivial 25% and all other models scored under 5%, despite the same systems posting high answer-only accuracy on AIME-style tasks ¹.

The construct is also brittle to surface form. VAR-MATH symbolically replaces the numeric constants in AIME24 items with variables that preserve difficulty; reinforcement-learning-trained models' accuracy fell by an average of 58.3% on these AIME24 isomorphs (and 48.0% on the parallel AMC23 set), indicating that much measured "reasoning" tracks memorized numeric surface statistics rather than transferable procedure ². Finally, with only 30 items, a single problem moves a score by 3.3 points; reported pass@1 standard deviations of several percentage points across random seeds make small leaderboard differences statistically indistinguishable ³. Editorial note: these are construct caveats, not a claim that AIME measures nothing.
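The small-sample point can be made concrete with a short calculation. The sketch below is illustrative only: it assumes the 30 items behave like independent Bernoulli trials, which is a simplification, and the function names are hypothetical. It shows the per-item step size and the item-sampling standard error of a score, which is a different quantity from the seed-to-seed standard deviation reported in the cited work.

```python
import math

N_ITEMS = 30  # AIME 2024 has 30 problems


def item_step(n_items: int = N_ITEMS) -> float:
    """Percentage points that a single problem contributes to the score."""
    return 100.0 / n_items


def binomial_se(accuracy: float, n_items: int = N_ITEMS) -> float:
    """Standard error of an accuracy estimate, in percentage points.

    Assumption: items are independent Bernoulli trials with the same
    success probability. Real benchmark items are not identical, so
    treat this as a rough guide rather than an exact interval.
    """
    return 100.0 * math.sqrt(accuracy * (1.0 - accuracy) / n_items)


print(round(item_step(), 1))        # 3.3
print(round(binomial_se(0.8), 1))   # 7.3
```

Under these assumptions, a model at 80% carries an item-sampling uncertainty of about 7 points, so two models a few points apart cannot be separated on this set alone.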

Saturation & score trajectory

Frontier scores on AIME 2024 climbed from near-floor to near-ceiling within roughly a year, driven by the shift from general-purpose models to inference-time-reasoning models. GPT-4o, a strong non-reasoning model, solved on average about 12% of the 2024 problems (reported as 13.4% pass@1) (OpenAI, "Learning to Reason with LLMs," 2024-09-12). The same release reported OpenAI o1 at 74.4% pass@1, rising to 83.3% with majority vote over 64 samples and ~93% with learned re-ranking over 1,000 samples: a same-release gap of roughly 60 points over GPT-4o at pass@1 on the same items. DeepSeek-R1 then reported 79.8% pass@1, with its base model DeepSeek-V3 at 39.2% and OpenAI o1-1217 at 79.2% ⁴. OpenAI reported o3 at 96.7% (OpenAI, o3 announcement, 2024-12 / 2025-04).
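The difference between pass@1 and majority-vote figures can be illustrated with a minimal sketch. The sampled answers below are made up for illustration and do not come from any model, and the function names are hypothetical. Learned re-ranking is not shown because it depends on a scoring model.

```python
from collections import Counter


def pass_at_1(samples: list[int], truth: int) -> float:
    """Mean single-sample accuracy: fraction of samples equal to the truth."""
    return sum(s == truth for s in samples) / len(samples)


def majority_vote(samples: list[int]) -> int:
    """Most common answer among the samples.

    Ties go to the answer that appeared first, which is how
    Counter.most_common orders equal counts.
    """
    return Counter(samples).most_common(1)[0][0]


# Hypothetical samples for one problem whose correct answer is 204.
samples = [204, 204, 113, 204, 25, 113, 204, 204]
truth = 204

print(pass_at_1(samples, truth))        # 0.625
print(majority_vote(samples) == truth)  # True
```

In this toy case the problem counts as 62.5% solved under pass@1 but fully solved under majority vote, which is why the two kinds of figures should not be compared directly.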

That the discontinuity tracks a paradigm shift rather than steady scaling is consistent with the observation that some capabilities surface only above a threshold and "would not have been directly predicted by extrapolating" smaller models ⁵, even though loss itself "scales as a power-law with model size, dataset size, and the amount of compute" ⁶. The implication of near-saturation is that AIME 2024 has limited remaining discriminative power at the frontier: once leading models cluster in the 80-97% band, score differences are increasingly dominated by sampling variance and contamination (see below) rather than capability gaps. This is why evaluation has migrated toward forward-only, freshly released contests (AIME 2025, MathArena) and proof-graded olympiad sets (MathArena 2025). Figures here are vendor- or paper-reported and mix pass@1 and aggregated decoding strategies, which are not directly comparable; read each row with its claim type.

Contamination & gaming

AIME 2024's at-a-glance "low contamination risk" rests on the timing argument — the 2024 contest fell after many models' stated training cutoffs — but a body of subsequent work argues the risk is materially higher in practice, because the problems and worked solutions circulated widely online and entered later web-scale corpora and RL post-training sets ⁷. Wu et al. show that for contamination-susceptible series such as Qwen2.5, even random or incorrect RL reward signals can produce apparent gains on AIME, MATH-500 and AMC, whereas on their leakage-free RandomCalculation benchmark only accurate rewards improve over the base model — a signature of memorized test items rather than learned reasoning ⁸.

MathArena reports "strong signs of contamination in AIME 2024" and finds that models exceed the human 1% quantile by 10-20 points on the 2024 set while their 2025-contest scores align with human expectations, consistent with inflation on the older, more-circulated items ⁹. The symbolic-variabilization result above (an average −58.3% on VAR-AIME24) is corroborating evidence that surface familiarity, not generalization, carries part of the score ². The standard mitigations are forward-only evaluation on freshly released contests (the explicit MathArena design and the rationale for the parallel AIME 2025 set) and structural perturbation (VAR-MATH) ². Editorial judgment: the "low risk" label is defensible only under the narrow timing definition; under behavioral and perturbation tests the benchmark shows contamination-consistent inflation, so AIME 2024 scores should be read as an upper bound on reasoning capability.
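One way a structural-perturbation check of this kind could be scored is sketched below. This is an assumption-laden toy, not the VAR-MATH implementation: the templated item, the two stand-in "models", and the all-variants-correct rule are all hypothetical choices made for illustration.

```python
from typing import Callable, Iterable


def ground_truth(n: int) -> int:
    """Toy templated item: sum of the first n positive integers, mod 1000."""
    return (n * (n + 1) // 2) % 1000


def memorizer(n: int) -> int:
    """Hypothetical model that recalls only the original instance (n = 100)."""
    return 50  # 5050 mod 1000, the answer to the original instance


def reasoner(n: int) -> int:
    """Hypothetical model that actually carries out the procedure."""
    return (n * (n + 1) // 2) % 1000


def consistent_solve(model: Callable[[int], int], variants: Iterable[int]) -> bool:
    """Count the item as solved only if every variant is answered correctly."""
    return all(model(n) == ground_truth(n) for n in variants)


variants = [100, 37, 64]  # original constant plus two perturbed ones

print(memorizer(100) == ground_truth(100))    # True: passes the original item
print(consistent_solve(memorizer, variants))  # False: fails under perturbation
print(consistent_solve(reasoner, variants))   # True
```

A model that matches only the original constants passes the unperturbed item and fails the consistency check, which is the pattern a large accuracy drop under perturbation would be consistent with.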

Results & interpretation

Claimed scores

No claims have been recorded yet for this benchmark in the Policy Window catalog.

How to read this number

Contamination risk: low

Benchmark items are unlikely to appear in training corpora — scores are credible reflections of underlying capability.

What a high score does and does not establish. A score evidences performance on this benchmark’s specific construct under its specific format; it is not, on its own, evidence of general capability, reliable real-world task performance, or safety.

The second silence (evidence: thin). The evidence that a benchmark score predicts real-world deployment outcomes (construct-to-deployment validity) is sparse; benchmark performance and deployed performance are not established to be the same thing, and contamination can inflate the headline figure above true held-out ability.

Governance relevance

A benchmark measures a capability; governance attaches to the topics that capability bears on. These topic articles carry the instrument×dimension coverage matrix and the social-science so-what for this domain.

Foundation Models / GPAI: coverage matrix + does-governance-work evidence

References

Sources cited inline in the analysis (linked from the superscript markers), then the primary instrument sources behind the classifications.

1. arXiv:2503.21934
2. arXiv:2507.12885
3. arXiv:2504.07086
4. arXiv:2501.12948
5. Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, et al. (2022) Emergent Abilities of Large Language Models, arXiv (cs.CL) / TMLR. arXiv:2206.07682. Documents 'emergent abilities' that appear only above a scale threshold and 'would not have been directly predicted by extrapolating' smaller models, a core governance unpredictability problem.
6. Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, Dario Amodei (2020) Scaling Laws for Neural Language Models, arXiv (cs.LG). arXiv:2001.08361. Establishes that model 'loss scales as a power-law with model size, dataset size, and the amount of compute', the empirical basis for compute-threshold regulation of foundation models.
7. arXiv:2510.02386
8. arXiv:2507.10532
9. arXiv:2505.23281
10. AIME 2024 methodology

How to cite this benchmark

Use the primary methodology source for academic citations; reference the Policy Window article for the cross-model leaderboard.


@misc{policywindow-aime-2024,
  title  = {AIME 2024},
  author = {Policy Window},
  year   = {2024},
  howpublished = {AIME-2024 (2024)},
  url    = {https://policywindow.org/wiki/aime-2024},
  note   = {Primary source: https://www.maa.org/math-competitions/american-invitational-mathematics-examination-aime}
}

Verify the year before use, then paste and refine as needed. The primary source is linked in the BibTeX/RIS note field.


Persistent identifier: https://policywindow.org/wiki/aime-2024 — committed-stable URL with content-versioning via ?asOf= (rollout pending per methodology §7). DOIs via Zenodo are on the roadmap.


