ARC-AGI v2

Policy Window Editorial Board

ARC-AGI-V2 · General reasoning

Live · 2025

Last verified 2026-06-21

What it measures

Abstract reasoning over visual grids. Each task requires inferring the transformation rule from 2-3 examples.
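
A minimal sketch of what "inferring the transformation rule" means in practice. It assumes the public ARC JSON layout (a `train` list of input/output grid pairs and a `test` list, with grids as lists of integer rows). The toy task and the candidate rule `flip_horizontal` are hypothetical and far simpler than any real ARC-AGI-2 task; the point is only that a rule counts as found when it reproduces every demonstration exactly.

```python
from typing import Callable, Dict, List

Grid = List[List[int]]

# Hypothetical toy task in the assumed train/test layout; real tasks are much harder.
task: Dict[str, list] = {
    "train": [
        {"input": [[1, 0, 0], [2, 0, 0]], "output": [[0, 0, 1], [0, 0, 2]]},
        {"input": [[0, 3], [4, 0]], "output": [[3, 0], [0, 4]]},
    ],
    "test": [{"input": [[5, 0, 0, 0]]}],
}


def flip_horizontal(grid: Grid) -> Grid:
    """Candidate rule (illustrative): mirror each row left to right."""
    return [list(reversed(row)) for row in grid]


def consistent(rule: Callable[[Grid], Grid], demos: List[dict]) -> bool:
    """Accept a rule only if it reproduces every demonstration exactly."""
    return all(rule(d["input"]) == d["output"] for d in demos)


# Apply the rule to the test input only after it explains all demonstrations.
prediction = None
if consistent(flip_horizontal, task["train"]):
    prediction = flip_horizontal(task["test"][0]["input"])
print(prediction)
```

Exact-match checking is the design choice worth noting: a rule that gets most cells right on a demonstration is still rejected.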

v2 launched in March 2025 with harder tasks designed to remain unsolvable by pure pattern matching. The competition launched with a $1M public prize pool, with the Grand Prize reserved for scoring at least 85% on the private set.

Currency note (2026-06-21): the frontier has moved well past this article's earlier top figure of ~37.6% (Claude Opus 4.5, late 2025). The ARC-Prize-verified state of the art reached 54% at $30.57 per task (Poetiq, semi-private set, verified 2025-12-05). By June 2026, scores reported by vendors and aggregators on the public set cluster far higher: GPT-5.5 at ~85%, GPT-5.4 Pro at 83.3%, Gemini 3.1 Pro at 77.1%, and Claude Opus 4.7 Adaptive at 75.8%, all per BenchLM. These are public-set reports, not ARC-Prize-verified results. The $700K Grand Prize (private set, >=85% under an efficiency constraint) remains unclaimed, and ARC Prize 2026 now offers $2M in total.

Construct & what it actually measures

ARC-AGI-2 is positioned by its authors as a measure of fluid intelligence — the capacity to acquire and apply novel skills efficiently rather than to retrieve memorised ones — operationalised through input-output grid puzzles whose transformation rule must be inferred from a handful of demonstrations ¹. The v2 redesign narrows the construct relative to ARC-AGI-1 along four task families the authors found current systems struggle with: multi-rule compositional reasoning ("multiple simultaneous rules... interacting with each other"), multi-step compositional reasoning (where the state after step N depends on step N−1), contextual rule application (a rule whose application is modulated by specific contextual cues), and in-context symbol definition, where a symbol's meaning is fixed only within the task — described as "a major challenge for frontier AI systems" ¹.

The construct-validity caveat is that ARC-AGI-2 measures few-shot inductive rule-finding over a deliberately abstract, low-prior visual-grid domain; it is not a direct measure of "general intelligence" despite the name, and the authors are explicit that intelligence is defined by the efficiency of skill acquisition, not score alone ¹. This framing reflects a wider unease about reading single-benchmark scores as general capability: emergent few-shot abilities can appear abruptly with scale and "would not have been directly predicted by extrapolating" smaller models ², so a high ARC-AGI-2 score evidences efficient novel-rule induction in this specific format — a narrower claim than general or deployment-relevant capability. (Editorial synthesis of the cited primary sources.)

Saturation & score trajectory

ARC-AGI-2 launched in March 2025 explicitly to re-open headroom after ARC-AGI-1 was effectively saturated. At release, pure (non-reasoning) LLMs scored 0%, and frontier reasoning systems sat in the low single digits: the paper's Table 1 reports o3 (Medium) at 3.0% on the semi-private set — versus 53.0% for the same system on ARC-AGI-1 — with o3-mini (High) also 3.0%, the 2024 ARChitects entry 2.5%, and Claude 3.7 at 0.9% ¹. Over 2025 the frontier climbed but remained well short of the human panel, for which 100% of retained tasks are solvable by at least two people within two attempts and the average individual human scores roughly 60% ¹.
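
The human-solvability criterion above (a task is retained only if at least two people solve it within two attempts) can be sketched as a simple filter. The data layout, the function name `is_retained`, and every value below are hypothetical illustrations, not the authors' pipeline or their data.

```python
from typing import Dict, List, Optional

# For each task: the attempt number on which each tester solved it (None = never solved).
# All values are hypothetical illustration data, not figures from the paper.
attempts_needed: Dict[str, List[Optional[int]]] = {
    "task_a": [1, 2, None],
    "task_b": [3, None, 1],
    "task_c": [2, 2, 1],
}


def is_retained(attempts: List[Optional[int]],
                min_solvers: int = 2,
                max_attempts: int = 2) -> bool:
    """Keep a task only if enough testers solved it within the attempt budget."""
    solvers = sum(1 for a in attempts if a is not None and a <= max_attempts)
    return solvers >= min_solvers


retained = sorted(t for t, a in attempts_needed.items() if is_retained(a))
print(retained)
```

Under this filter every retained task is human-solvable by construction, which is why the panel ceiling is 100% even though an average individual scores well below that.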

That persisting gap matters for governance because capability gains have repeatedly proven hard to forecast from scale alone. Power-law scaling of loss with model size, data, and compute ³ underpins the hope that scores glide upward predictably, yet performance on hard tasks can instead jump discontinuously ², and compute-only extrapolation is itself unreliable once data and parameters must scale together ⁴. The trajectory described here uses only figures attributable to ARC Prize's reporting and the paper. Unlike v1, ARC-AGI-2 was not approaching ceiling as of late 2025: the best Kaggle private-set entry reached only ~24%, and the highest verified semi-private score (54%) remained below the roughly 60% average-human baseline and far below the 100% panel ceiling (ARC Prize 2025 Results and Analysis). Saturation here would imply systems matching human few-shot rule-induction efficiency, not merely high accuracy at any cost.
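
For orientation, the functional forms behind the scaling claims can be written out. This is a sketch of the general shape reported in the cited scaling papers (refs. 3 and 4), with N the parameter count and D the number of training tokens; the constants N_c, D_c, E, A, B and the exponents are fitted quantities whose values are not given in this article and are left symbolic here.

```latex
% Single-factor power laws in model size N and dataset size D (form reported in ref. 3)
L(N) = \left(\frac{N_c}{N}\right)^{\alpha_N}, \qquad
L(D) = \left(\frac{D_c}{D}\right)^{\alpha_D}

% Joint parametric fit in N and D (form used in ref. 4)
L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}
```

Both expressions describe training loss, not benchmark accuracy. A smooth curve in L therefore does not by itself predict when a score on a task such as ARC-AGI-2 will move, which is the forecasting gap the paragraph above points to.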

Contamination & gaming resistance

ARC-AGI-2's design responds directly to a documented gaming failure of ARC-AGI-1: brute-force program search. Chollet et al. report that "49% of the Private Evaluation set was successfully solved by at least one team" using brute-force search techniques, even though the winning 2020 entry scored only 20% — a gap showing the benchmark was beatable by computationally intensive search rather than genuine reasoning ¹. ARC-AGI-2 was therefore engineered to be "less brute-forcible," minimising "susceptibility to naive or computationally intensive brute-force program search" ¹.
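
To make "brute-force program search" concrete, the sketch below enumerates every sequence of grid primitives up to a fixed depth and returns the first one consistent with all demonstrations. The four-primitive vocabulary, the names `run` and `brute_force`, and the toy demonstrations are hypothetical; actual competition entries are not described in this article and used far larger search spaces.

```python
from itertools import product
from typing import Callable, Dict, List, Optional, Tuple

Grid = List[List[int]]

# Hypothetical four-primitive vocabulary, for illustration only.
PRIMITIVES: Dict[str, Callable[[Grid], Grid]] = {
    "identity": lambda g: [row[:] for row in g],
    "flip_h": lambda g: [row[::-1] for row in g],
    "flip_v": lambda g: [row[:] for row in g[::-1]],
    "transpose": lambda g: [list(col) for col in zip(*g)],
}


def run(program: Tuple[str, ...], grid: Grid) -> Grid:
    """Apply the named primitives to the grid in order."""
    for name in program:
        grid = PRIMITIVES[name](grid)
    return grid


def brute_force(demos: List[Tuple[Grid, Grid]],
                max_depth: int = 3) -> Optional[Tuple[str, ...]]:
    """Return the first shortest primitive sequence matching every demo, else None."""
    for depth in range(1, max_depth + 1):
        for program in product(PRIMITIVES, repeat=depth):
            if all(run(program, x) == y for x, y in demos):
                return program
    return None


# Toy demonstrations: each output is the input rotated 90 degrees clockwise.
demos = [
    ([[1, 2], [3, 4]], [[3, 1], [4, 2]]),
    ([[1, 2, 3]], [[1], [2], [3]]),
]
found = brute_force(demos)
print(found)
```

The number of candidates grows as the vocabulary size raised to the search depth, so this approach succeeds through compute rather than insight. Tasks that need many interacting rules or in-task symbol definitions push the required depth and vocabulary beyond what such enumeration can cover cheaply.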

The benchmark also mitigates training-data contamination through a tiered set structure — public (120 tasks), semi-private (120, for the live Kaggle leaderboard), and private (120, for the final contest) — so that headline figures are reported on tasks the model has not seen ¹. Such held-out evaluation is increasingly treated as a precondition for trustworthy capability claims: rigorous, leakage-resistant testing is exactly what frontier dangerous-capability pilots rely on to read "early warning signs" rather than artefacts of memorised data ⁵, and standardised model-reporting practice presses evaluators to disclose intended use and evaluation conditions alongside any headline number ⁶. Critically, the authors add an efficiency (cost-per-task) axis precisely so that unbounded compute cannot game the score: a system that solves tasks only at extreme cost (e.g. refinement pipelines reported around $30/task for ~54%) is distinguished from cheaper entries on the cost-versus-score matrix ¹. (Editorial synthesis; late-2025 cost figures attributed to ARC Prize reporting.)
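
One common way to read a cost-versus-score matrix is to keep only the entries that no other entry beats on both axes (a Pareto frontier). The sketch below assumes that reading; it is not presented as ARC Prize's own method. Only the first entry uses figures quoted in this article ($30.57 per task, 54%); the other entries, and the `Entry` and `pareto_frontier` names, are hypothetical.

```python
from typing import List, NamedTuple


class Entry(NamedTuple):
    name: str
    cost_per_task: float  # USD per task
    score: float          # percent of tasks solved


# Only the first row uses figures quoted in this article; the rest are hypothetical.
entries = [
    Entry("refinement pipeline (article figure)", 30.57, 54.0),
    Entry("hypothetical cheap entry", 0.20, 24.0),
    Entry("hypothetical mid entry", 5.00, 20.0),
    Entry("hypothetical costly entry", 80.00, 50.0),
]


def pareto_frontier(items: List[Entry]) -> List[Entry]:
    """Keep entries that no other entry matches or beats on both cost and score."""
    frontier = []
    for e in items:
        dominated = any(
            o.cost_per_task <= e.cost_per_task
            and o.score >= e.score
            and (o.cost_per_task < e.cost_per_task or o.score > e.score)
            for o in items
        )
        if not dominated:
            frontier.append(e)
    return sorted(frontier, key=lambda e: e.cost_per_task)


for e in pareto_frontier(entries):
    print(f"{e.name}: ${e.cost_per_task:.2f}/task, {e.score:.1f}%")
```

On this reading, a higher score bought at a much higher cost stays on the frontier but is visibly separated from cheaper entries, while an entry that is both costlier and weaker than another drops out.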

Results & interpretation

Claimed scores

No claims have been recorded yet for this benchmark in the Policy Window catalog.

How to read this number

Contamination risk: low

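Items in the semi-private and private sets are unlikely to appear in training corpora, so scores on those sets are credible reflections of underlying capability. Public-set scores carry more contamination risk and should be read with that in mind.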

What a high score does and does not establish. A score evidences performance on this benchmark’s specific construct under its specific format; it is not, on its own, evidence of general capability, reliable real-world task performance, or safety.

The second silence (evidence: thin). The evidence that a benchmark score predicts real-world deployment outcomes (construct-to-deployment validity) is sparse; benchmark performance and deployed performance are not established to be the same thing, and contamination can inflate the headline figure above true held-out ability.

Governance relevance

A benchmark measures a capability; governance attaches to the topics that capability bears on. These topic articles carry the instrument×dimension coverage matrix and the social-science so-what for this domain.

Foundation Models / GPAI: coverage matrix and does-governance-work evidence

References

Sources cited inline in the analysis (linked from the superscript markers), then the primary instrument sources behind the classifications.

1. Chollet et al. (2025) ARC-AGI-2: A New Challenge for Frontier AI Reasoning Systems. arXiv:2505.11831
2. Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, et al. (2022) Emergent Abilities of Large Language Models, arXiv (cs.CL) / TMLR. arXiv:2206.07682. Documents 'emergent abilities' that appear only above a scale threshold and 'would not have been directly predicted by extrapolating' smaller models, a core governance unpredictability problem.
3. Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, Dario Amodei (2020) Scaling Laws for Neural Language Models, arXiv (cs.LG). arXiv:2001.08361. Establishes that model 'loss scales as a power-law with model size, dataset size, and the amount of compute', the empirical basis for compute-threshold regulation of foundation models.
4. Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, et al. (DeepMind) (2022) Training Compute-Optimal Large Language Models, arXiv (cs.CL); NeurIPS 2022. arXiv:2203.15556. The 'Chinchilla' study shows 'model size and the number of training tokens should be scaled equally', complicating compute-only regulatory thresholds.
5. Mary Phuong, Matthew Aitchison, Elliot Catt, et al. (Google DeepMind) (2024) Evaluating Frontier Models for Dangerous Capabilities, arXiv (cs.LG). arXiv:2403.13793. Pilots dangerous-capability evaluations (persuasion, cyber, self-proliferation) on frontier models, finding 'early warning signs' but no strong present danger, which grounds evaluation-based gating.
6. Mitchell et al. (2019) Model Cards for Model Reporting, FAccT '19. arXiv:1810.03993
7. ARC-AGI v2 methodology

How to cite this benchmark

Use the primary methodology source for academic citations; reference the Policy Window article for the cross-model leaderboard.

@misc{policywindow-arc-agi-v2,
  title  = {ARC-AGI v2},
  author = {Policy Window},
  year   = {2025},
  howpublished = {ARC-AGI-V2 (2025)},
  url    = {https://policywindow.org/wiki/arc-agi-v2},
  note   = {Primary source: https://arcprize.org/}
}

Persistent identifier: https://policywindow.org/wiki/arc-agi-v2 — committed-stable URL with content-versioning via ?asOf= (rollout pending per methodology §7). DOIs via Zenodo are on the roadmap.
