GPQA Diamond

Policy Window Editorial Board

GPQA-DIAMOND · General reasoning

Live · 2023

Last verified 2026-06-21

What it measures

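Graduate-level Google-Proof Q&A in biology, chemistry, and physics. The 'Diamond' subset is the 198 highest-agreement hard items: those that both expert annotators answered correctly and a majority of non-experts answered wrongly.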

Designed to be Google-proof: on these questions domain PhD students score ~65%, while non-expert searchers score ~34%.

Currency note (2026-06-21): the thesis (saturated as a discriminator, with the frontier clustered in the low-to-mid 90s) remains current, and the named figures are still valid. The frontier has edged past the cited Gemini 3.1 Pro Preview (94.1%) and GPT-5.5 (~93%), with Claude Opus 4.7 at ~94.2% and the leaderboard at ~94.6%. Artificial Analysis has down-weighted GPQA Diamond to ~6.25% of its Intelligence Index v4.0, because top models cluster within 1-2 points of each other.

Construct and what it actually measures

GPQA's design intent is sharper than "graduate-level science Q&A": it is an attempt to operationalize *expert-discriminating, non-retrievable* knowledge. The validity evidence the authors offer is a gap, not a single score. Domain PhDs (or PhD students) in the matching field reach 65% accuracy — 74% after discounting mistakes the experts themselves identified in retrospect — while highly skilled non-experts reach only 34%, despite spending on average over 30 minutes per question with unrestricted web access ¹. That ~31-point expert/non-expert spread under open-book conditions is the benchmark's core construct-validity claim: the items index field-specific expertise rather than search skill or general literacy. The Diamond subset tightens this further — its 198 items are precisely those both expert annotators answered correctly *and* a majority of non-experts answered wrongly ¹, maximizing inter-expert agreement and expert/non-expert separation.
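The Diamond selection rule described above can be written as a simple filter. The following is a minimal sketch, not the authors' pipeline: the `Item` record, its field names, and the toy data are hypothetical, and only the rule itself (both experts correct, a majority of non-experts wrong) comes from the text.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Item:
    """Hypothetical per-question validation record (field names are assumptions)."""
    item_id: str
    expert_correct: Tuple[bool, bool]    # outcome for each of the two expert annotators
    nonexpert_correct: Tuple[bool, ...]  # outcome for each non-expert annotator


def is_diamond(item: Item) -> bool:
    """True if both experts were right AND a strict majority of non-experts were wrong."""
    both_experts_right = all(item.expert_correct)
    n_wrong = sum(1 for ok in item.nonexpert_correct if not ok)
    majority_nonexperts_wrong = n_wrong > len(item.nonexpert_correct) / 2
    return both_experts_right and majority_nonexperts_wrong


def diamond_subset(items: List[Item]) -> List[Item]:
    """Keep only the items that satisfy the Diamond rule."""
    return [it for it in items if is_diamond(it)]


# Toy data, invented purely for illustration.
toy_items = [
    Item("q1", (True, True), (False, False, True)),    # kept
    Item("q2", (True, False), (False, False, False)),  # dropped: one expert was wrong
    Item("q3", (True, True), (True, True, False)),     # dropped: most non-experts were right
]
print([it.item_id for it in diamond_subset(toy_items)])
```

The filter makes the construct explicit: an item survives only when it separates experts from non-experts, which is why the subset maximizes that separation by construction.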

The construct gap worth flagging for governance readers is that a high model score is taken as evidence of "expert-level reasoning," but the format only certifies *answer selection* on multiple-choice items, not the derivation. This is the recurring hazard of inferring latent capability from benchmark scores: structured dangerous-capability evaluations are built precisely because aggregate scores are weak proxies for what a system can actually do ², and the relationship between scale-driven score gains and qualitatively new abilities is itself unpredictable ³. The benchmark's own creator has since cautioned that when a model scores 85%, it is ambiguous whether it is reasoning through novel problems "or has it seen enough similar problems in training that it's doing something closer to pattern-matched retrieval" (Rein, as reported by MindStudio 2025). The number measures graded-difficulty scientific QA performance; the leap to "capability" is an inference, not a measurement.

Saturation and score trajectory

GPQA Diamond moved from frontier challenge to near-ceiling in under two years. At release the strongest GPT-4-based baseline reached only 39% ¹ — below the ~70% PhD-expert baseline OpenAI later measured (69.7%; OpenAI o1 announcement 2024). The inflection came with reasoning models: OpenAI's o1 scored 78.3%, the first system reported to surpass the expert baseline (OpenAI 2024), and o3 reached 87.7% later that year (OpenAI o3 announcement, Dec 2024). By 2025-2026 frontier systems cluster in the low-to-mid 90s — e.g., Gemini 3.1 Pro Preview at 94.1% and GPT-5.5 at ~93% on the Artificial Analysis leaderboard (2026). Such jumps are consistent with two well-documented dynamics of scaled models: performance that improves as a power-law with model size, data, and compute ⁴, punctuated by abrupt gains on specific tasks that do not extrapolate smoothly from smaller systems ³.

The implication is that GPQA Diamond has largely saturated as a *discriminating* instrument at the frontier. With a 198-item set, a one-question swing is ~0.5 percentage points, and the sampling standard error of a score near 94% is roughly 1.7 points if items are treated as independent draws, so the 1-2 point differences among top models fall inside measurement noise and inter-run variance. The benchmark's creator concurs, noting models in "the 80s and 90s" caused it to "stop discriminating between good and great," and describing GPQA as "a stepping stone, not a destination" (Rein, MindStudio 2025). For policy use, this means recent near-ceiling scores certify that the *capability frontier has cleared* this bar rather than ranking systems against each other.
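The arithmetic behind the noise claim can be checked directly. This is a rough sketch that assumes each of the 198 items behaves like an independent Bernoulli trial, which is a simplification; the function names are illustrative and not part of any benchmark tooling.

```python
import math

N_ITEMS = 198  # size of the Diamond subset, from the text


def one_item_swing_pp(n: int = N_ITEMS) -> float:
    """Change in the score, in percentage points, from one more correct answer."""
    return 100.0 / n


def binomial_se_pp(accuracy: float, n: int = N_ITEMS) -> float:
    """Binomial standard error of an accuracy estimate, in percentage points."""
    return 100.0 * math.sqrt(accuracy * (1.0 - accuracy) / n)


def wilson_interval(accuracy: float, n: int = N_ITEMS, z: float = 1.96):
    """Wilson score interval for a proportion (z=1.96 gives roughly 95% coverage)."""
    denom = 1.0 + z * z / n
    center = (accuracy + z * z / (2 * n)) / denom
    half = z * math.sqrt(accuracy * (1 - accuracy) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


print(f"one-item swing: {one_item_swing_pp():.2f} pp")
print(f"standard error at 94%: {binomial_se_pp(0.94):.2f} pp")
low, high = wilson_interval(0.941)
print(f"approx. 95% interval for a 94.1% score: {100 * low:.1f}% to {100 * high:.1f}%")
```

Under these assumptions the interval around a single mid-90s score spans several points, which is wider than the 1-2 point spread the article reports among top models.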

Contamination, format sensitivity, and gaming

GPQA was engineered against contamination — "Google-proof" items, written by experts and partly withheld, so that score gains should reflect capability rather than memorized text. The Diamond subset is the highest-objectivity slice (198 items both expert annotators got right and most non-experts missed), and the authors gate the gold set on inter-expert agreement ¹. This is why the Policy Window catalog rates its contamination risk as low. But the creator stresses the protection is not permanent: "any fixed benchmark eventually gets trained against, either explicitly through data contamination or implicitly through general capability improvements" (Rein, MindStudio 2025) — the rationale for vetted/withheld variants of difficult benchmarks generally.

Two measurement caveats also bear on how reported gains should be read. First, format sensitivity: multiple-choice scoring on GPQA Diamond does shift with answer-option ordering and prompt phrasing, but a systematic study across twelve prompt templates concludes this variation is "more an artifact of evaluation than a flaw in the models" — once rigid string-matching is replaced by LLM-as-a-judge scoring, modern LLMs are "more robust to prompt templates than previously believed," so most format-driven movement does not reflect a genuine reasoning deficit ⁵. Second, small-set variance: because the set is only 198 items, run-to-run and seed-to-seed fluctuation can rival the spread between adjacent frontier models, a documented route to "strategic overclaiming" through favorable evaluation design ⁶. The label quality itself holds up — independent review near saturation found ~90-95% of items valid, with only roughly 2-3 of 198 seriously ambiguous (review summarized by IntuitionLabs 2025) — so the residual frontier gap is mostly genuine difficulty rather than flawed keys.
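To see why small-set variance can swamp the gap between adjacent models, one can compare two reported scores with an unpaired two-proportion z-statistic. This is a sketch under stated assumptions: items are treated as independent trials, and the comparison is unpaired because per-item results are not given in this article. A paired test such as McNemar's would be more appropriate if item-level outcomes were available, and the unpaired version is likely conservative when two models tend to miss the same items.

```python
import math


def gap_z_score(acc_a: float, acc_b: float, n: int = 198) -> float:
    """Unpaired z-statistic for the difference between two accuracies on n items each."""
    var_a = acc_a * (1.0 - acc_a) / n
    var_b = acc_b * (1.0 - acc_b) / n
    return (acc_a - acc_b) / math.sqrt(var_a + var_b)


def min_detectable_gap_pp(acc: float, n: int = 198, z: float = 1.96) -> float:
    """Smallest gap (percentage points) reaching |z| >= z when both models score near acc."""
    return 100.0 * z * math.sqrt(2.0 * acc * (1.0 - acc) / n)


# Scores quoted in the article: 94.1% versus roughly 93%.
z_stat = gap_z_score(0.941, 0.93)
print(f"z for 94.1% vs 93%: {z_stat:.2f} (|z| < 1.96 means not distinguishable at ~95%)")
print(f"gap needed near 94% accuracy: {min_detectable_gap_pp(0.94):.1f} pp")
```

On this approximation, two models near 94% would need to differ by several points on a single run before the gap clears conventional significance, which is the quantitative form of the "strategic overclaiming" concern.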

Results & interpretation

Claimed scores

Model	Score	Claim type	Reported	Citation
gemini-2.5-pro	84 % accuracy	press release	2025-05-20	Google DeepMind announcement
claude-opus-4-7	79.6 % accuracy	vendor card	2025-05-22	Anthropic model card

How to read this number

Contamination risk: low

Benchmark items are unlikely to appear in training corpora — scores are credible reflections of underlying capability.

What a high score does and does not establish. A score evidences performance on this benchmark’s specific construct under its specific format; it is not, on its own, evidence of general capability, reliable real-world task performance, or safety.

The second silence (evidence: thin). The evidence that a benchmark score predicts real-world deployment outcomes (construct-to-deployment validity) is sparse; benchmark performance and deployed performance are not established to be the same thing, and contamination can inflate the headline figure above true held-out ability.

Governance relevance

A benchmark measures a capability; governance attaches to the topics that capability bears on. These topic articles carry the instrument×dimension coverage matrix and the social-science so-what for this domain.

Foundation Models / GPAI: coverage matrix and does-governance-work evidence

References

Sources cited inline in the analysis (linked from the superscript markers), then the primary instrument sources behind the classifications.

1. arXiv:2311.12022
2. Mary Phuong, Matthew Aitchison, Elliot Catt, et al. (Google DeepMind) (2024) Evaluating Frontier Models for Dangerous Capabilities, arXiv (cs.LG). arXiv:2403.13793. Pilots dangerous-capability evaluations (persuasion, cyber, self-proliferation) on frontier models, finding 'early warning signs' but no strong present danger, which grounds evaluation-based gating.
3. Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, et al. (2022) Emergent Abilities of Large Language Models, arXiv (cs.CL) / TMLR. arXiv:2206.07682. Documents 'emergent abilities' that appear only above a scale threshold and 'would not have been directly predicted by extrapolating' smaller models, a core governance unpredictability problem.
4. Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, Dario Amodei (2020) Scaling Laws for Neural Language Models, arXiv (cs.LG). arXiv:2001.08361. Establishes that model 'loss scales as a power-law with model size, dataset size, and the amount of compute', the empirical basis for compute-threshold regulation of foundation models.
5. arXiv:2509.01790
6. arXiv:2506.04734

Primary instrument sources:
gemini-2.5-pro: 84 % accuracy (Google DeepMind announcement, 2025-05-20)
claude-opus-4-7: 79.6 % accuracy (Anthropic model card, 2025-05-22)

How to cite this benchmark

Use the primary methodology source for academic citations; reference the Policy Window article for the cross-model leaderboard.

@misc{policywindow-gpqa-diamond,
  title  = {GPQA Diamond},
  author = {Policy Window},
  year   = {2023},
  howpublished = {GPQA-DIAMOND (2023)},
  url    = {https://policywindow.org/wiki/gpqa-diamond},
  note   = {Primary source: https://arxiv.org/abs/2311.12022}
}

Persistent identifier: https://policywindow.org/wiki/gpqa-diamond — committed-stable URL with content-versioning via ?asOf= (rollout pending per methodology §7). DOIs via Zenodo are on the roadmap.
