GPQA Diamond

GPQA-DIAMOND · General reasoning

Live · 2023

GPQA Diamond is a general reasoning benchmark published in 2023 that measures graduate-level Google-Proof Q&A in biology, chemistry, and physics. The 'Diamond' subset comprises the 198 hardest items. Contamination risk: low.

What this benchmark measures

Graduate-level Google-Proof Q&A in biology, chemistry, and physics. The 'Diamond' subset comprises the 198 hardest items.

The questions are designed to be Google-proof: domain PhD students score ~65%, while non-experts searching the web score only ~34%.

Claimed scores

Model            Score            Claim type     Reported    Citation
gemini-2.5-pro   84 % accuracy    press release  2025-05-20  Google DeepMind announcement
claude-opus-4-7  79.6 % accuracy  vendor card    2025-05-22  Anthropic model card
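
With only 198 items, each claimed accuracy carries noticeable sampling uncertainty. The sketch below estimates an approximate 95% Wilson score interval for each claimed score. It assumes each score comes from a single pass over all 198 items with independently scored questions; neither vendor source is confirmed here to use that protocol, and the function name `wilson_interval` is illustrative only.

```python
import math


def wilson_interval(accuracy: float, n_items: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a proportion observed on n_items questions.

    Assumption: one independent pass over every item (not stated in the sources).
    """
    denom = 1 + z**2 / n_items
    centre = (accuracy + z**2 / (2 * n_items)) / denom
    half_width = (
        z
        * math.sqrt(accuracy * (1 - accuracy) / n_items + z**2 / (4 * n_items**2))
        / denom
    )
    return centre - half_width, centre + half_width


N_ITEMS = 198  # size of the Diamond subset, as stated in this article

# Claimed scores copied from the table above, expressed as fractions.
claimed = {"gemini-2.5-pro": 0.84, "claude-opus-4-7": 0.796}

for model, acc in claimed.items():
    low, high = wilson_interval(acc, N_ITEMS)
    print(f"{model}: {acc:.1%} (approx. 95% interval {low:.1%} to {high:.1%})")
```

Under these assumptions each interval spans roughly ten percentage points and the two intervals overlap, which suggests caution when reading the 4.4-point gap between the claimed scores as a firm ranking. If a vendor averaged several runs, the true uncertainty would be narrower than this sketch indicates.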

Interpretation guidance

Contamination risk: low

Benchmark items are unlikely to appear in training corpora, so scores are more likely to be credible reflections of underlying capability.

How to cite this benchmark

Use the primary methodology source for academic citations; reference the Policy Window article for the cross-model leaderboard.

Related benchmarks (general reasoning)

  • MMLU · 2020 · high contamination
  • MMLU-Pro · 2024 · medium contamination
  • ARC-AGI v2 · 2024 · low contamination

References

  1. GPQA Diamond methodology
  2. gemini-2.5-pro — 84 % accuracy (Google DeepMind announcement, 2025-05-20)
  3. claude-opus-4-7 — 79.6 % accuracy (Anthropic model card, 2025-05-22)

Generated from the Policy Window catalog at . Each claim cites the originating primary source.