GPQA Diamond
GPQA-DIAMOND · General reasoning
GPQA Diamond is a general reasoning benchmark published in 2023. It measures graduate-level, Google-proof question answering in biology, chemistry, and physics. The Diamond subset consists of the 198 hardest items. Contamination risk: low.
What this benchmark measures
Graduate-level, Google-proof question answering in biology, chemistry, and physics. The Diamond subset consists of the 198 hardest items.
The questions are designed to be Google-proof: PhD-level experts in the relevant domain score about 65%, while non-experts who are free to search the web score about 34%.
Claimed scores
| Model | Score | Claim type | Reported | Citation |
|---|---|---|---|---|
| gemini-2.5-pro | 84 % accuracy | press release | 2025-05-20 | Google DeepMind announcement |
| claude-opus-4-7 | 79.6 % accuracy | vendor card | 2025-05-22 | Anthropic model card |
Interpretation guidance
Contamination risk: low
Benchmark items are unlikely to appear in training corpora, so scores are more likely to reflect underlying capability than memorization.
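With only 198 items, sampling noise alone can move a score by several points, which matters when comparing the claimed scores above. The sketch below estimates a 95% Wilson score interval for each claimed accuracy. It assumes each score is a single-run accuracy over all 198 items, which the cited sources may not guarantee (vendors sometimes average over repeated runs or use different prompting); the function name `accuracy_interval` is illustrative only.

```python
import math


def accuracy_interval(accuracy: float, n_items: int = 198, z: float = 1.96):
    """Wilson score interval for an accuracy measured on n_items questions.

    Assumption: the score is a single-run proportion of correct answers
    over all n_items questions. z = 1.96 corresponds to roughly 95% coverage.
    """
    denom = 1 + z**2 / n_items
    center = (accuracy + z**2 / (2 * n_items)) / denom
    half_width = (
        z
        * math.sqrt(accuracy * (1 - accuracy) / n_items + z**2 / (4 * n_items**2))
        / denom
    )
    return center - half_width, center + half_width


# Claimed scores taken from the table above.
claimed = [("gemini-2.5-pro", 0.84), ("claude-opus-4-7", 0.796)]

for model, score in claimed:
    low, high = accuracy_interval(score)
    print(f"{model}: {score:.1%} (95% interval roughly {low:.1%} to {high:.1%})")
```

Under these assumptions each interval spans roughly five percentage points on either side of the claimed score, and the two intervals overlap, so the gap between the two claims should be read with caution.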
How to cite this benchmark
Use the primary methodology source for academic citations; reference the Policy Window article for the cross-model leaderboard.
- Primary methodology: https://arxiv.org/abs/2311.12022
- Wiki article: https://policywindow.org/wiki/gpqa-diamond
Related benchmarks (general reasoning)
- MMLU · 2020 · high contamination
- MMLU-Pro · 2024 · medium contamination
- ARC-AGI v2 · 2024 · low contamination