SWE-bench Verified
SWE-BENCH-VER · Agentic tasks
SWE-bench Verified is an agentic-task benchmark published in 2024. It measures whether a model can solve real-world GitHub issues drawn from 12 popular Python repositories. The 'Verified' subset is human-validated to remove ambiguous tasks and to ensure each task has working tests. Contamination risk: medium.
What this benchmark measures
The benchmark asks a model to solve real-world GitHub issues from 12 popular Python repositories. The 'Verified' subset is human-validated to remove ambiguous tasks and to ensure each task has working tests.
The Verified subset contains 500 tasks. Evaluation happens at run time, so a score cannot be gamed by pure memorisation, but the agent harness used to run the model affects results.
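As a rough illustration of how a "% solved" figure relates to the 500-task subset, the sketch below computes the share of tasks marked solved. The function name `percent_solved`, the task identifiers, and the run outcomes are hypothetical and are not taken from any official harness.

```python
def percent_solved(outcomes: dict[str, bool]) -> float:
    """Return the percentage of tasks whose outcome is True (solved)."""
    if not outcomes:
        raise ValueError("no task outcomes supplied")
    # True counts as 1 and False as 0 when summed.
    return 100.0 * sum(outcomes.values()) / len(outcomes)


# Hypothetical run: the first 392 of 500 tasks are marked solved.
outcomes = {f"task-{i}": i < 392 for i in range(500)}
print(f"{percent_solved(outcomes):.1f} % solved")
```

Because the denominator is fixed at 500, each solved task moves the score by 0.2 percentage points, which is worth keeping in mind when comparing close results from different harnesses.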
Claimed scores
| Model | Score | Claim type | Reported | Citation |
|---|---|---|---|---|
| claude-opus-4-7 | 78.4 % solved | vendor card | 2025-05-22 | Anthropic model card |
Interpretation guidance
Contamination risk: medium
Some test items may leak into training corpora; treat headline scores with mild skepticism and prefer evaluation runs with held-out subsets.
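One way to approximate a held-out subset is to keep only tasks created after a model's training cutoff. The sketch below assumes each task record carries a creation date; the field name `created`, the task identifiers, and the cutoff date are all hypothetical.

```python
from datetime import date


def held_out(tasks: list[dict], cutoff: date) -> list[dict]:
    """Keep tasks created strictly after the assumed training cutoff."""
    return [task for task in tasks if task["created"] > cutoff]


# Hypothetical task records; real benchmark metadata may differ.
tasks = [
    {"id": "task-a", "created": date(2021, 3, 1)},
    {"id": "task-b", "created": date(2023, 6, 15)},
    {"id": "task-c", "created": date(2023, 11, 2)},
]

subset = held_out(tasks, cutoff=date(2023, 1, 1))
print([task["id"] for task in subset])
```

A date filter only reduces the risk of leakage; it does not rule it out, since a fix may have circulated before or outside the recorded date.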
How to cite this benchmark
Use the primary methodology source for academic citations; reference the Policy Window article for the cross-model leaderboard.
- Primary methodology: https://openai.com/index/introducing-swe-bench-verified/
- Wiki article: https://policywindow.org/wiki/swe-bench-verified