SWE-bench Verified

SWE-BENCH-VER · Agentic tasks

Live · 2024

SWE-bench Verified is an agentic-task benchmark published in 2024. It measures whether a model can solve real-world GitHub issues from 12 popular Python repositories. The 'Verified' subset is human-validated to remove ambiguous tasks and to ensure that every task has working tests. Contamination risk: medium.

What this benchmark measures

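Models must solve real-world GitHub issues drawn from 12 popular Python repositories. The 'Verified' subset is human-validated to remove ambiguous tasks and to ensure that every task has working tests.

The verified subset contains 500 tasks. Evaluation happens at run time: each submitted patch is checked by running tests, so a score can't be gamed by pure memorisation. The agent harness used to produce the patches does affect results, however.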
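As a rough illustration of run-time scoring, the sketch below marks a task as solved only when every test that the fix should repair passes and every test that already passed still passes. This follows the FAIL_TO_PASS / PASS_TO_PASS convention of the SWE-bench dataset; the `TaskResult` class and both function names are hypothetical and are not part of any official harness.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """Hypothetical record of one task's test outcomes after applying a patch."""
    instance_id: str
    fail_to_pass: dict[str, bool]  # tests the fix should repair -> passed?
    pass_to_pass: dict[str, bool]  # tests that must keep passing -> passed?


def is_resolved(result: TaskResult) -> bool:
    # Assumption: a task with no FAIL_TO_PASS tests cannot count as solved.
    return (
        bool(result.fail_to_pass)
        and all(result.fail_to_pass.values())
        and all(result.pass_to_pass.values())
    )


def percent_solved(results: list[TaskResult]) -> float:
    """Share of tasks resolved, as a percentage of all tasks attempted."""
    if not results:
        return 0.0
    return 100.0 * sum(is_resolved(r) for r in results) / len(results)
```

Because the score depends on which patch the agent actually produces, the same model can land on different values of `percent_solved` under different harnesses.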

Claimed scores

Model              Score            Claim type     Reported      Citation
claude-opus-4-7    78.4 % solved    vendor card    2025-05-22    Anthropic model card
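To put the headline percentage in concrete terms, the sketch below converts it to a task count and estimates its sampling uncertainty. It assumes the score was measured on all 500 tasks and treats tasks as independent trials, which is a simplification; the function names are illustrative only.

```python
import math

N_TASKS = 500  # size of the verified subset


def tasks_solved(percent: float, n_tasks: int = N_TASKS) -> int:
    """Number of tasks implied by a reported percentage."""
    return round(percent / 100.0 * n_tasks)


def standard_error_points(percent: float, n_tasks: int = N_TASKS) -> float:
    """Binomial standard error of the score, in percentage points."""
    p = percent / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / n_tasks)


print(tasks_solved(78.4))  # 392
print(round(standard_error_points(78.4), 2))
```

Under these assumptions, differences of a point or two between models sit within sampling noise, before any harness effects are considered.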

Interpretation guidance

Contamination risk: medium

Some test items may leak into training corpora; treat headline scores with mild skepticism and prefer evaluation runs with held-out subsets.
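One way to build a held-out subset is to keep only tasks whose issue was created after a model's training cutoff. The sketch below assumes each task record carries an ISO 8601 `created_at` string; the example records and the cutoff date are made up for illustration.

```python
from datetime import datetime, timezone


def held_out_after(tasks: list[dict], cutoff: datetime) -> list[dict]:
    """Keep tasks whose issue was created strictly after the cutoff."""
    return [
        task for task in tasks
        if datetime.fromisoformat(task["created_at"]) > cutoff
    ]


# Hypothetical task records and training cutoff, for illustration only.
tasks = [
    {"instance_id": "example__repo-1", "created_at": "2021-03-01T12:00:00+00:00"},
    {"instance_id": "example__repo-2", "created_at": "2023-06-15T08:30:00+00:00"},
]
cutoff = datetime(2023, 1, 1, tzinfo=timezone.utc)

subset = held_out_after(tasks, cutoff)
print([task["instance_id"] for task in subset])  # ['example__repo-2']
```

A date filter lowers the chance of leakage but does not remove it, and a smaller subset gives a noisier score.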

How to cite this benchmark

Use the primary methodology source for academic citations; reference the Policy Window article for the cross-model leaderboard.

References

  1. SWE-bench Verified methodology
  2. claude-opus-4-7 — 78.4 % solved (Anthropic model card, 2025-05-22)

Generated from the Policy Window catalog at . Each claim cites the originating primary source.

Wiki articles regenerate when the underlying catalog updates.