SWE-bench Verified

Policy Window Editorial Board

SWE-BENCH-VER · Agentic tasks

Saturated · 2024

Last verified 2026-06-21

What it measures

The benchmark asks a system to resolve real-world GitHub issues drawn from 12 popular Python repositories. The 'Verified' subset is human-validated to remove ambiguous issue descriptions and to ensure each task has working tests.

A 500-task human-verified subset. Evaluation happens at run time, so scores cannot be gamed by pure memorisation, although the choice of agent harness affects results. Status as of 2026-06-21: Verified is effectively saturated and retired. OpenAI's post "Why we no longer evaluate SWE-bench Verified" recommends SWE-bench Pro (Scale AI, 1,865 tasks) as the named successor. Frontier resolution rates have moved well past the tracked 78.4% (Claude Opus 4.8 at ~88.6%, May 2026; OpenAI cites a rise from 74.9% to 80.9% in six months).
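
The run-time scoring idea can be sketched in a few lines of Python. This is a minimal illustration, not the official harness. The `Task` class and the `results` mapping are hypothetical simplifications, and the field names only echo the FAIL_TO_PASS and PASS_TO_PASS test lists in the SWE-bench dataset. The sketch assumes a task counts as resolved only when every listed test passes after the candidate patch is applied.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    """Hypothetical, simplified view of one benchmark task."""
    instance_id: str
    # Tests that fail before the fix and must pass after it.
    fail_to_pass: list[str] = field(default_factory=list)
    # Tests that already pass and must not be broken by the patch.
    pass_to_pass: list[str] = field(default_factory=list)


def is_resolved(task: Task, results: dict[str, bool]) -> bool:
    """Return True only if every required test passed.

    `results` maps test name -> True (passed) or False (failed).
    A test missing from `results` is treated as failed.
    """
    required = task.fail_to_pass + task.pass_to_pass
    return all(results.get(name, False) for name in required)


def percent_resolved(tasks: list[Task], runs: dict[str, dict[str, bool]]) -> float:
    """Headline metric: share of tasks resolved, as a percentage."""
    if not tasks:
        return 0.0
    solved = sum(is_resolved(t, runs.get(t.instance_id, {})) for t in tasks)
    return 100.0 * solved / len(tasks)
```

Because the verdict depends on executing tests rather than matching text, the score is only as trustworthy as the tests themselves, which is the weakness the later sections examine.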

Contamination & gaming

SWE-bench Verified is a run-time benchmark — a candidate patch is applied and executed against the repository's tests — so it is not gameable by pure text memorisation in the way a multiple-choice exam is; this execution-grounded design aligns it with agentic-evaluation regimes that score multi-step tool-use behaviour rather than recall ¹. The article's at-a-glance "medium" contamination rating nonetheless rests on a concrete temporal problem: the task instances are drawn from real GitHub issues and their resolving pull requests, and over 94% of the issues in the original SWE-bench and their pull requests were created prior to the training cut-off dates of current frontier models ². The same study isolates leakage diagnostically: state-of-the-art models identify the buggy file path from the issue text alone with up to 76% accuracy on Verified, but only ~53% on issues from repositories absent from the benchmark — a gap consistent with memorisation rather than reasoning over the codebase ³. In an internal audit, OpenAI reports that every frontier model it tested could reproduce verbatim gold patches and problem details for some Verified tasks, indicating direct training exposure (OpenAI 2026). The standard mitigation is temporal decontamination: pipelines such as SWE-rebench continuously collect tasks created after model release dates and filter accordingly ⁴, and successor variants (e.g. "Pro"/private hold-outs) exist precisely so that headline numbers can be re-grounded on unseen code.
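
The temporal filter behind this mitigation can be sketched as below. This is an illustration of the general idea only, not SWE-rebench's actual pipeline. The task records, their `created` field and the cutoff date are all hypothetical. A task created before the cutoff is treated as possibly seen, not as proven leaked.

```python
from datetime import date

# Hypothetical task records: only the creation date of the underlying issue matters here.
tasks = [
    {"id": "repo-a-1", "created": date(2021, 3, 14)},
    {"id": "repo-b-7", "created": date(2023, 9, 2)},
    {"id": "repo-c-4", "created": date(2025, 11, 30)},
]


def split_by_cutoff(tasks, training_cutoff):
    """Partition tasks into (possibly seen, unseen) relative to a training cutoff."""
    seen = [t for t in tasks if t["created"] <= training_cutoff]
    unseen = [t for t in tasks if t["created"] > training_cutoff]
    return seen, unseen


# Assumed cutoff for an imaginary model; real cutoffs differ per model.
seen, unseen = split_by_cutoff(tasks, training_cutoff=date(2025, 1, 1))

# Share of the task pool that predates the cutoff and so may have been in training data.
exposed_share = len(seen) / len(tasks)
```

Scoring a model only on the `unseen` partition is what re-grounds a headline number on code the model cannot have memorised, at the cost of a smaller and model-specific task pool.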

Critiques & limitations

The "Verified" subset was built to fix documented defects in the original SWE-bench: OpenAI engaged 93 professional Python developers to screen 1,699 sampled instances for underspecified issue text and over-strict tests that reject valid fixes, yielding the 500-task curated set (OpenAI 2024). Curation did not, however, resolve two deeper construct-validity problems. SWE-Bench+ finds that 32.67% of successful patches involve solution leakage — the fix appears in the issue report or comments — and that 31.08% of passing patches are "suspicious" because the test oracle is too weak to confirm correctness; both pathologies persist in SWE-bench Verified, not only in Lite ². When problematic items are removed, SWE-agent+GPT-4's resolution rate collapses from 12.47% to 3.97%, implying that a substantial share of headline performance is attributable to flawed items rather than genuine issue-resolution ². OpenAI's own audit of 138 tasks o3 failed across 64 runs, each reviewed by at least six experienced software engineers, found 59.4% contained flawed tests — 35.5% over-strict and 18.8% testing unspecified behaviour (OpenAI 2026). Such item-level fragility is why the capability-evaluation literature stresses that headline eval results should be interpreted cautiously and audited against the underlying items rather than taken at face value ⁵. In Policy Window's composite reading, a Verified score conflates true SWE capability with leakage and oracle artefacts; the construct it nominally measures — autonomously resolving real multi-file issues ⁶ — is only loosely identified by the score.
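
The effect of discounting flawed items on a headline rate can be shown with a small re-scoring sketch. The records and the flag names `leaked` and `weak_oracle` are hypothetical, and the toy numbers are not the SWE-Bench+ data. The accounting is also an assumption: here a flagged pass is counted as unresolved while the denominator stays fixed, which is one plausible reading of "removing problematic items".

```python
def resolve_rate(records):
    """Percentage of records whose patch passed the tests."""
    if not records:
        return 0.0
    return 100.0 * sum(r["passed"] for r in records) / len(records)


def adjusted_resolve_rate(records):
    """Percentage of records that passed AND carry no leakage or weak-oracle flag.

    Assumption: flagged passes are treated as unresolved; the denominator is unchanged.
    """
    if not records:
        return 0.0
    clean_passes = sum(
        r["passed"] and not r["leaked"] and not r["weak_oracle"] for r in records
    )
    return 100.0 * clean_passes / len(records)


# Toy data: 10 attempts, 4 passes, of which 2 had the fix in the issue text
# and 1 passed only because the tests were too weak to check correctness.
records = (
    [{"passed": True, "leaked": True, "weak_oracle": False}] * 2
    + [{"passed": True, "leaked": False, "weak_oracle": True}]
    + [{"passed": True, "leaked": False, "weak_oracle": False}]
    + [{"passed": False, "leaked": False, "weak_oracle": False}] * 6
)

raw = resolve_rate(records)                 # 40.0
adjusted = adjusted_resolve_rate(records)   # 10.0
```

The gap between `raw` and `adjusted` is the portion of the headline attributable to item flaws under this accounting; dropping flagged items from the denominator as well would give a different figure.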

Saturation & score trajectory

The benchmark moved from near-floor to near-ceiling in roughly two years, a trajectory that itself motivated its retirement. In the original SWE-bench paper the strongest system, Claude 2, resolved only 1.96% of issues, and the authors noted that even frontier proprietary models solved "only the simplest issues" given the need to coordinate edits across multiple functions, classes, and files ⁶. The Verified curation reset the baseline upward — at launch GPT-4o resolved 33.2% of Verified tasks versus ~16% on the uncurated set, which OpenAI framed as evidence the original underestimated capability (OpenAI 2024). Vendor model cards subsequently reported figures in the high-70s, and the claim tracked on this article (claude-opus-4-7, 78.4%, 2025-05-22) sits in that band; the stakes of getting such a score right are high because software development is among the work-task domains most exposed to LLM automation ⁷. As scores compressed toward the top of the 0–100% range, the headline lost discriminative power, and the combination of saturation with the contamination and test-flaw evidence above led OpenAI to stop reporting Verified and recommend others do the same, pointing to harder, decontaminated successors (OpenAI 2026). The lesson Policy Window draws — an editorial reading — is that a metric near ceiling is best treated as a lower bound on the easiest tasks rather than a frontier-capability signal, since rankings among near-saturated systems are dominated by the artefacts the prior section documents; this is consistent with the broader eval literature's caution against reading capability headlines at face value ⁵.

Results & interpretation

Claimed scores

Model	Score	Claim type	Reported	Citation
claude-opus-4-7	78.4 % solved	vendor card	2025-05-22	Anthropic model card

How to read this number

Contamination risk: medium

Some test items may leak into training corpora; treat headline scores with mild skepticism and prefer evaluation runs with held-out subsets.

What a high score does and does not establish. A score evidences performance on this benchmark’s specific construct under its specific format; it is not, on its own, evidence of general capability, reliable real-world task performance, or safety. This benchmark is saturated, so small differences near the ceiling no longer reliably separate frontier from mid-tier systems.
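
Sampling noise alone already limits how finely a 500-task score can separate systems, before contamination or test flaws are considered. The sketch below is a back-of-envelope check using the standard binomial approximation. It assumes each task is an independent pass/fail trial and treats the two scores as independent samples, which is a simplification because real comparisons run both systems on the same 500 tasks. The function names are illustrative.

```python
import math


def standard_error(p, n):
    """Binomial standard error of a resolve rate p measured on n tasks."""
    return math.sqrt(p * (1.0 - p) / n)


def difference_is_distinguishable(p1, p2, n, z=1.96):
    """Rough two-proportion check at about the 95% level (z = 1.96).

    Assumption: the two scores are independent samples of n tasks each.
    """
    se_diff = math.sqrt(standard_error(p1, n) ** 2 + standard_error(p2, n) ** 2)
    return abs(p1 - p2) > z * se_diff


# A 78.4% score on 500 tasks carries a standard error of roughly 1.8 points.
se = standard_error(0.784, 500)

# A 2.5-point gap falls inside the noise; a 10-point gap does not.
close_gap = difference_is_distinguishable(0.784, 0.809, 500)   # False
wide_gap = difference_is_distinguishable(0.784, 0.886, 500)    # True
```

Under these assumptions, gaps of two or three points between near-ceiling systems should not be read as a ranking.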

The second silence (evidence: thin). The evidence that a benchmark score predicts real-world deployment outcomes (construct-to-deployment validity) is sparse; benchmark performance and deployed performance are not established to be the same thing, and contamination can inflate the headline figure above true held-out ability.

Governance relevance

A benchmark measures a capability; governance attaches to the topics that capability bears on. These topic articles carry the instrument×dimension coverage matrix and the social-science so-what for this domain.

Foundation Models / GPAI: coverage matrix + does-governance-work evidence
Agentic AI Governance: coverage matrix + does-governance-work evidence

References

Sources cited inline in the analysis (linked from the superscript markers), then the primary instrument sources behind the classifications.

1. Maksym Andriushchenko, Alexandra Souly, Mateusz Dziemian, Derek Duenas, Maxwell Lin, Justin Wang, Dan Hendrycks, Andy Zou, Zico Kolter, Matt Fredrikson, Eric Winsor, Jerome Wynne, Yarin Gal, Xander Davies (UK AISI / Gray Swan) (2025) AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents, ICLR 2025. arXiv:2410.09024. Provides a 440-task benchmark across 11 harm categories measuring whether LLM agents resist or comply with harmful multi-step tool-use tasks, grounding safety-evaluation regimes for agents.
2. arXiv:2410.06992
3. arXiv:2506.12286
4. arXiv:2505.20411
5. Mary Phuong, Matthew Aitchison, Elliot Catt, et al. (Google DeepMind) (2024) Evaluating Frontier Models for Dangerous Capabilities, arXiv (cs.LG). arXiv:2403.13793. Pilots dangerous-capability evaluations (persuasion, cyber, self-proliferation) on frontier models, finding 'early warning signs' but no strong present danger, which grounds evaluation-based gating.
6. arXiv:2310.06770
7. Eloundou, Manning, Mishkin, Rock (2024) GPTs are GPTs: Labor market impact potential of LLMs, Science. 10.1126/science.adj0998. Finds around 80% of the U.S. workforce "could have at least 10% of their work tasks affected" by LLMs, which exhibit "traits of general-purpose technologies".
8. SWE-bench Verified methodology
9. claude-opus-4-7: 78.4 % solved (Anthropic model card, 2025-05-22)

How to cite this benchmark

Use the primary methodology source for academic citations; reference the Policy Window article for the cross-model leaderboard.

@misc{policywindow-swe-bench-verified,
  title  = {SWE-bench Verified},
  author = {Policy Window},
  year   = {2024},
  howpublished = {SWE-BENCH-VER (2024)},
  url    = {https://policywindow.org/wiki/swe-bench-verified},
  note   = {Primary source: https://openai.com/index/introducing-swe-bench-verified/}
}

Persistent identifier: https://policywindow.org/wiki/swe-bench-verified — committed-stable URL with content-versioning via ?asOf= (rollout pending per methodology §7). DOIs via Zenodo are on the roadmap.
