Print-friendly view · use your browser's Save as PDF option (Cmd/Ctrl-P) to attach this article to a brief.
SWE-bench Verified
SWE-BENCH-VER · agentic benchmark · 2024
Source: https://policywindow.org/wiki/swe-bench-verified
Generated 2026-05-30T22:11:16 UTC
Summary
Solve real-world GitHub issues from 12 popular Python repos. The 'Verified' subset is human-validated to remove ambiguity and have working tests.
At a glance
- Score range
- 0–100 % solved
- Contamination risk
- medium
- Methodology URL
- https://openai.com/index/introducing-swe-bench-verified/
- Saturation status
- active
Details
500-task verified subset. Run-time evaluation; can't be gamed by pure memorisation but agent harness affects results.
How to cite this article
APA
Policy Window. (2024). SWE-bench Verified [Wiki article — Benchmark]. https://policywindow.org/wiki/swe-bench-verified
Chicago
Policy Window. 2024. "SWE-bench Verified." Wiki article (Benchmark). https://policywindow.org/wiki/swe-bench-verified.
Harvard
Policy Window (2024) 'SWE-bench Verified', Wiki article — Benchmark, available at: https://policywindow.org/wiki/swe-bench-verified.
OSCOLA
Policy Window, 'SWE-bench Verified' (Wiki article — Benchmark, 2024) <https://policywindow.org/wiki/swe-bench-verified> accessed [date].
BibTeX
@misc{policywindow-swe-bench-verified,
title = {SWE-bench Verified},
author = {Policy Window},
year = {2024},
howpublished = {SWE-BENCH-VER (2024)},
url = {https://policywindow.org/wiki/swe-bench-verified},
note = {Primary source: https://openai.com/index/introducing-swe-bench-verified/}
}