HumanEval

Policy Window Editorial Board

HUMANEVAL · Code generation

Deprecated · 2021

Last verified 2026-06-21

This benchmark is deprecated — for frontier evaluation, consult SWE-bench Verified.

What it measures

164 hand-written Python programming problems. For each one, the model must generate a function that passes the provided unit tests.

Saturated: top models score around 95%. Largely superseded by SWE-bench for real-world relevance. Currency check (2026-06-21): verified current. HumanEval remains saturated and deprecated, as this article states. The current pass@1 leaders (o4-mini ~97.3%, o3 ~97%, Claude Opus 4.6 ~96.3%) sit at the ceiling, consistent with the ">90% / ~95%" framing used here, and the field continues to migrate to contamination-resistant successors (SWE-bench Verified/Pro, LiveCodeBench). No figure or claim was found to be stale.

Construct & what it actually measures

HumanEval operationalises code-generation ability as a narrow, well-bounded task: given a Python function signature and a natural-language docstring, the model must emit a function body that passes a small set of held-out unit tests, scored by the pass@k estimator ¹. This construct is deliberately self-contained. Each of the 164 problems is short, single-function, algorithmic, and dependency-free, with a fully specified input/output contract. That design buys clean, executable, deterministic grading, but it also fixes the ceiling of what a score can certify.
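The task shape described above can be sketched as follows. This is a minimal illustration under stated assumptions: the `clamp` problem, the completion, and the `passes` helper are invented here and are not taken from the benchmark or its official harness. A production harness would also sandbox execution and enforce timeouts, which this sketch omits.

```python
# Invented problem in the HumanEval style: a signature plus a docstring.
PROMPT = '''def clamp(x, lo, hi):
    """Return x limited to the closed interval [lo, hi]."""
'''

# Stand-in for a model's output: only the function body is generated.
COMPLETION = '''    return max(lo, min(x, hi))
'''

# Unit tests that the model does not see when generating.
TESTS = '''def check(candidate):
    assert candidate(5, 0, 10) == 5
    assert candidate(-3, 0, 10) == 0
    assert candidate(42, 0, 10) == 10
'''


def passes(prompt: str, completion: str, tests: str, entry_point: str) -> bool:
    """Return True only if prompt + completion satisfies every assertion in tests."""
    namespace: dict = {}
    try:
        exec(prompt + completion, namespace)        # define the candidate function
        exec(tests, namespace)                      # define check()
        namespace["check"](namespace[entry_point])  # run the hidden tests
    except Exception:
        # Any failed assertion, syntax error, or runtime error counts as a fail.
        return False
    return True
```

Grading is binary per sample: `passes(PROMPT, COMPLETION, TESTS, "clamp")` succeeds, while a completion such as `"    return x\n"` fails on the second assertion. Nothing about style, security, or efficiency enters the verdict.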

The gap between the named construct ("code generation") and the measured construct ("function completion from a complete docstring under hidden tests") is wide and documented. HumanEval does not exercise multi-file reasoning, repository context, dependency resolution, debugging of existing code, or specification ambiguity — the dimensions that dominate real software work and that successor suites such as SWE-bench were built to probe ². It also conflates two abilities that governance cares about separately: understanding intent and producing correct logic. Because grading is purely functional, stylistic quality, security, and efficiency are invisible to the metric. A high pass@1 therefore licenses the claim "this system completes short, fully specified Python functions," not the broader "this system can engineer software" — a distinction the Policy Window leaderboard preserves but that aggregated headline numbers routinely elide.

Saturation & score trajectory

HumanEval's trajectory is the canonical illustration of benchmark saturation. The original Codex model scored 28.8% pass@1 at release, against 0% for GPT-3 and 11.4% for GPT-J, with pass@100 reaching 70.2% — i.e. the model could often produce a correct answer if allowed many samples but rarely on the first try ¹. Within two years GPT-4 reported 67.0% zero-shot pass@1 ³, and independent re-runs under different harnesses placed it near 88% on the base set ⁴. By 2024-2026 frontier systems are widely reported above 90%, clustering against the ceiling.
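The pass@1 and pass@100 figures come from the same estimator, defined in the original paper ¹: for each problem, draw n samples, count the c that pass, estimate 1 - C(n-c, k) / C(n, k), then average over problems. A minimal sketch, with function names chosen here for illustration:

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem: n samples drawn, c of them correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples are incorrect,
    # so the estimate is exactly 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average the per-problem estimates; results holds one (n, c) pair per problem."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

With 10 samples of which 3 pass, `pass_at_k(10, 3, 1)` is 0.3 while `pass_at_k(10, 3, 10)` is 1.0, which is the same pattern as the Codex result: a model that rarely succeeds on the first try can still score well when many samples are allowed.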

Note that harness and prompting differences alone move the figure by ~20 points (67% reported vs ~88% re-run for the same model), so cross-report comparison is fragile. Saturation has a concrete consequence for evaluation: once the leaders sit at 95-97%, the residual headroom is dominated by the benchmark's own label noise and ambiguous items rather than by capability differences, so small gaps near the top no longer reliably separate systems. This is why HumanEval is now treated as a regression check rather than a frontier discriminator, and why the field migrated to harder, contamination-resistant successors. The Policy Window article's "deprecated" status and saturation caveat encode exactly this: a near-ceiling HumanEval score is evidence the floor has been cleared, not evidence of frontier standing.
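The arithmetic behind the saturation point can be made concrete. The sketch below converts score gaps into problem counts and, as a simplifying assumption that the article does not itself make, treats each of the 164 problems as an independent pass/fail trial to get a rough sampling error. The function names are illustrative.

```python
from math import sqrt

N_PROBLEMS = 164  # size of HumanEval, as stated in this article


def points_per_problem(n: int = N_PROBLEMS) -> float:
    """Percentage points contributed by a single problem."""
    return 100.0 / n


def problems_apart(score_a_pct: float, score_b_pct: float, n: int = N_PROBLEMS) -> float:
    """How many problems separate two pass@1 scores given in percent."""
    return abs(score_a_pct - score_b_pct) / points_per_problem(n)


def standard_error_points(score_pct: float, n: int = N_PROBLEMS) -> float:
    """Binomial standard error of a score, in percentage points.

    Assumes independent pass/fail outcomes per problem (a simplification).
    """
    p = score_pct / 100.0
    return 100.0 * sqrt(p * (1.0 - p) / n)
```

One problem is worth about 0.61 points, so a one-point gap between two leaders amounts to fewer than two problems. Under the independence assumption, the standard error at a 96% score is roughly 1.5 points, which is larger than the gaps that separate the top-listed systems.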

HumanEval pass@1 progression on primary-reported figures. Codex/GPT-3 are from the original benchmark paper (arXiv:2107.03374); GPT-4 (report) is the zero-shot figure from the GPT-4 Technical Report (arXiv:2303.08774); GPT-4 (EvalPlus harness) is the independently re-run base figure from arXiv:2305.01210. Later >90% figures are vendor/aggregator-reported and not independently re-verified here.
Model	Year	Reported pass@1	Source / status
GPT-3 (zero-shot)	2021	0%	Original paper (2107.03374)
Codex (12B)	2021	28.8%	Original paper, pass@1; pass@100 = 70.2%
GPT-4 (zero-shot)	2023	67.0%	GPT-4 Technical Report (2303.08774)
GPT-4 (EvalPlus re-run, base)	2023	~88%	Independent harness, arXiv:2305.01210
Frontier models	2024-2026	>90% (saturated)	Vendor/aggregator-reported; attributed as such

Contamination & gaming

HumanEval carries one of the highest contamination risks of any widely cited benchmark, for a structural reason: it has been public on GitHub since 2021, so its prompts and reference solutions are almost certainly inside the pretraining and instruction-tuning corpora of any modern model. A keyword search by Matton et al. ⁵ found every HumanEval prompt replicated on public GitHub, with a median of 99 hits and a minimum of 43, and showed that adding the synthetic evol-instruct dataset raised one model's HumanEval pass@1 by 14 absolute points (0.52 to 0.66) while barely moving MBPP — a signature of indirect leakage through synthetic data pipelines. Riddell, Ni & Cohan ⁶ quantified direct overlap, finding exact-match solutions for 12.2% of HumanEval problems in the Pile and 18.9% in the Stack, using surface-level (Levenshtein edit-distance) and semantic (AST-based, via Dolos) similarity detection. Matton et al. distinguish three contamination channels: direct leakage, indirect leakage via synthetic data, and overfitting to the test set during model selection ⁵.
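Surface-level overlap detection of the kind mentioned above can be sketched with a plain Levenshtein edit distance. This is an illustrative toy, not the pipeline used in ⁶: the whitespace normalisation, the similarity scaling, and all function names are assumptions made here, and a real scan over a corpus the size of the Pile or the Stack would need indexing and approximate matching instead of pairwise comparison.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def normalise(code: str) -> str:
    """Collapse whitespace runs so formatting differences do not count as edits."""
    return " ".join(code.split())


def similarity(a: str, b: str) -> float:
    """Edit distance rescaled to [0, 1]; 1.0 means identical after normalisation."""
    a, b = normalise(a), normalise(b)
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def best_match(solution: str, corpus_snippets: list[str]) -> float:
    """Highest similarity between a reference solution and any corpus snippet."""
    return max((similarity(solution, s) for s in corpus_snippets), default=0.0)
```

A score of 1.0 corresponds to an exact match up to whitespace. Where to place a cut-off for "near duplicate" is a judgment call, and surface similarity misses renamed or restructured copies, which is why the cited work pairs it with an AST-based semantic check.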

The headline figure is further inflated by the benchmark's own weakness: its hidden tests are too sparse to catch subtle bugs. The EvalPlus/HumanEval+ work extended the test suite roughly 80x and found that pass@k dropped by up to 19.3-28.9% across 26 models, also documenting 18 defects (11% of problems) in HumanEval's own ground truth ⁴. HumanEval+ exists precisely as the rigorous variant — the code-generation analogue of a "Verified" set — and re-ranks several models relative to the base benchmark, underscoring that base-HumanEval rankings can be artefacts of weak grading rather than true capability.
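The effect of sparse tests can be shown on a toy case. The example below is hypothetical: the leap-year problem, both implementations, and the input lists are invented for illustration and do not come from HumanEval or EvalPlus. It compares a candidate against a reference solution on a small base input set and then on an extended one.

```python
def is_leap_buggy(year: int) -> bool:
    """Plausible but wrong: ignores the century rules."""
    return year % 4 == 0


def is_leap_reference(year: int) -> bool:
    """Gregorian rule: divisible by 4, except centuries not divisible by 400."""
    return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)


# Sparse base inputs: none exposes the century bug.
BASE_INPUTS = [2024, 2023, 2000]

# Extra inputs of the kind an extended suite might add.
EXTRA_INPUTS = [1900, 2100, 1600]


def passes_all(candidate, reference, inputs) -> bool:
    """True if the candidate agrees with the reference on every input."""
    return all(candidate(x) == reference(x) for x in inputs)
```

Here `passes_all(is_leap_buggy, is_leap_reference, BASE_INPUTS)` holds, so the buggy function would count as solved under the base tests, while the same call on `BASE_INPUTS + EXTRA_INPUTS` fails at 1900. Scaled across a benchmark, this is how a denser test suite lowers pass@k without any change in the models.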

Results & interpretation

Claimed scores

No claims have been recorded yet for this benchmark in the Policy Window catalog.

How to read this number

Contamination risk: high

Test-set leakage is widely documented; scores near saturation should not be treated as evidence of generalization. Prefer harder successor benchmarks.

What a high score does and does not establish. A score evidences performance on this benchmark’s specific construct under its specific format; it is not, on its own, evidence of general capability, reliable real-world task performance, or safety. This benchmark is saturated, so small differences near the ceiling no longer reliably separate frontier from mid-tier systems.

The second silence (evidence: thin). The evidence that a benchmark score predicts real-world deployment outcomes (construct-to-deployment validity) is sparse; benchmark performance and deployed performance are not established to be the same thing, and contamination can inflate the headline figure above true held-out ability.

Governance relevance

A benchmark measures a capability; governance attaches to the topics that capability bears on. These topic articles carry the instrument×dimension coverage matrix and the social-science so-what for this domain.

Foundation Models / GPAI: coverage matrix + does-governance-work evidence

References

Sources cited inline in the analysis, numbered in order of appearance.

How to cite this benchmark

Use the primary methodology source for academic citations; reference the Policy Window article for the cross-model leaderboard.

@misc{policywindow-humaneval,
  title  = {HumanEval},
  author = {Policy Window},
  year   = {2021},
  howpublished = {HUMANEVAL (2021)},
  url    = {https://policywindow.org/wiki/humaneval},
  note   = {Primary source: https://arxiv.org/abs/2107.03374}
}

Persistent identifier: https://policywindow.org/wiki/humaneval — committed-stable URL with content-versioning via ?asOf= (rollout pending per methodology §7). DOIs via Zenodo are on the roadmap.

[1] arXiv:2107.03374
[2] arXiv:2310.06770
[3] arXiv:2303.08774
[4] arXiv:2305.01210
[5] arXiv:2407.07565
[6] arXiv:2403.04811
