HumanEval

Policy Window Editorial Board

Pinned snapshot: this article is rendered from the catalog state captured on 2026-05-29. The live version may differ.

HUMANEVAL · Code generation

Deprecated · 2021 · Editorial review pending

HumanEval is a code generation benchmark, published in 2021, consisting of 164 hand-written Python programming problems. For each problem, the model must generate a function that passes the provided unit tests. Contamination risk: high.

This benchmark is deprecated — for frontier evaluation, consult SWE-bench Verified.

What this benchmark measures

The benchmark consists of 164 hand-written Python programming problems. For each problem, the model must generate a function that passes the provided unit tests.
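The pass/fail check described above can be sketched roughly as follows. This is a minimal illustration, not the benchmark's actual harness: the `add` problem, the `check(candidate)` test shape, and the `passes_tests` helper are all hypothetical names chosen for the example, and a real evaluation would run generated code in a sandbox rather than calling `exec` directly.

```python
# Hypothetical problem in the style described above; NOT taken from the benchmark.
PROMPT = '''def add(a: int, b: int) -> int:
    """Return the sum of a and b."""
'''

# Stand-in for a model completion (the function body only).
COMPLETION = "    return a + b\n"

# Unit tests expressed as a check(candidate) function; this shape is an assumption.
TESTS = '''def check(candidate):
    assert candidate(1, 2) == 3
    assert candidate(-1, 1) == 0
'''


def passes_tests(prompt: str, completion: str, tests: str, entry_point: str) -> bool:
    """Return True if prompt + completion defines a function that passes the tests."""
    namespace: dict = {}
    try:
        # WARNING: exec runs untrusted code; a real harness would sandbox this step.
        exec(prompt + completion, namespace)
        exec(tests, namespace)
        namespace["check"](namespace[entry_point])
    except Exception:
        # Any failure (syntax error, runtime error, failed assertion) counts as a miss.
        return False
    return True


result_ok = passes_tests(PROMPT, COMPLETION, TESTS, "add")
result_bad = passes_tests(PROMPT, "    return a - b\n", TESTS, "add")
print(result_ok, result_bad)
```

A problem counts as solved only when every assertion passes, so a completion that is almost right scores the same as one that does not run at all.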

Saturated: top models score around 95%, which leaves only about 8 of the 164 problems to separate them. Largely superseded by SWE-bench for real-world relevance.

Claimed scores

No claims have been recorded yet for this benchmark in the Policy Window catalog.

Interpretation guidance

Contamination risk: high

Test-set leakage is widely documented; scores near saturation should not be treated as evidence of generalization. Prefer harder successor benchmarks.

How to cite this benchmark

Use the primary methodology source for academic citations; reference the Policy Window article for the cross-model leaderboard.

Primary methodology: https://arxiv.org/abs/2107.03374
Wiki article: https://policywindow.org/wiki/humaneval

References

HumanEval methodology

Cite this article

@misc{policywindow-humaneval,
  title  = {HumanEval},
  author = {Policy Window},
  year   = {2021},
  howpublished = {HUMANEVAL (2021)},
  url    = {https://policywindow.org/wiki/humaneval},
  note   = {Primary source: https://arxiv.org/abs/2107.03374}
}

Verify the year before using this entry, and refine it as needed after pasting. The primary source is linked in the note field of the BibTeX/RIS entry.

Persistent identifier: https://policywindow.org/wiki/humaneval — committed-stable URL with content-versioning via ?asOf= (rollout pending per methodology §7). DOIs via Zenodo are on the roadmap.
