?asOf= parameter to see the current catalog state.
HumanEval is a code generation benchmark published in 2021. It consists of 164 hand-written Python programming problems; for each one, the model must generate a function that passes the provided unit tests. Contamination risk: high.
This benchmark is deprecated. For frontier evaluation, consult SWE-bench Verified.
What this benchmark measures
HumanEval consists of 164 hand-written Python programming problems. For each problem, the model must generate a function that passes the provided unit tests.
The benchmark is saturated: top models score around 95%. It has largely been superseded by SWE-bench for real-world relevance.
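To make the task format concrete, here is a minimal sketch of how one problem of this kind could be scored. It is not the official evaluation harness. The problem, the names `PROMPT`, `UNIT_TESTS`, and `passes_tests`, and the `check(candidate)` convention are assumptions made for illustration, and the example problem is invented rather than taken from the benchmark.

```python
# Hypothetical problem in the style described above; NOT an actual HumanEval item.
PROMPT = '''def add(a: int, b: int) -> int:
    """Return the sum of a and b."""
'''

# Hypothetical unit tests, written as a check() function that receives the candidate.
UNIT_TESTS = '''def check(candidate):
    assert candidate(1, 2) == 3
    assert candidate(-1, 1) == 0
'''


def passes_tests(prompt: str, completion: str, tests: str, entry_point: str) -> bool:
    """Return True if prompt + completion defines a function that passes the tests.

    Assumption: the code is trusted. A real harness would run model output
    in a sandbox with timeouts instead of calling exec() in-process.
    """
    namespace: dict = {}
    try:
        exec(prompt + completion, namespace)      # define the candidate function
        exec(tests, namespace)                    # define check()
        namespace["check"](namespace[entry_point])
    except Exception:
        # Syntax errors, runtime errors, and failed assertions all count as a failure.
        return False
    return True


# A correct completion and an incorrect one for the same prompt.
result_ok = passes_tests(PROMPT, "    return a + b\n", UNIT_TESTS, "add")
result_bad = passes_tests(PROMPT, "    return a - b\n", UNIT_TESTS, "add")
```

Under this sketch a problem counts as solved only when every assertion passes, so partially correct output earns no credit.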
Claimed scores
No claims have been recorded yet for this benchmark in the Policy Window catalog.
Interpretation guidance
Contamination risk: high
Test-set leakage is widely documented; scores near saturation should not be treated as evidence of generalization. Prefer harder successor benchmarks.
How to cite this benchmark
Use the primary methodology source for academic citations; reference the Policy Window article for the cross-model leaderboard.
- Primary methodology: https://arxiv.org/abs/2107.03374
- Wiki article: https://policywindow.org/wiki/humaneval
Cite this article
Persistent identifier: https://policywindow.org/wiki/humaneval, a committed-stable URL with content versioning via ?asOf= (rollout pending per methodology §7). DOIs via Zenodo are on the roadmap.