HumanEval
HumanEval is a code generation benchmark published in 2021. It consists of 164 hand-written Python programming problems; for each one, the model must generate a function that passes the provided unit tests. Contamination risk: high.
What this benchmark measures
HumanEval measures functional correctness on 164 hand-written Python programming problems. For each problem, the model must generate a function that passes the provided unit tests.
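To make the task format concrete, here is a minimal sketch of how a single problem could be scored. The problem itself (`add`), the constants `PROMPT`, `COMPLETION`, and `TEST`, and the helper `passes_tests` are hypothetical illustrations, not items from the benchmark or its official harness. The sketch assumes the published dataset's layout, in which each problem supplies a function signature with a docstring as the prompt and a test string that defines `check(candidate)`.

```python
# Hypothetical HumanEval-style problem (not taken from the benchmark).
PROMPT = '''def add(a: int, b: int) -> int:
    """Return the sum of a and b."""
'''

# The model's output: the function body that continues the prompt.
COMPLETION = "    return a + b\n"

# Unit tests, written as a check(candidate) function.
TEST = '''def check(candidate):
    assert candidate(1, 2) == 3
    assert candidate(-1, 1) == 0
'''


def passes_tests(prompt: str, completion: str, test: str, entry_point: str) -> bool:
    """Return True if prompt + completion defines a function that passes the tests.

    Illustration only: exec() runs untrusted code with no sandbox and no
    timeout, so a real harness would need to isolate this step.
    """
    namespace: dict = {}
    try:
        exec(prompt + completion, namespace)       # define the candidate function
        exec(test, namespace)                      # define check(candidate)
        namespace["check"](namespace[entry_point])  # run the assertions
    except Exception:
        # Any failed assertion, syntax error, or runtime error counts as a failure.
        return False
    return True


if __name__ == "__main__":
    print(passes_tests(PROMPT, COMPLETION, TEST, "add"))
```

A problem counts as solved only if every assertion passes, so a completion that is nearly right (for example, `return a - b`) scores the same as one that does not run at all.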
The benchmark is saturated: top models score around 95%. It has been largely superseded by SWE-bench for real-world relevance.
Claimed scores
No claims have been recorded yet for this benchmark in the Policy Window catalog.
Interpretation guidance
Contamination risk: high
Test-set leakage is widely documented; scores near saturation should not be treated as evidence of generalization. Prefer harder successor benchmarks.
How to cite this benchmark
Use the primary methodology source for academic citations; reference the Policy Window article for the cross-model leaderboard.
- Primary methodology: https://arxiv.org/abs/2107.03374
- Wiki article: https://policywindow.org/wiki/humaneval