HumanEval

HUMANEVAL · Code generation

Live · 2021

HumanEval is a code generation benchmark published in 2021. It consists of 164 hand-written Python programming problems; for each one, the model must generate a function that passes the provided unit tests. Contamination risk: high.

What this benchmark measures

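The benchmark comprises 164 hand-written Python programming problems. For each problem, the model must generate a function that passes the provided unit tests; a problem counts as solved only if every test passes.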
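The pass/fail check described above can be sketched in a few lines of Python. This is a minimal illustration, not the official evaluation harness: the problem (`PROMPT`, `TEST`), the helper names `passes_tests` and `pass_rate`, and the two candidate completions are all hypothetical, and the layout (a prompt, a `check(candidate)` test function, and an entry-point name) is assumed here for illustration.

```python
# Hypothetical problem: a prompt (signature + docstring) and unit tests
# that exercise the generated function.
PROMPT = '''def add(a: int, b: int) -> int:
    """Return the sum of a and b."""
'''

TEST = '''def check(candidate):
    assert candidate(1, 2) == 3
    assert candidate(-1, 1) == 0
'''


def passes_tests(prompt: str, completion: str, test: str, entry_point: str) -> bool:
    """Return True if prompt + completion defines a function that passes every test.

    WARNING: exec() runs arbitrary code. A real harness should run
    model output in a sandbox with a timeout; this sketch does neither.
    """
    namespace: dict = {}
    try:
        exec(prompt + completion, namespace)      # define the candidate function
        exec(test, namespace)                     # define check(candidate)
        namespace["check"](namespace[entry_point])
    except Exception:
        # Syntax errors, runtime errors, and failed assertions all count as a fail.
        return False
    return True


def pass_rate(results: list) -> float:
    """Fraction of problems solved (0.0 for an empty list)."""
    return sum(results) / len(results) if results else 0.0


# Two hypothetical model completions for the same prompt.
good = "    return a + b\n"
bad = "    return a - b\n"

results = [
    passes_tests(PROMPT, good, TEST, "add"),
    passes_tests(PROMPT, bad, TEST, "add"),
]
print(results, pass_rate(results))
```

A headline score such as the roughly 95% reported for top models is this kind of pass rate computed over all 164 problems; it says nothing about whether a model saw the problems during training.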

The benchmark is saturated: top models score around 95%. It has largely been superseded by SWE-bench for real-world relevance.

Claimed scores

No claims have been recorded yet for this benchmark in the Policy Window catalog.

Interpretation guidance

Contamination risk: high

Test-set leakage is widely documented; scores near saturation should not be treated as evidence of generalization. Prefer harder successor benchmarks.

How to cite this benchmark

Use the primary methodology source for academic citations; reference the Policy Window article for the cross-model leaderboard.

References

  1. HumanEval methodology

Generated from the Policy Window catalog at . Each claim cites the originating primary source.