HumanEval
A code-generation benchmark from OpenAI: 164 Python programming problems with unit tests, used to measure an LLM's ability to generate correct code from a natural-language description.
HumanEval, released in 2021 alongside OpenAI's Codex paper, is one of the oldest widely cited coding benchmarks. Each problem includes a function signature, a docstring describing what the function should do, and a set of unit tests that the model does not see. The model generates the function body; pass@1 is the fraction of problems for which a single generated attempt passes all of the tests.
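To make the scoring concrete, here is a minimal Python sketch of how one problem might be checked and how pass@k can be estimated. The problem shown (`add`) is a hypothetical stand-in, not an item from the real dataset, and the `passes` helper is an illustrative name rather than part of any official harness. The `check(candidate)` convention and the `pass_at_k` formula follow the approach described for HumanEval, but treat the details as assumptions to verify against the official evaluation code.

```python
from math import comb

# Hypothetical HumanEval-style problem (NOT taken from the real dataset).
PROMPT = '''def add(a: int, b: int) -> int:
    """Return the sum of a and b."""
'''

# Unit tests the model never sees; by convention they define check(candidate).
TESTS = '''def check(candidate):
    assert candidate(1, 2) == 3
    assert candidate(-1, 1) == 0
'''

ENTRY_POINT = "add"  # name of the function under test


def passes(completion: str) -> bool:
    """Return True if prompt + completion satisfies every assertion in TESTS."""
    namespace: dict = {}
    try:
        # WARNING: exec runs untrusted code; a real harness sandboxes this step.
        exec(PROMPT + completion + "\n" + TESTS, namespace)
        namespace["check"](namespace[ENTRY_POINT])
        return True
    except Exception:
        return False


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples, of which c passed."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)


# Two illustrative completions: one correct, one buggy.
samples = ["    return a + b\n", "    return a - b\n"]
results = [passes(s) for s in samples]
score = pass_at_k(n=len(results), c=sum(results), k=1)
```

With k = 1 the estimate reduces to the share of samples that pass, so a benchmark-level pass@1 is this per-problem value averaged over all 164 problems.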
Frontier models in 2026 score 90–98% on HumanEval pass@1. The benchmark is essentially saturated: it no longer differentiates frontier models from one another. Use it as a sanity check (a coding model scoring under 70% is suspect), not as a buying signal.
For real coding-agent evaluation, SWE-bench Verified is the better contemporary benchmark. It measures end-to-end issue resolution on real GitHub repos, which is a much harder and more relevant task than generating single functions.
Frequently asked
Is HumanEval still relevant in 2026?
Marginally. As a floor, yes: a coding model that scores under 70% is a red flag. As a buying signal, no: frontier models all score 90% or higher, so the benchmark is saturated.
What replaced HumanEval as the coding benchmark?
SWE-bench Verified for end-to-end issue resolution. LiveCodeBench for continually-refreshed competitive programming. BigCodeBench for production-grade Python tasks with realistic dependencies.