Humanity's Last Exam

Policy Window Editorial Board


HLE · Knowledge

Live · 2025


Last verified 2026-06-21


What it measures

3,000+ frontier-difficulty expert-curated questions across all academic disciplines. Designed to remain unsaturated through 2026+.

Center for AI Safety + Scale AI collaboration. Frontier models 8-22% at launch. Replaces MMLU as the de-facto knowledge ceiling. Currency (2026-06-21): HLE SOTA has climbed past the article's framing — Artificial Analysis (June 2026) shows Claude Fable 5 ~53.3%, Claude Opus 4.8 ~45.7%, Gemini 3.1 Pro Preview ~44.7% (no-tools), and with-search results (e.g. Qwen3-Max Thinking) report ~49-58%; the article's top milestone (Gemini 3 Pro Preview 37.5%, "well above 40%") now understates the frontier and "active/unsaturated through 2026+" is strained.

Saturation and score trajectory

At launch (January 2025) HLE behaved as intended as a knowledge ceiling: reasoning-tuned frontier models clustered in the single digits, with OpenAI o1 at 9.1% and DeepSeek-R1 at 9.4% on the full benchmark, while non-reasoning models such as GPT-4o (3.3%) and Gemini 1.5 Pro (5.0%) scored lower still ¹. The first large jump came not from a larger base model but from tool use: OpenAI's agentic Deep Research, browsing autonomously for minutes per question, reached 26.6% in February 2025 — roughly a threefold gain over the best non-tool score at the time (OpenAI 2025-02-02). Pure-model scores then climbed more gradually through 2025–2026 as reasoning training matured rather than as raw scale increased, a regime the compute-optimal Chinchilla finding had already flagged by showing that model size and training tokens should scale together ².

The trajectory matters for how the number should be read. The pre-HLE expectation that capability tracks a smooth power-law in model size, data, and compute ³ under-predicts abrupt benchmark-specific jumps of the kind seen here: a test explicitly engineered to last "through 2026+" moved from under 10% to well above 40% within about eighteen months, consistent with the observation that some abilities emerge above a scale threshold and "would not have been directly predicted by extrapolating" smaller models ⁴. As scores rise, two things happen simultaneously — the remaining headroom shrinks, and the share of the score attributable to format effects, tool access, or item defects (see below) grows relative to genuine new capability, which complicates clean year-over-year comparison.

Contamination and gaming

HLE was designed against two failure modes that have eroded older knowledge benchmarks: training-data contamination and benchmark hacking. Items are curated to have a single unambiguous, verifiable answer that nonetheless "cannot be quickly answered via internet retrieval," which is intended to keep questions out of the easy reach of web-scraped pretraining corpora ¹. The most consequential anti-gaming measure is structural: alongside the publicly released questions, the maintainers hold out a private test set so that overfitting to the public split can be detected by comparing public and held-out accuracy (Phan et al. 2025). This is the design rationale behind treating the public leaderboard number as an upper bound rather than a clean held-out estimate, and it mirrors broader proposals to evaluate frontier systems under controlled conditions before judging their capabilities ⁵.
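The public versus held-out comparison described above can be sketched as a one-sided two-proportion z-test. This is only an illustration under stated assumptions: the function name `public_private_gap` and the counts in the usage lines are hypothetical, and the source does not say which statistical test, if any, the maintainers apply.

```python
import math


def public_private_gap(public_correct, public_total, private_correct, private_total):
    """Compare accuracy on the public split with accuracy on a held-out split.

    Returns (gap, z, p_value), where gap is public minus held-out accuracy and
    p_value is the one-sided probability of seeing a gap at least this large
    if both splits had the same underlying accuracy.
    """
    p_public = public_correct / public_total
    p_private = private_correct / private_total
    # Pooled accuracy under the null hypothesis of no overfitting.
    pooled = (public_correct + private_correct) / (public_total + private_total)
    se = math.sqrt(pooled * (1 - pooled) * (1 / public_total + 1 / private_total))
    z = (p_public - p_private) / se if se > 0 else 0.0
    # Upper-tail probability of the standard normal distribution.
    p_value = 0.5 * math.erfc(z / math.sqrt(2))
    return p_public - p_private, z, p_value


# Hypothetical counts, chosen only to illustrate the call; not real HLE data.
gap, z, p_value = public_private_gap(1000, 2500, 150, 500)
print(f"gap={gap:.3f}, z={z:.2f}, one-sided p={p_value:.2g}")
```

A large positive gap with a small p-value would be consistent with overfitting to the public split, although differences in item difficulty between the two splits could produce the same signal.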

The public/private split, however, does not neutralise capability gained through retrieval at inference time. The February 2025 Deep Research result (26.6%) was achieved by an agent that browses live sources, so part of that score reflects search rather than parametric knowledge — a deployment regime in which capability is mediated by cloud-served access rather than by what the model alone has memorised ⁶. Leaderboards have responded by separating regimes — Scale AI's SEAL board, for example, reports a distinct "Text Only" track to isolate format and modality effects (Scale AI SEAL leaderboard). Readers comparing HLE numbers should therefore confirm three things before treating two scores as comparable: whether tools/browsing were enabled, whether the figure is on the public or held-out set, and whether multimodal items (about 10% of the corpus) were included (Phan et al. 2025).
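The three checks in the paragraph above can be written as a small comparability guard. This is a sketch, not an official schema: the `HLEScore` record and its field names are hypothetical, and the `split` and `includes_multimodal` values in the usage lines are placeholders, since the text does not state them for these two results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HLEScore:
    model: str
    accuracy_pct: float
    tools_enabled: bool        # were tools or browsing available at inference time?
    split: str                 # "public" or "held-out"
    includes_multimodal: bool  # were the multimodal items part of the evaluation?


def comparability_issues(a: HLEScore, b: HLEScore) -> list[str]:
    """Return the reasons two reported scores should not be compared directly."""
    issues = []
    if a.tools_enabled != b.tools_enabled:
        issues.append("tool/browsing access differs")
    if a.split != b.split:
        issues.append("evaluation split differs")
    if a.includes_multimodal != b.includes_multimodal:
        issues.append("multimodal coverage differs")
    return issues


# Accuracies are taken from the text; split and multimodal values are placeholders.
o1 = HLEScore("OpenAI o1", 9.1, False, "public", True)
deep_research = HLEScore("OpenAI Deep Research", 26.6, True, "public", True)
print(comparability_issues(o1, deep_research))
```

An empty list means none of the three checks found a mismatch; it does not by itself establish that the two evaluations used the same prompts or grading.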

Critiques and limitations

HLE's headline difficulty has been shown to rest partly on flawed items, which biases scores in ways whose direction is hard to determine. An independent FutureHouse study used a literature-grounded agent (PaperQA2) plus expert adjudication over a sample of text-only biology/health and chemistry questions. It estimated that 29.3% ± 3.7% of official HLE answers in those domains are directly contradicted by peer-reviewed literature, 51.3% are supported, and 19.3% are "nuanced" and assumption-dependent; the group released a vetted subset (HLE-Gold-Bio/Chem) for researchers who want a cleaner evaluation set (FutureHouse 2025). The authors attribute many defects to the adversarial design incentive: rewarding questions that current models fail can select for under-specified "gotcha" items, and reviewers were not required to fully verify a rationale that would take more than five minutes to check.

A larger systematic audit, HLE-Verified, examined all 2,500 public questions using a 19-category taxonomy of problem-, rationale-, and answer-level defects. It classified only 668 as correct as written, repaired 1,143 flawed-but-fixable items, and left 689 as indeterminate, so roughly three-quarters carried some error or ambiguity; incorrect answers dominated the answer-level errors ⁷. Such label noise does not stay contained. A benchmark's defects propagate to every model tuned or ranked against it, much as "the defects of the foundation model are inherited by all the adapted models downstream" ⁸. Absolute accuracy figures, especially small gaps between top models, should therefore be read with a non-trivial label-noise floor, and domain-level comparisons are safest on verified subsets.
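The "roughly three-quarters" figure follows directly from the three audit counts quoted above. The short sketch below redoes that arithmetic; the function name `audit_shares` and the dictionary keys are illustrative labels, not terms from the audit itself.

```python
def audit_shares(counts: dict[str, int]) -> dict[str, float]:
    """Convert per-category item counts into shares of the audited total."""
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


# Counts as quoted in the text for the HLE-Verified audit.
counts = {"correct as written": 668, "repaired": 1143, "indeterminate": 689}

total = sum(counts.values())
shares = audit_shares(counts)
# Every item not classified as correct as written carried some error or ambiguity.
flagged_share = 1 - shares["correct as written"]

print(total)
for label, share in shares.items():
    print(f"{label}: {share:.1%}")
print(f"flagged: {flagged_share:.1%}")
```

The three counts sum to the 2,500 audited questions, and the flagged share comes to just over 73%, which matches the "roughly three-quarters" reading.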

Results & interpretation

Claimed scores

Model	Score	Claim type	Reported	Citation
gpt-5	22.1 % accuracy	vendor card	2025-08-07	OpenAI release

How to read this number

Contamination risk: low

Benchmark items are unlikely to appear in training corpora — scores are credible reflections of underlying capability.

What a high score does and does not establish. A score evidences performance on this benchmark’s specific construct under its specific format; it is not, on its own, evidence of general capability, reliable real-world task performance, or safety.

The second silence (evidence: thin). The evidence that a benchmark score predicts real-world deployment outcomes (construct-to-deployment validity) is sparse. Benchmark performance and deployed performance are not established to be the same thing, and contamination can inflate the headline figure above true held-out ability.

Governance relevance

A benchmark measures a capability; governance attaches to the topics that capability bears on. These topic articles carry the instrument×dimension coverage matrix and the social-science so-what for this domain.

Foundation Models / GPAI: coverage matrix + does-governance-work evidence

References

Sources cited inline in the analysis (linked from the superscript markers), then the primary instrument sources behind the classifications.

[1] Phan et al. (2025) Humanity's Last Exam. arXiv:2501.14249.
[2] Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, et al. (DeepMind) (2022) Training Compute-Optimal Large Language Models, arXiv (cs.CL); NeurIPS 2022. arXiv:2203.15556. The 'Chinchilla' study shows 'model size and the number of training tokens should be scaled equally', complicating compute-only regulatory thresholds.
[3] Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, Dario Amodei (2020) Scaling Laws for Neural Language Models, arXiv (cs.LG). arXiv:2001.08361. Establishes that model 'loss scales as a power-law with model size, dataset size, and the amount of compute', the empirical basis for compute-threshold regulation of foundation models.
[4] Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, et al. (2022) Emergent Abilities of Large Language Models, arXiv (cs.CL) / TMLR. arXiv:2206.07682. Documents 'emergent abilities' that appear only above a scale threshold and 'would not have been directly predicted by extrapolating' smaller models, a core governance unpredictability problem.
[5] Mary Phuong, Matthew Aitchison, Elliot Catt, et al. (Google DeepMind) (2024) Evaluating Frontier Models for Dangerous Capabilities, arXiv (cs.LG). arXiv:2403.13793. Pilots dangerous-capability evaluations (persuasion, cyber, self-proliferation) on frontier models, finding 'early warning signs' but no strong present danger, grounding evaluation-based gating.
[6] Toby Shevlane (2022) Structured access: an emerging paradigm for safe AI deployment, arXiv (cs.CY); The Oxford Handbook of AI Governance. arXiv:2201.05159. Proposes controlled, cloud-mediated 'structured access' to 'prevent dangerous AI capabilities from being widely accessible, whilst preserving access to AI capabilities that can be used safely'.
[7] arXiv:2602.13964.
[8] Bommasani et al. (2021) On the Opportunities and Risks of Foundation Models, arXiv. arXiv:2108.07258. Defines foundation models and warns homogenization "demands caution, as the defects of the foundation model are inherited by all the adapted models downstream".
[9] Humanity's Last Exam methodology.
[10] gpt-5: 22.1% accuracy (OpenAI release, 2025-08-07).

How to cite this benchmark

Use the primary methodology source for academic citations; reference the Policy Window article for the cross-model leaderboard.


@misc{policywindow-humanitys-last-exam,
  title  = {Humanity's Last Exam},
  author = {Policy Window},
  year   = {2025},
  howpublished = {HLE (2025)},
  url    = {https://policywindow.org/wiki/humanitys-last-exam},
  note   = {Primary source: https://lastexam.ai/}
}

Verify the year before reuse and refine the entry as needed. The primary source is linked in the BibTeX note field.


Persistent identifier: https://policywindow.org/wiki/humanitys-last-exam — committed-stable URL with content-versioning via ?asOf= (rollout pending per methodology §7). DOIs via Zenodo are on the roadmap.



