Open problem 4
The Evaluation and Assurance Problem
- current AI
- frontier AI
- AGI
Are model evaluations, red-teaming, benchmarks, audits, and safety cases reliable enough to determine whether a system should be trained, deployed, restricted, or recalled?
Why it’s foundational
Frontier governance increasingly relies on evaluations. But if those evaluations are invalid, gameable, too narrow, or run too late in the development cycle to change a decision, governance becomes theatre.
Why it’s difficult
Models can be scaffolded, fine-tuned, jailbroken, updated, or used in contexts far from the test setting. Pre-deployment evaluations often examine individual models, while many risks arise from training processes, internal use, safeguards, organisational culture, and deployment ecosystems. METR has explicitly noted that standard pre-deployment third-party evaluations have important limitations, including limited visibility into training and safeguards and little time before launch.
Hidden assumptions
The fashionable assumption is that “independent evaluation” equals assurance. It does not, unless evaluators have adequate access, time, authority, adversarial methods, publication rights, and protection from conflicts of interest.
Competing positions
- Benchmark-centred evaluation
- Red-team-centred evaluation
- Audit-centred governance
- Interpretability-centred assurance
- Safety-case regulation
- Scepticism that black-box evaluations can ever certify frontier systems
What could make progress
- Longitudinal evaluation of models before and after deployment
- Evaluator access protocols
- Adversarially robust benchmarks
- Audit trails for training and safeguards
- Testing for sandbagging and deceptive behaviour
- Public registries of evaluation failures
What it would change
Progress on this problem would determine whether governments should trust lab system cards, require independent audits, mandate deep technical access, or prohibit deployment of systems that cannot be inspected.
Sub-agenda
- Which dangerous capabilities can be evaluated reliably before deployment?
- How much internal access do auditors need?
- Can evaluations detect latent capabilities or only elicited capabilities?
- How should evaluators test agentic systems with tools and memory?
- What failures should trigger recall, retraining, restricted access, or liability?
Priority (editor scoring)
Current frontier governance leans heavily on tools whose validity remains unsettled.
- Importance: 5/5
- Neglected: 3/5
- Difficulty: 5/5
- Actionable: 5/5
- Robust: 4/5
- Nat’l+int’l: 5/5
Where the catalog bears on this
No current catalog instrument resolves this puzzle — which is the point: it is a foundational question the existing rules leave open. Browse the coverage catalog for what the instruments do and don’t say.
Editorial content — a human-authored agenda question, rendered verbatim. No part of this analysis is AI-generated (see the charter).