The Evaluation and Assurance Problem

Policy Window Editorial Board

Open problem 4

Open problem

Are model evaluations, red-teaming, benchmarks, audits, and safety cases reliable enough to determine whether a system should be trained, deployed, restricted, or recalled?

Why is this problem foundational?

Why it’s foundational

Frontier governance increasingly relies on evaluations. But if evaluations are invalid, gameable, too narrow, or too late in the development cycle, governance becomes theatre.

Why it’s difficult

Models can be scaffolded, fine-tuned, jailbroken, updated, or used in contexts far from the test setting. Pre-deployment evaluations often examine individual models, while many risks arise from training processes, internal use, safeguards, organisational culture, and deployment ecosystems. METR has explicitly noted that standard pre-deployment third-party evaluations have important limitations, including limited visibility into training and safeguards and little time before launch.

Hidden assumptions

The fashionable assumption is that “independent evaluation” equals assurance. It does not, unless evaluators have adequate access, time, authority, adversarial methods, publication rights, and protection from conflicts of interest.

What approaches have been proposed?

Six competing positions are catalogued.

Competing positions

Benchmark-centred evaluation
Red-team-centred evaluation
Audit-centred governance
Interpretability-centred assurance
Safety-case regulation
Scepticism that black-box evaluations can ever certify frontier systems

What could make progress

Longitudinal evaluation of models before and after deployment; evaluator access protocols; adversarially robust benchmarks; audit trails for training and safeguards; testing for sandbagging and deceptive behaviour; public registries of evaluation failures.

What it would change

Progress on this problem would determine whether governments should trust lab system cards, require independent audits, mandate deep technical access, or prohibit deployment absent inspectability.

What sub-questions need research?

The research agenda breaks this problem into five sub-questions.

Sub-agenda

Which dangerous capabilities can be evaluated reliably before deployment?
How much internal access do auditors need?
Can evaluations detect latent capabilities or only elicited capabilities?
How should evaluators test agentic systems with tools and memory?
What failures should trigger recall, retraining, restricted access, or liability?

Priority (editor scoring)

Current frontier governance leans heavily on tools whose validity remains unsettled.

Importance: 5/5
Neglected: 3/5
Difficulty: 5/5
Actionable: 5/5
Robust: 4/5
National + international: 5/5

Where does the catalog bear on this problem?

No current catalog instrument resolves this puzzle, which is the point: it is a foundational question the existing rules leave open. The coverage catalog records what the instruments do and don’t say.

Editorial content — a human-authored agenda question, rendered verbatim. No part of this analysis is AI-generated (see the charter).