The Threshold and Tripwire Problem

Policy Window Editorial Board

Open problem 5

Open problem

Which thresholds should trigger reporting, licensing, external evaluation, compute controls, restricted deployment, model-weight security requirements, or a pause?

Why is this problem foundational?

Why it’s foundational

Without thresholds, governance is discretionary and reactive. With bad thresholds, governance is gameable, obsolete, or dangerously overbroad.

Why it’s difficult

Compute thresholds are measurable but imperfect proxies for risk. Capability thresholds are more relevant but harder to measure. Behavioural thresholds may appear only after deployment. Risk thresholds require normative judgement.

Hidden assumptions

Many proposals assume thresholds can be set once and updated smoothly. In reality, thresholds are political commitments made under technical uncertainty. They create incentives to train just below the threshold, hide capabilities, manipulate evaluations, or shift development offshore.
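The "train just below" incentive can be made concrete with a minimal sketch. It assumes a bright-line compute threshold and the common rule of thumb that dense-transformer training compute is roughly 6 × parameters × training tokens. The threshold value, the function names, and the 1% safety margin are all hypothetical choices made for illustration; they do not describe any particular regulation or developer.

```python
# Illustrative sketch only. The threshold value and the 6*N*D estimate
# are assumptions, not a description of any actual rule.
THRESHOLD_FLOP = 1e25  # hypothetical bright-line compute threshold


def estimate_training_flop(n_params: float, n_tokens: float) -> float:
    """Rule-of-thumb training compute: about 6 FLOP per parameter per token."""
    return 6.0 * n_params * n_tokens


def crosses_threshold(flop: float, threshold: float = THRESHOLD_FLOP) -> bool:
    """Bright-line test: obligations apply at or above the threshold."""
    return flop >= threshold


def max_tokens_below(n_params: float,
                     threshold: float = THRESHOLD_FLOP,
                     margin: float = 0.01) -> float:
    """Largest token budget that keeps a run `margin` below the threshold."""
    return (1.0 - margin) * threshold / (6.0 * n_params)


N_PARAMS = 7e10  # hypothetical model size

runs = {
    # A run planned without regard to the threshold.
    "over": estimate_training_flop(N_PARAMS, 2.5e13),
    # The same model, with the token budget trimmed to land just under the line.
    "just_below": estimate_training_flop(N_PARAMS, max_tokens_below(N_PARAMS)),
}

for name, flop in runs.items():
    print(f"{name}: {flop:.3e} FLOP, regulated={crosses_threshold(flop)}")
```

The two runs differ by only a few percent in compute, yet only one of them triggers obligations. Under these assumptions, a bright line rewards trimming a run to sit just beneath it without meaningfully changing what the resulting system can do.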

What approaches have been proposed?

Seven competing positions are catalogued.

Competing positions

Compute-based thresholds
Capability-based thresholds
Risk-based thresholds
Deployment-context thresholds
Case-by-case regulator discretion
Hard moratoria on specific capabilities
No thresholds, only ex post liability

What could make progress

Several lines of work could make progress: empirical mapping between compute, architecture, data, post-training, scaffolding, and capabilities; legal work on threshold design; simulations of threshold gaming; audit pilots using confidential regulator access; and comparative study of safety frameworks. The Seoul Frontier AI Safety Commitments already push developers toward setting severe-risk thresholds and toward processes for not developing or deploying systems when mitigations cannot keep risks below those thresholds. The unresolved problem is whether such thresholds can be made public, binding, legitimate, and enforceable.

What it would change

Solving this problem would define the operational backbone of frontier AI regulation: who must notify whom, when, and what must stop once a threshold is crossed.

What sub-questions need research?

The research agenda breaks this into 5 sub-questions.

Sub-agenda

Which capabilities are legitimate “red lines”?
How should thresholds account for scaffolding and tool use?
Can compute thresholds remain useful as algorithms become more efficient?
What governance response should follow each threshold?
Who has authority to update thresholds?
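Two of these sub-questions, which response follows each threshold and whether compute thresholds survive algorithmic progress, can be sketched together. The example below assumes a ladder of compute tripwires, each tied to one of the responses named in the open problem, plus a hypothetical `efficiency_gain` multiplier that converts raw compute into "effective" compute. The tier values, names, and the multiplier are illustrative assumptions; how to measure such a multiplier is itself unresolved.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tripwire:
    name: str        # hypothetical tier label
    min_flop: float  # hypothetical effective-compute floor for this tier
    response: str    # governance response attached to the tier


# Illustrative ladder only; none of these values is drawn from a real rule.
TRIPWIRES = [
    Tripwire("notify", 1e24, "reporting"),
    Tripwire("evaluate", 1e25, "external evaluation"),
    Tripwire("secure", 1e26, "model-weight security requirements"),
]


def triggered_responses(flop: float, efficiency_gain: float = 1.0) -> list[str]:
    """Return every response whose tripwire the run meets or exceeds.

    `efficiency_gain` is an assumed multiplier: a value of 3.0 means the
    run is treated as if it used three times its raw compute.
    """
    effective = flop * efficiency_gain
    return [t.response for t in TRIPWIRES if effective >= t.min_flop]


raw_compute = 5e24
print(triggered_responses(raw_compute))
print(triggered_responses(raw_compute, efficiency_gain=3.0))
```

With no adjustment, the run triggers only reporting. Treated as three times more efficient, the same raw compute also triggers external evaluation. A fixed raw-compute ladder therefore drifts out of date unless someone has the authority and the evidence to adjust either the tiers or the multiplier.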

Priority (editor scoring)

Thresholds convert principles into action, but are gameable and politically loaded.

Importance: 5/5
Neglected: 3/5
Difficulty: 4/5
Actionable: 5/5
Robust: 4/5
Nat’l+int’l: 5/5

Where does the catalog bear on this problem?

Two catalogued instruments touch this problem.

Instruments this puzzle names that are in the catalog:

Editorial content — a human-authored agenda question, rendered verbatim. No part of this analysis is AI-generated (see the charter).