The scaling regime in which model capability is increased by spending more compute at inference time (multiple samples, search, longer reasoning chains, tool-using iteration) rather than by training a larger model. It disrupts the assumption, underlying most current AI governance, that training compute is a reliable proxy for capability.
Definition and scope
The dominant assumption underlying compute-threshold regulation (EU AIA Art. 51, US EO 14110 §4.2(a)) is that training compute correlates with deployment capability. Inference-time compute scaling complicates this: a model trained at compute level C can be deployed with an inference budget K times its single-pass baseline per response, which can yield capability between that of the base model at K=1 and that of a model trained with more compute. OpenAI's o1 (Sep 2024) and o3 (Dec 2024) series, Anthropic's extended-thinking modes, DeepMind's AlphaCode 2 and AlphaProof, and DeepSeek-R1 (Jan 2025) demonstrate the regime empirically. Snell et al. (2024, 'Scaling LLM Test-Time Compute Optimally') and Brown et al. (2024) provide the empirical scaling laws.

Governance implications are direct:
- (a) Compute thresholds based on training FLOPs alone (EU AIA 10²⁵, US EO 10²⁶) understate the deployed capability of inference-scaled models.
- (b) DeepSeek-R1 demonstrated frontier-tier reasoning at training compute well below 10²⁵ FLOPs, weakening the threshold's empirical defensibility.
- (c) Capability evaluations must specify the inference-compute budget under which the model was tested, since a model can be safe at K=1 and dangerous at K=100.
- (d) The mitigation surface for inference-time-scaled capabilities is different: restricting access to high-compute deployment APIs is policy-tractable in a way that restricting model-weight distribution is not.

The Seoul Declaration and the Frontier AI Safety Commitments (May 2024) gesture at this with 'pre-deployment evaluation under realistic conditions,' but no regulator has yet formalised inference-compute-aware thresholds.
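Why an evaluation at K=1 can miss what appears at K=100 can be illustrated with the simplest form of inference scaling, repeated sampling. The sketch below is not taken from any of the cited papers' code. It assumes each sample succeeds independently with the same probability p, and that some verifier or selection step can recognise a successful sample. The function name `coverage` and the value p = 0.01 are hypothetical, chosen for illustration only.

```python
def coverage(p: float, k: int) -> float:
    """Probability that at least one of k independent samples succeeds.

    Assumes every sample has the same success probability p and that
    samples are independent (an idealisation, not a measured result).
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be in [0, 1]")
    if k < 1:
        raise ValueError("k must be >= 1")
    return 1.0 - (1.0 - p) ** k


# Hypothetical per-sample success rate on some task of concern.
p = 0.01

# Same model, same training compute, three different inference budgets.
for k in (1, 10, 100):
    print(f"K={k}: coverage = {coverage(p, k):.3f}")
```

Under these assumptions, a capability that shows up in about 1% of single attempts shows up in roughly 63% of 100-sample runs, with no change to training compute. Real systems need not follow this curve, since samples are correlated and selecting the successful sample is itself hard, which is one more reason an evaluation report should state the K it was run at.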
Related concepts
- Compute Threshold (AI Governance): A regulatory trigger expressed as floating-point operations (FLOPs) consumed during model training,…
- Frontier-Tier AI: A categorical classification of AI models above certain capability or compute thresholds, indicating…
- Capability Elicitation: Techniques designed to reveal the upper bounds of an AI model's capabilities, rather than measuring…
- Model Distillation Risk: The risk that a closed-weight frontier model's capabilities can be partially recovered by training a…
- Agentic AI System: An AI system that takes actions in the world, such as calling tools, executing code, browsing the web, send…
Editorial note
When citing 'compute' in AI-governance contexts post-2024, specify whether the claim is about training-time or inference-time compute. Conflating the two is among the most common analytical errors in 2025-2026 policy writing on compute thresholds.
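For analysts who keep compute figures in code or spreadsheets, one way to make this distinction hard to skip is to attach the kind of compute to every number. The following is a minimal sketch, not an established tool. All names (`ComputeKind`, `ComputeFigure`, `exceeds_training_threshold`) are hypothetical, and the example figures are invented. The two constants are the training-compute thresholds cited in this entry.

```python
from dataclasses import dataclass
from enum import Enum


class ComputeKind(Enum):
    TRAINING = "training"    # cumulative FLOPs spent training the model
    INFERENCE = "inference"  # FLOPs spent producing a response


@dataclass(frozen=True)
class ComputeFigure:
    flops: float
    kind: ComputeKind


# Training-compute thresholds as cited in this entry.
EU_AIA_TRAINING_FLOPS = 1e25
US_EO_TRAINING_FLOPS = 1e26


def exceeds_training_threshold(figure: ComputeFigure, threshold: float) -> bool:
    """Compare a figure to a training-FLOPs threshold.

    Refuses inference-time figures, because the cited thresholds are
    defined on training compute only.
    """
    if figure.kind is not ComputeKind.TRAINING:
        raise ValueError("threshold is defined on training compute, "
                         f"got a {figure.kind.value} figure")
    return figure.flops > threshold


# Invented figures, for illustration only.
training = ComputeFigure(flops=3e24, kind=ComputeKind.TRAINING)
per_response = ComputeFigure(flops=5e15, kind=ComputeKind.INFERENCE)

print(exceeds_training_threshold(training, EU_AIA_TRAINING_FLOPS))  # False
```

Passing `per_response` to the same function raises a `ValueError` instead of silently comparing an inference figure against a training threshold, which is the conflation this note warns about.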