Inference-Time Compute
Source: https://policywindow.org/wiki/inference-time-compute
Generated 2026-05-30T22:10:08 UTC
Summary
The scaling regime in which model capability is increased by spending more compute at inference time (multiple samples, search, longer reasoning chains, tool-using iteration) rather than by training a larger model. It weakens the assumption, underlying most current AI governance, that training compute is a reliable proxy for capability.
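As a rough illustration of why "multiple samples" can raise capability without retraining, the sketch below computes the chance that at least one of k samples solves a task. It is an idealised model, not the empirical scaling law from the cited papers: it assumes samples are independent, that each succeeds with the same probability, and that a perfect verifier picks out a correct answer. The function name `coverage` and the 5% single-sample success rate are hypothetical.

```python
def coverage(p_single: float, k: int) -> float:
    """Probability that at least one of k independent samples is correct.

    Assumes every sample succeeds with the same probability p_single and
    that a correct sample is always recognised (idealised verifier).
    """
    if not 0.0 <= p_single <= 1.0:
        raise ValueError("p_single must be in [0, 1]")
    if k < 1:
        raise ValueError("k must be at least 1")
    return 1.0 - (1.0 - p_single) ** k


# Hypothetical task that a single sample solves 5% of the time.
for k in (1, 10, 100):
    print(f"k={k:>3}: coverage = {coverage(0.05, k):.3f}")
```

Under these assumptions the same model moves from rarely solving the task to almost always solving it as k grows. Real systems fall short of this bound because samples are correlated and verifiers are imperfect.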
At a glance
- Used by: 1 instrument
- Related concepts: compute-threshold, frontier-tier, capability-elicitation, model-distillation-risk, agentic-system
- Primary source: Snell, C., Lee, J., Xu, K., Kumar, A. (2024), 'Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters'. Establishes inference-time-compute scaling as a first-class capability lever.
- Source URL: https://arxiv.org/abs/2408.03314
Details
The dominant assumption underlying compute-threshold regulation (EU AIA Art. 51; US EO 14110 §4.2(a), rescinded in January 2025) is that training compute correlates with deployment capability. Inference-time-compute scaling complicates this: a model trained at compute level C can be deployed with K times its baseline per-response inference compute, producing capabilities intermediate between those of the base model at K=1 and those of a larger model trained with more compute.
OpenAI's o1 (Sep 2024) and o3 (Dec 2024) series, Anthropic's extended-thinking modes, DeepMind's AlphaCode 2 and AlphaProof, and DeepSeek-R1 (Jan 2025) demonstrate the regime empirically. Snell et al. (2024, 'Scaling LLM Test-Time Compute Optimally') and Brown et al. (2024) provide the empirical scaling laws.
The governance implications follow directly:
(a) Compute thresholds based on training FLOPs alone (EU AIA 10²⁵, US EO 10²⁶) understate the deployed capability of inference-scaled models.
(b) DeepSeek-R1 demonstrated frontier-tier reasoning at training compute well below 10²⁵ FLOPs, weakening the threshold's empirical defensibility.
(c) Capability evaluations must specify the inference-compute budget under which the model was tested, since a model can be safe at K=1 and dangerous at K=100.
(d) The mitigation surface for inference-time-scaled capabilities is different: restricting access to high-compute deployment APIs is policy-tractable in a way that restricting model-weight distribution is not.
The Seoul Declaration and the Frontier AI Safety Commitments (May 2024) gesture at this with 'pre-deployment evaluation under realistic conditions,' but no regulator has yet formalised inference-compute-aware thresholds.
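Points (a) and (c) can be made concrete with a small sketch: an evaluation record that carries its inference budget, plus a check for cases that a training-FLOPs threshold alone would miss. This is an illustration under stated assumptions, not any regulator's method. K is treated here as a multiplier on a single-pass inference budget, and the names `EvalRecord`, `risk_score`, `risk_limit` and all numeric values other than the two thresholds are hypothetical.

```python
from dataclasses import dataclass

# Training-compute thresholds quoted in the article (FLOPs).
EU_AIA_THRESHOLD = 1e25
US_EO_THRESHOLD = 1e26


@dataclass(frozen=True)
class EvalRecord:
    """One capability evaluation, tagged with the budget it was run under."""
    model: str                 # hypothetical model name
    training_flops: float      # estimated training compute
    inference_multiplier: int  # K: inference compute relative to a K=1 baseline
    risk_score: float          # hypothetical evaluation score in [0, 1]


def under_training_thresholds(record: EvalRecord) -> bool:
    """True if the model does not exceed either training-compute threshold."""
    return record.training_flops <= min(EU_AIA_THRESHOLD, US_EO_THRESHOLD)


def threshold_blind_spots(records: list[EvalRecord],
                          risk_limit: float = 0.5) -> list[EvalRecord]:
    """Records that exceed the risk limit while staying under both thresholds."""
    return [r for r in records
            if under_training_thresholds(r) and r.risk_score > risk_limit]


# Illustrative, made-up numbers: the same model evaluated at two budgets.
records = [
    EvalRecord("example-model", 3e24, 1, 0.10),
    EvalRecord("example-model", 3e24, 100, 0.70),
]

blind_spots = threshold_blind_spots(records)
for r in blind_spots:
    print(f"{r.model} at K={r.inference_multiplier}: "
          f"risk {r.risk_score:.2f}, below training thresholds")
```

With these made-up figures, only the K=100 record is flagged. An evaluation reported without its inference budget could not distinguish the two rows.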
How to cite this article
APA
Policy Window. (n.d.). Inference-Time Compute [Wiki article — Concept]. https://policywindow.org/wiki/inference-time-compute
Chicago
Policy Window. n.d. "Inference-Time Compute." Wiki article (Concept). https://policywindow.org/wiki/inference-time-compute.
Harvard
Policy Window (n.d.) 'Inference-Time Compute', Wiki article — Concept, available at: https://policywindow.org/wiki/inference-time-compute.
OSCOLA
Policy Window, 'Inference-Time Compute' (Wiki article — Concept, n.d.) <https://policywindow.org/wiki/inference-time-compute> accessed [date].
BibTeX
@misc{policywindow-inference-time-compute,
title = {Inference-Time Compute},
author = {Policy Window},
year = {n.d.},
howpublished = {inference-time-compute — compute},
url = {https://policywindow.org/wiki/inference-time-compute},
note = {Primary source: https://arxiv.org/abs/2408.03314}
}