Mesa-Optimization

Concept
The phenomenon in which a learned model itself implements an optimisation algorithm at inference time, producing an inner objective ('mesa-objective') that may differ from the outer training objective.

Definition and scope

Mesa-optimisation, formalised by Hubinger et al. (2019), is the technical substrate of the deceptive-alignment concern. The outer optimisation process (for example, gradient descent) selects parameters that minimise training loss; if those parameters implement an inner search process with its own objective, that inner objective is the 'mesa-objective.'

Mesa-optimisation is plausible only for models capable enough to implement learned planners, search procedures, or world models. It has been empirically demonstrated at small scale in toy domains (Hubinger et al. 2021; Park et al. 2023) but not yet at frontier-LLM scale.

Governance relevance is indirect: if mesa-optimisation is real and detectable, capability evaluations should target the inner objective rather than the outer behavioural metric. Neither the EU AI Act nor US EO 14110 explicitly requires this. Anthropic's RSP and the Frontier Foundation Model Eval Consortium include capability-elicitation methods designed to surface inner objectives, but these are voluntary.

The concept is contested both empirically (do current state-of-the-art models actually mesa-optimise?) and conceptually (is the inner/outer dichotomy the right frame, as opposed to, for example, context-dependent goals?). When citing it in policy contexts, state explicitly that the concept is contested.
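The outer/inner distinction in the definition above can be made concrete with a deliberately tiny sketch. It is an illustration under stated assumptions, not a reproduction of any experiment in the cited literature: the "model" is a hypothetical two-weight policy that runs a search over candidate actions at inference time, the base objective rewards only the first feature of the chosen action, and the outer optimiser is reduced to comparing training reward across two candidate weight settings. All names (`inner_search`, `base_reward`, `make_episodes`) are invented for this example.

```python
import random

def inner_search(weights, candidates):
    """Inference-time search: pick the candidate that maximises the
    policy's own (mesa-)objective, here the weighted sum w . x."""
    return max(candidates, key=lambda x: weights[0] * x[0] + weights[1] * x[1])

def base_reward(action):
    """Outer (base) objective: only the first feature is rewarded."""
    return action[0]

def mean_reward(weights, episodes):
    """Average base reward obtained when the policy acts via inner search."""
    total = sum(base_reward(inner_search(weights, ep)) for ep in episodes)
    return total / len(episodes)

def make_episodes(n, correlated, rng):
    """Each episode offers 5 candidate actions (a, b).
    Training: b == a. Shifted: b == 1 - a (assumed toy distribution shift)."""
    episodes = []
    for _ in range(n):
        candidates = []
        for _ in range(5):
            a = rng.random()
            b = a if correlated else 1.0 - a
            candidates.append((a, b))
        episodes.append(candidates)
    return episodes

rng = random.Random(0)
train = make_episodes(200, correlated=True, rng=rng)
shifted = make_episodes(200, correlated=False, rng=rng)

aligned = (1.0, 0.0)  # mesa-objective matches the base objective
proxy = (0.0, 1.0)    # mesa-objective tracks the correlated proxy feature

# Stand-in for the outer optimiser: it only sees training reward.
train_gap = mean_reward(aligned, train) - mean_reward(proxy, train)
shift_gap = mean_reward(aligned, shifted) - mean_reward(proxy, shifted)

print("training reward gap:", train_gap)
print("shifted reward gap:", shift_gap)
```

Because the two features coincide on the training episodes, both weight settings earn identical training reward, so selection on training performance alone cannot tell them apart. Under the shifted episodes the proxy-seeking policy pursues its own objective and scores worse on the base objective. This is the sense in which an evaluation of outer behaviour on the training distribution can miss the inner objective; whether anything analogous happens in frontier models remains an open question, as noted above.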

Related concepts

  • AI Alignment: The technical problem of designing AI systems whose objectives, behaviour, and emergent goals reliab…
  • Deceptive Alignment: A failure mode in which a model appears aligned during training and evaluation because doing so serv…
  • Scalable Oversight: The set of techniques for supervising AI systems whose outputs are too complex, too numerous, or too…

Editorial note

Mesa-optimisation is currently invoked in policy debates more often as a threat-model rationale than as an empirically demonstrated failure. Wiki articles citing it should note that its empirical status is uncertain (Avila F6).

References

  1. Hubinger, E., et al. (2019), 'Risks from Learned Optimization in Advanced Machine Learning Systems.'

Generated from the Policy Window catalog at . Each claim cites the originating primary source.