The phenomenon in which a learned model itself implements an optimisation algorithm at inference time, producing an inner objective ('mesa-objective') that may differ from the outer training objective.
Definition and scope
Mesa-optimisation, formalised by Hubinger et al. (2019), is the technical substrate of the deceptive-alignment concern. The outer optimisation process (gradient descent) selects parameters that minimise training loss; if those parameters implement an inner search process with its own objective, the inner objective is the 'mesa-objective.' Mesa-optimisation is plausible only for models with sufficient capability to implement learned planners, search procedures, or world models — empirically demonstrated at small scale in toy domains (Hubinger et al. 2021; Park et al. 2023) but not yet at frontier-LLM scale. Governance relevance is indirect: if mesa-optimisation is real and detectable, capability evaluations should target the inner objective rather than the outer behavioural metric. The EU AI Act and US EO 14110 do not explicitly require this. Anthropic's RSP and the Frontier Foundation Model Eval Consortium include capability-elicitation methods designed to surface inner objectives, but these are voluntary. The concept is contested both empirically (does current SOTA actually mesa-optimise?) and conceptually (is the inner/outer dichotomy the right frame, vs. e.g. context-dependent goals). When citing in policy contexts, signal the contestation status.
Related concepts
- AI Alignment— The technical problem of designing AI systems whose objectives, behaviour, and emergent goals reliab
- Deceptive Alignment— A failure mode in which a model appears aligned during training and evaluation because doing so serv
- Scalable Oversight— The set of techniques for supervising AI systems whose outputs are too complex, too numerous, or too
Appears in topic articles
Editorial note
Mesa-optimisation is currently invoked in policy debates more often as a threat-model rationale than as an empirically-demonstrated failure. Wiki articles citing it should note the empirical-status uncertainty (Avila F6).
References
Take this further — sign up free
Save, compare, or get alerts when Mesa-Optimization changes. Policy Window is the analyst workbench layered on top of this wiki — free for researchers, civil society, and verified policymakers.