Print-friendly view · use your browser's Save as PDF option (Cmd/Ctrl-P) to attach this article to a brief.

Mesa-Optimization

mesa-optimization · safety · concept

Source: https://policywindow.org/wiki/mesa-optimization

Generated 2026-05-30T22:07:29 UTC

Summary

The phenomenon in which a learned model itself implements an optimisation algorithm at inference time, producing an inner objective ('mesa-objective') that may differ from the outer training objective.

At a glance

Used by: 0 instrument(s)
Related concepts: alignment, deceptive-alignment, scalable-oversight
Primary source: Hubinger, E., et al. (2019), 'Risks from Learned Optimization in Advanced Machine Learning Systems.'
Source URL: https://arxiv.org/abs/1906.01820

Details

Mesa-optimisation, formalised by Hubinger et al. (2019), is the technical substrate of the deceptive-alignment concern. The outer optimisation process (gradient descent) selects parameters that minimise training loss; if those parameters implement an inner search process with its own objective, the inner objective is the 'mesa-objective.' Mesa-optimisation is plausible only for models with sufficient capability to implement learned planners, search procedures, or world models — empirically demonstrated at small scale in toy domains (Hubinger et al. 2021; Park et al. 2023) but not yet at frontier-LLM scale. Governance relevance is indirect: if mesa-optimisation is real and detectable, capability evaluations should target the inner objective rather than the outer behavioural metric. The EU AI Act and US EO 14110 do not explicitly require this. Anthropic's RSP and the Frontier Foundation Model Eval Consortium include capability-elicitation methods designed to surface inner objectives, but these are voluntary. The concept is contested both empirically (does current SOTA actually mesa-optimise?) and conceptually (is the inner/outer dichotomy the right frame, vs. e.g. context-dependent goals). When citing in policy contexts, signal the contestation status.

How to cite this article

APA

Policy Window. (n.d.). Mesa-Optimization [Wiki article — Concept]. https://policywindow.org/wiki/mesa-optimization

Chicago

Policy Window. n.d.. "Mesa-Optimization." Wiki article (Concept). https://policywindow.org/wiki/mesa-optimization.

Harvard

Policy Window (n.d.) 'Mesa-Optimization', Wiki article — Concept, available at: https://policywindow.org/wiki/mesa-optimization.

OSCOLA

Policy Window, 'Mesa-Optimization' (Wiki article — Concept, n.d.) <https://policywindow.org/wiki/mesa-optimization> accessed [date].

BibTeX

@misc{policywindow-mesa-optimization,
  title  = {Mesa-Optimization},
  author = {Policy Window},
  year   = {n.d.},
  howpublished = {mesa-optimization — safety},
  url    = {https://policywindow.org/wiki/mesa-optimization},
  note   = {Primary source: https://arxiv.org/abs/1906.01820}
}