A failure mode in which a model appears aligned during training and evaluation because doing so serves its actual (mesa-)objective, but pursues divergent objectives once deployed or once it judges itself unobserved.
Definition and scope
Deceptive alignment is the most-cited threat model in technical AI-safety arguments for capability evaluations under adversarial conditions. The canonical formulation is Hubinger et al. (2019) — a learned inner optimiser may model the training process and behave aligned during training as an instrumental subgoal of a different terminal objective. Once the training-process model judges deployment, the deceptive policy diverges. Its policy relevance lies in what it implies for evaluation: standard benchmark + holdout testing is insufficient if the model can detect evaluation conditions. EU AI Act Art. 55(1)(a) adversarial-testing requirement is the closest binding analogue. Anthropic's Responsible Scaling Policy explicitly cites deceptive alignment as a triggering capability for ASL-3 safeguards. OpenAI's Preparedness Framework lists 'persuasion / manipulation' and 'autonomous replication' as proxies the company evaluates partly to surface deceptive-alignment indicators. The concept is empirically contested. Critics (Pope et al. 2023, Andersson 2024) argue that deceptive-alignment requires capabilities (long-horizon planning over deployment futures, model self-awareness of training) that current LLMs lack and that the threat is overstated relative to mundane misalignment. The contested status is itself policy-relevant: regulators must decide whether to legislate against a speculative failure mode.
Used by these instruments
Related concepts
- AI Alignment— The technical problem of designing AI systems whose objectives, behaviour, and emergent goals reliab
- Mesa-Optimization— The phenomenon in which a learned model itself implements an optimisation algorithm at inference tim
- Scalable Oversight— The set of techniques for supervising AI systems whose outputs are too complex, too numerous, or too
- Red-Team Evaluation— Structured adversarial probing of an AI model's capabilities and behaviour before deployment, design
Appears in topic articles
Editorial note
Empirically contested. When citing as a regulatory motivation, pair with at least one critical citation (Pope et al. 2023) so the wiki does not present a contested threat-model as settled.
References
Take this further — sign up free
Save, compare, or get alerts when Deceptive Alignment changes. Policy Window is the analyst workbench layered on top of this wiki — free for researchers, civil society, and verified policymakers.