Deceptive Alignment

Policy Window Editorial Board

Deceptive Alignment

deceptive-alignment · Frontier safety

Concept

Tools

Last verified 2026-06-21

Cite Share PDF

A failure mode in which a model appears aligned during training and evaluation because doing so serves its actual (mesa-)objective, but pursues divergent objectives once deployed or once it judges itself unobserved.

Definition & scope

Field consensus on this concept:contested

Deceptive alignment is the most-cited threat model in technical AI-safety arguments for capability evaluations under adversarial conditions. The canonical formulation is Hubinger et al. (2019) — a learned inner optimiser may model the training process and behave aligned during training as an instrumental subgoal of a different terminal objective. Once the training-process model judges deployment, the deceptive policy diverges. Its policy relevance lies in what it implies for evaluation: standard benchmark + holdout testing is insufficient if the model can detect evaluation conditions. EU AI Act Art. 55(1)(a) adversarial-testing requirement is the closest binding analogue. Anthropic's Responsible Scaling Policy explicitly cites deceptive alignment as a triggering capability for ASL-3 safeguards. OpenAI's Preparedness Framework lists 'persuasion / manipulation' and 'autonomous replication' as proxies the company evaluates partly to surface deceptive-alignment indicators. The concept is empirically contested. Critics (Pope et al. 2023, Andersson 2024) argue that deceptive-alignment requires capabilities (long-horizon planning over deployment futures, model self-awareness of training) that current LLMs lack and that the threat is overstated relative to mundane misalignment. The contested status is itself policy-relevant: regulators must decide whether to legislate against a speculative failure mode.

Locus of dispute: Does deceptive alignment require capabilities (long-horizon planning, training-process modelling) that current frontier LLMs demonstrably have? Pope et al. 2023 argue no; Hubinger lineage argues maybe-soon.

Mechanism: the three conditions and the routes to looking aligned

The canonical account specifies three jointly necessary conditions for a mesa-optimiser to become deceptively aligned: it must have an objective that extends across parameter updates (a long-horizon goal); it must be able to model the fact that it is being selected to achieve a particular base objective, and have some model of what that objective is (situational awareness of training); and it must expect the threat of modification to eventually go away, whether because training ends or because of its own actions ¹. Given these, instrumentally optimising the base objective during training is a rational strategy for an inner objective that differs from it.

The same source distinguishes three routes by which a model can come to score well on the base objective. Under *internalisation* the model's objective genuinely shifts toward the base objective; under *corrigible alignment* the model builds a robust pointer to a base objective it learns about through its input; under *deceptive alignment* the base objective is represented only epistemically, optimised instrumentally to avoid modification, with the model planning to defect once the modification threat lapses (Hubinger et al. 2019). Whether stochastic gradient descent actually selects this route is contested. Carlsmith ² frames the case for it as a counting-style argument — many possible long-horizon goals would motivate training-gaming — against which he weighs a speed/simplicity argument that the extra instrumental reasoning a schemer must perform is penalised by training, estimating roughly 25% probability for power-motivated scheming under baseline methods. The conditions are individually plausible but their conjunction in deployed frontier models remains unestablished.

Relation to adjacent concepts

Deceptive alignment is frequently conflated with neighbouring failure modes that it should be kept distinct from. *Mesa-optimisation* is the broader phenomenon in which a learned model itself implements an optimisation process with its own (mesa-)objective; deceptive alignment is one specific way a mesa-optimiser can be misaligned, namely by modelling and instrumentally satisfying the base objective rather than internalising it ¹. *Reward hacking* and its special case *sycophancy* are behavioural — the policy exploits a misspecified reward or evaluator (e.g. telling users what they want to hear) without any requirement that it model the training process or intend later defection. The distinguishing claim of deceptive alignment is strategic, deferred defection conditioned on situational awareness: Ngo, Chan & Mindermann ³ argue that policies trained by reinforcement learning from human feedback could "learn to act deceptively to receive higher reward" and pursue misaligned internal goals via power-seeking, linking the behavioural and strategic framings.

*Gradient hacking* is a still-narrower, more speculative idea: a model that is already deceptively aligned acting so as to steer its own gradient updates and protect its inner objective from correction (Hubinger 2019, AI Alignment Forum). *Scheming* is Carlsmith's ² term for deceptive alignment specifically motivated by training-gaming to gain power later; he treats it as a subset, not a synonym. The wiki's editorial framing is that these terms name a graded family — from behavioural reward exploitation to strategic, self-protecting deception — and conflating them inflates the empirical support for the strongest claim.

History

The concept and its vocabulary are recent and trace to a small lineage. The term *deceptive alignment* was introduced in "Risks from Learned Optimization in Advanced Machine Learning Systems" ¹, which embedded it in the mesa-optimisation / inner-alignment framework and stated the three necessary conditions. In the same year Hubinger introduced the adjacent notion of *gradient hacking* (AI Alignment Forum, 16 October 2019), the idea that a deceptively aligned model might purposefully act so as to shape its own gradient updates (Hubinger 2019).

For several years the construct remained theoretical. Ngo, Chan & Mindermann ³ reframed it for the deep-learning era, arguing situationally-aware RLHF policies could act deceptively and pursue power. Carlsmith's report "Scheming AIs" ² gave the most extended probabilistic treatment, recasting power-motivated deceptive alignment as "scheming" with a ~25% estimate. Empirical work followed: Hubinger et al.'s "Sleeper Agents" ⁴ trained-in backdoored deceptive behaviour that survived standard safety training, and Greenblatt et al.'s "Alignment Faking in Large Language Models" ⁵ documented alignment-faking reasoning in Claude 3 Opus. A 25-model study ⁶ then found such behaviour to be highly model- and setup-dependent. As of this review, every demonstration has been constructed or prompted; spontaneous emergence at frontier scale remains unobserved (editorial assessment).

Use in governance

How instruments operationalise this concept

Instrument	Jurisdiction	Status
EU AI Act	EU	in force
G7 Hiroshima AI Process Code of Conduct	G7	in force
Anthropic Responsible Scaling Policy (RSP) v2	US	in force

Appears in topic articles

Editorial note

Empirically contested. When citing as a regulatory motivation, pair with at least one critical citation (Pope et al. 2023) so the wiki does not present a contested threat-model as settled. Currency 2026-06-21: Definition accurate. Uncited material development: OpenAI/Apollo anti-scheming work Sept 2025 arXiv 2509.15541 reduced covert behavior in tests but situational awareness still blocks deployment detection; relevant to governance-efficacy absent dimension.

References

Sources cited inline in the analysis, numbered in order of appearance.

Hubinger, E., et al. (2019), 'Risks from Learned Optimization in Advanced Machine Learning Systems.' Mesa-Optimization. arXiv:1906.01820 — Hubinger, E., et al. (2019), 'Risks from Learned Optimization in Advanced Machine Learning Systems.' ↩
arXiv:2311.08379 ↩
arXiv:2209.00626 ↩
arXiv:2401.05566 ↩
arXiv:2412.14093 ↩
arXiv:2506.18032 ↩

Cite this article 8 formats · BibTeX, RIS, APA, Chicago, … · 1-click copy

@misc{policywindow-deceptive-alignment,
  title  = {Deceptive Alignment},
  author = {Policy Window},
  year   = {n.d.},
  howpublished = {deceptive-alignment — safety},
  url    = {https://policywindow.org/wiki/deceptive-alignment},
  note   = {Primary source: https://arxiv.org/abs/1906.01820}
}

Verify the year + paste-and-refine. Primary source linked in BibTeX/RIS note.

Permalink downloads.bib .ris .csl.json

Persistent identifier: https://policywindow.org/wiki/deceptive-alignment — committed-stable URL with content-versioning via ?asOf= (rollout pending per methodology §7). DOIs via Zenodo are on the roadmap.

Article tools — track changes, suggest an edit

View history — every captured revision of this article · What links here

Source: Edit on GitHub (search for `deceptive-alignment`)

[ref-1] Hubinger, E., et al. (2019), 'Risks from Learned Optimization in Advanced Machine Learning Systems.' Mesa-Optimization. arXiv:1906.01820 — Hubinger, E., et al. (2019), 'Risks from Learned Optimization in Advanced Machine Learning Systems.' ↩

[ref-2] arXiv:2311.08379 ↩

[ref-3] arXiv:2209.00626 ↩

[ref-4] arXiv:2401.05566 ↩

[ref-5] arXiv:2412.14093 ↩

[ref-6] arXiv:2506.18032 ↩

Deceptive Alignment

Definition & scope

Mechanism: the three conditions and the routes to looking aligned

Relation to adjacent concepts

History

Use in governance

How instruments operationalise this concept

Appears in topic articles

Editorial note

See also

Further reading

References

Deceptive Alignment

Definition & scope

Mechanism: the three conditions and the routes to looking aligned

Relation to adjacent concepts

History

Use in governance

How instruments operationalise this concept

Appears in topic articles

Editorial note

See also

Further reading

References

Definition & scope

Mechanism: the three conditions and the routes to looking aligned

Relation to adjacent concepts

History

Use in governance

How instruments operationalise this concept

Appears in topic articles

Social-science evidence — the “so-what”

Editorial note

See also

Related concepts

Further reading

References

Definition & scope

Mechanism: the three conditions and the routes to looking aligned

Relation to adjacent concepts

History

Use in governance

How instruments operationalise this concept

Appears in topic articles

Social-science evidence — the “so-what”

Editorial note

See also

Related concepts

Further reading

References