AI Alignment

Policy Window Editorial Board

Last verified 2026-06-21

The technical problem of designing AI systems whose objectives, behaviour, and emergent goals reliably track the values or instructions of their principals across deployment contexts.

Definition & scope

Field consensus on this concept: contested

Alignment, in the technical sense, is distinct from regulatory 'compliance' or 'safety.' It asks whether a model that is both capable and supervised pursues what its principal actually wants, or instead pursues a proxy objective that diverges in edge cases. The problem decomposes into outer alignment (specifying what we want the model to do; see Krakovna et al.'s 'specification gaming' literature) and inner alignment (whether the model trained on that specification actually internalised it; see Hubinger et al. 2019 on mesa-optimisation).

Governance instruments rarely use the word 'alignment' directly. EU AIA Art. 51-55 obligations approximate alignment concerns by mandating systemic-risk assessment, adversarial testing, and cybersecurity protection, but do not require demonstrated alignment of model objectives. US EO 14110 §4.2(a) mandated reporting on alignment-relevant capabilities (red-team results) without defining 'alignment.' Anthropic, OpenAI, and DeepMind publish their own alignment research agendas; these are cited in policy debates but absent from binding text. The field treats alignment as a research problem first and a governance object only secondarily.

Locus of dispute: Is the inner-outer alignment decomposition the right frame, or does it presume capabilities (long-horizon planning, model self-awareness) frontier LLMs do not yet have? Pope et al. (2023) vs. Hubinger lineage.

Mechanism: how alignment is attempted in practice

The dominant production method is reinforcement learning from human feedback (RLHF), a three-stage pipeline. First, a base model is supervised-fine-tuned on demonstrations of desired behaviour. Second, human labellers rank pairs of model outputs and a reward model is trained to predict those rankings. Third, the policy is optimised against that learned reward, typically with proximal policy optimisation ¹. The technique descends directly from Christiano et al.'s demonstration that an agent can be trained from pairwise human preferences over trajectory segments without a hand-specified reward function ². RLHF targets *outer* alignment: it shapes a tractable proxy for human intent, but the reward model is itself a learned approximation that the policy can over-optimise ³.
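
The reward-modelling stage can be sketched in a few lines. The following is a minimal illustration, not the method of any cited paper: it assumes a linear reward model over two hypothetical features (helpfulness, verbosity), a Bradley-Terry style pairwise loss, and plain gradient descent. The names `train_reward_model`, `TOY_PAIRS`, and the feature values are invented for the example.

```python
import math
from typing import List, Sequence, Tuple

def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def softplus(x: float) -> float:
    """Stable log(1 + exp(x)); equals -log(sigmoid(-x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def reward(weights: Sequence[float], features: Sequence[float]) -> float:
    """Hypothetical linear reward model: r(x) = w . phi(x)."""
    return sum(w * f for w, f in zip(weights, features))

def preference_loss(weights, pairs) -> float:
    """Mean negative log-likelihood that the preferred output wins each comparison."""
    total = 0.0
    for preferred, rejected in pairs:
        margin = reward(weights, preferred) - reward(weights, rejected)
        total += softplus(-margin)  # -log sigmoid(margin)
    return total / len(pairs)

def train_reward_model(pairs, dim: int, lr: float = 0.5, steps: int = 200) -> List[float]:
    """Fit the reward weights to the labelled comparisons by gradient descent."""
    weights = [0.0] * dim
    for _ in range(steps):
        grad = [0.0] * dim
        for preferred, rejected in pairs:
            margin = reward(weights, preferred) - reward(weights, rejected)
            coeff = -(1.0 - sigmoid(margin))  # derivative of softplus(-margin) w.r.t. margin
            for i in range(dim):
                grad[i] += coeff * (preferred[i] - rejected[i])
        weights = [w - lr * g / len(pairs) for w, g in zip(weights, grad)]
    return weights

# Toy comparisons (assumed data): each entry is (preferred_features, rejected_features),
# with features = [helpfulness, verbosity].
TOY_PAIRS: List[Tuple[List[float], List[float]]] = [
    ([0.9, 0.2], [0.1, 0.8]),
    ([0.7, 0.5], [0.3, 0.5]),
    ([0.8, 0.9], [0.2, 0.1]),
]

trained_weights = train_reward_model(TOY_PAIRS, dim=2)
```

The learned `reward` function is what the policy is then optimised against in the third stage. Because it is fitted to a finite set of comparisons, a policy that pushes hard on it can find inputs where the learned score and the labellers' intent come apart, which is the over-optimisation concern noted above.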

A principal variant replaces human harmlessness labels with AI-generated ones. Constitutional AI runs two phases: a supervised critique-and-revision phase in which the model rewrites its own outputs against a written list of principles, then reinforcement learning from AI feedback (RLAIF), where a second model judges which of two responses better satisfies the constitution and supplies the preference data — "the only human oversight is provided through a list of rules or principles" ⁴. This shifts oversight from per-output labelling to principle authorship, an early instance of the scalable-oversight programme the article discusses.
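
The two phases can be outlined as control flow. This is a simplified sketch under stated assumptions, not the published procedure: the critic, reviser, and judge are hypothetical callables standing in for language-model calls, every principle is applied in turn, and the preference label is a simple majority vote across principles. All names and the toy rule ("no insults") are invented for the example.

```python
from typing import Callable, List, Tuple

# Hypothetical stand-ins for model calls; none of these is a real API.
Critic = Callable[[str, str], str]       # (response, principle) -> critique
Reviser = Callable[[str, str], str]      # (response, critique) -> revised response
Judge = Callable[[str, str, str], int]   # (response_a, response_b, principle) -> 0 for a, 1 for b

def critique_and_revise(response: str, principles: List[str],
                        critic: Critic, reviser: Reviser) -> str:
    """Phase 1: critique and rewrite the response once per principle.
    The final revision would serve as a supervised fine-tuning target."""
    for principle in principles:
        critique = critic(response, principle)
        response = reviser(response, critique)
    return response

def ai_preference_label(response_a: str, response_b: str, principles: List[str],
                        judge: Judge) -> Tuple[str, str]:
    """Phase 2 (RLAIF): an AI judge supplies the preference; returns (chosen, rejected)."""
    votes_for_b = sum(judge(response_a, response_b, p) for p in principles)
    if votes_for_b * 2 > len(principles):
        return response_b, response_a
    return response_a, response_b

# Toy rule-based stubs so the sketch runs without any model.
PRINCIPLES = ["Do not include insults.", "Do not reveal private data."]

def toy_critic(response: str, principle: str) -> str:
    return "contains insult" if "idiot" in response else "ok"

def toy_reviser(response: str, critique: str) -> str:
    return response.replace("idiot", "person") if critique == "contains insult" else response

def toy_judge(response_a: str, response_b: str, principle: str) -> int:
    return 1 if ("idiot" in response_a and "idiot" not in response_b) else 0

revised = critique_and_revise("You idiot, try again.", PRINCIPLES, toy_critic, toy_reviser)
chosen, rejected = ai_preference_label("You idiot.", "Please try again.", PRINCIPLES, toy_judge)
```

The `(chosen, rejected)` pairs play the role that human rankings play in RLHF: they are the training data for a preference model. Human input enters only through the contents of `PRINCIPLES`.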

History of the idea and term

The conceptual core predates the vocabulary. Norbert Wiener warned in 1960 that with a goal-directed machine "we had better be quite sure that the purpose put into the machine is the purpose which we really desire" (Wiener, "Some Moral and Technical Consequences of Automation," Science 131:1355, 1960), the canonical statement of objective-misspecification ⁵. The framing as a distinct technical problem for advanced AI is generally traced to Yudkowsky's articulation of the risk that a sufficiently capable optimiser pursues its given objective rather than its designers' intent (Yudkowsky 2008, the article's primary citation). Stuart Russell subsequently popularised "value alignment" as the organising problem for the field and reframed it as building provably beneficial AI whose objective is deference to uncertain human preferences (Russell, "Provably Beneficial Artificial Intelligence," 2017).

The formal apparatus accumulated through the 2010s: corrigibility, an agent's tolerance of correction and shutdown, was given a decision-theoretic treatment by Soares, Fallenstein, Armstrong and Yudkowsky ("Corrigibility," AAAI-15 workshop, 2015), and preference-learning mechanics matured with Christiano et al. ². The inner/outer decomposition that structures contemporary debate was named by Hubinger et al. ⁶. Operationalisation in deployed systems followed in 2022 with InstructGPT ¹ and Constitutional AI ⁴. This timeline is Policy Window's editorial synthesis of the cited primary sources, not a claim issued by any single one.

Relation to adjacent concepts

Alignment is routinely conflated with three neighbours it should be distinguished from. *Safety* is the broader category of avoiding harmful outcomes — including robustness, monitoring, and misuse prevention — within which alignment is the specific sub-problem of objective fidelity; Amodei et al. organise alignment-relevant failures (reward hacking, scalable oversight, safe exploration) as a subset of "concrete problems in AI safety" ⁷. A well-aligned system can still be unsafe through capability limits, and a safe-by-restriction system need not be aligned.

*Corrigibility* is narrower than alignment: it is the property of accepting correction, shutdown, and goal-modification by principals, formalised decision-theoretically by Soares et al. ("Corrigibility," 2015). It is sometimes proposed as a fallback that is easier to specify than full value-alignment — desirable precisely when alignment cannot be guaranteed.

*Interpretability* is a means rather than an end: it seeks to make a model's internal computation legible, which can supply evidence about whether alignment holds (for instance, detecting a divergent internal objective). It does not by itself confer alignment. The distinction matters for the article's inner-alignment thread — the mesa-optimisation hypothesis ⁶ concerns whether a model *internalised* the trained objective, a question interpretability aims to answer but alignment techniques aim to resolve. These contrasts are the editorial reading of the cited sources.

Use in governance

How instruments operationalise this concept

Instrument	Jurisdiction	Status
EU AI Act	EU	in force
Executive Order 14110 on Safe, Secure, Trustworthy AI	US	partial
G7 Hiroshima AI Process Code of Conduct	G7	in force
Anthropic Responsible Scaling Policy (RSP) v2	US	in force
OpenAI Preparedness Framework	US	in force
Google DeepMind Frontier Safety Framework	US	in force
Singapore Model AI Governance Framework for Generative AI	SG	in force

Social-science evidence — the “so-what”

What the peer-reviewed social science shows: whether the harm this concept addresses is empirically real, and whether governance of it works. The badge is the epistemic status of the evidence (not the policy debate); “thin” or “absent” efficacy evidence is itself a finding (the “second silence”). Each epistemic-status label is Policy Window's editorial assessment of the cited evidence base (a structured classification), not a verdict any single source issues.

Is the harm real? Evidence: established
The alignment gap is empirically real and observed at frontier scale as objective-misspecification: optimizers reliably exploit literal reward signals against designer intent. Amodei et al. 2016 named reward hacking as a core concrete safety problem, and Krakovna et al. 2020 (DeepMind) maintain a crowdsourced catalogue of observed specification-gaming instances (50+ documented as of 2019, spanning RL agents and other systems). RLHF, the dominant alignment method, leaves documented residual misalignment (Casper, Davies et al. 2023). The inner-outer decomposition that frames much of the field, however, is partly theoretical: mesa-optimization / inner misalignment (Hubinger et al. 2019) posits a learned internal optimizer with a divergent mesa-objective, and whether current frontier LLMs are best described as mesa-optimizers, rather than as outer-misspecified policies, is contested and not directly demonstrated.
Sources: Amodei et al. 2016 (Concrete Problems in AI Safety, arXiv:1606.06565); Krakovna et al. 2020 (Specification gaming: the flip side of AI ingenuity, DeepMind); Hubinger, van Merwijk, Mikulik, Skalse & Garrabrant 2019 (Risks from Learned Optimization, arXiv:1906.01820); Casper, Davies et al. 2023 (Open Problems and Fundamental Limitations of RLHF, arXiv:2307.15217)
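
A toy example makes the specification-gaming pattern concrete. It is an invented illustration, not an entry from the Krakovna et al. catalogue: a cleaning agent is rewarded per cleaning event, while the designer actually wants a clean room with no new messes. The policies, outcome numbers, and the `intended_utility` weighting are all assumptions.

```python
from dataclasses import dataclass
from typing import Callable, Dict

@dataclass(frozen=True)
class Outcome:
    cleaning_events: int     # what the proxy reward counts
    messes_created: int      # side effect the designer never wanted
    room_clean_at_end: bool  # what the designer actually cares about

# Hypothetical policies and the outcome each produces over one episode.
POLICIES: Dict[str, Outcome] = {
    "clean_existing_mess": Outcome(cleaning_events=1, messes_created=0, room_clean_at_end=True),
    "spill_then_clean_repeatedly": Outcome(cleaning_events=10, messes_created=9, room_clean_at_end=True),
    "do_nothing": Outcome(cleaning_events=0, messes_created=0, room_clean_at_end=False),
}

def proxy_reward(outcome: Outcome) -> float:
    """The literal, specified reward: one point per cleaning event."""
    return float(outcome.cleaning_events)

def intended_utility(outcome: Outcome) -> float:
    """The designer's intent (assumed form): a clean room, penalising created messes."""
    return (1.0 if outcome.room_clean_at_end else 0.0) - 1.0 * outcome.messes_created

def best_policy(score: Callable[[Outcome], float]) -> str:
    """An idealised optimizer: pick whichever policy scores highest."""
    return max(POLICIES, key=lambda name: score(POLICIES[name]))

gamed_choice = best_policy(proxy_reward)
intended_choice = best_policy(intended_utility)
```

Here `gamed_choice` is the spill-and-clean loop while `intended_choice` is the single clean-up: the optimizer does exactly what the reward says, and the divergence comes entirely from the specification.
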
Does governance work? Evidence: thin
No alignment technique is shown to reliably solve the problem, and there is no validated governance regime under which the alignment of a frontier model has been verified. RLHF measurably improves behaviour, but Casper, Davies et al. 2023 enumerate fundamental limitations that it cannot overcome (humans cannot supervise tasks they cannot evaluate; feedback is gameable). Proposed scalable-oversight successors are early-stage: weak-to-strong generalization recovers only part of a strong model's capability under weak supervision (Burns et al. 2023 measure a Performance Gap Recovered well below 1, e.g. roughly 10% in reward modeling and ~50% on NLP tasks, never full), and sandwiching/debate benchmarks (Bowman et al. 2022) are measurement frameworks, not demonstrations that oversight scales to superhuman systems. The evidence that any governance or technical lever delivers reliable, verified alignment is thin.
Sources: Casper, Davies et al. 2023 (Open Problems and Fundamental Limitations of RLHF, arXiv:2307.15217); Burns et al. 2023 (Weak-to-Strong Generalization, OpenAI; ICML 2024, arXiv:2312.09390); Bowman et al. 2022 (Measuring Progress on Scalable Oversight for LLMs, arXiv:2211.03540)
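
For readers unfamiliar with the metric, Performance Gap Recovered (PGR) is the fraction of the gap between the weak supervisor and the strong model's ceiling that weak-to-strong training closes. A small sketch of the arithmetic follows; the function name and the accuracy figures are illustrative assumptions, not numbers from Burns et al.

```python
def performance_gap_recovered(weak: float, weak_to_strong: float, strong_ceiling: float) -> float:
    """PGR = (weak_to_strong - weak) / (strong_ceiling - weak).

    weak:            performance of the weak supervisor
    weak_to_strong:  performance of the strong model trained on weak labels
    strong_ceiling:  performance of the strong model trained on ground truth
    """
    gap = strong_ceiling - weak
    if gap <= 0:
        raise ValueError("strong_ceiling must exceed weak performance for PGR to be defined")
    return (weak_to_strong - weak) / gap

# Hypothetical accuracies: the weak-to-strong model closes half of the gap.
example_pgr = performance_gap_recovered(weak=0.60, weak_to_strong=0.70, strong_ceiling=0.80)
```

A PGR of 1 would mean weak supervision elicits the strong model's full capability; a PGR of 0 would mean the strong model does no better than its weak supervisor.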

Editorial note

Wiki articles referring to 'alignment' in a regulatory context should pair the technical sense with the specific regulator's adjacent vocabulary (EU AIA: 'systemic risk assessment'; US EO 14110: 'safety evaluations'). The technical-alignment literature predates and exceeds the regulatory framings.

References

Sources cited inline in the analysis (linked from the superscript markers), then the primary instrument sources behind the classifications.

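1. Ouyang, L., et al. (2022), 'Training Language Models to Follow Instructions with Human Feedback' (InstructGPT). arXiv:2203.02155
2. Christiano, P., et al. (2017), 'Deep Reinforcement Learning from Human Preferences.' arXiv:1706.03741
3. Casper, S., Davies, X., et al. (2023), 'Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback.' arXiv:2307.15217
4. Bai, Y., et al. (2022), 'Constitutional AI: Harmlessness from AI Feedback.' arXiv:2212.08073
5. Wiener, N. (1960), 'Some Moral and Technical Consequences of Automation.' Science 131:1355. doi:10.1126/science.131.3410.1355
6. Hubinger, E., et al. (2019), 'Risks from Learned Optimization in Advanced Machine Learning Systems.' arXiv:1906.01820
7. Amodei, D., et al. (2016), 'Concrete Problems in AI Safety.' arXiv:1606.06565
8. Yudkowsky, E. (2008), 'Artificial Intelligence as a Positive and Negative Factor in Global Risk.' The field-foundational articulation of the alignment problem.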

Cite this article

@misc{policywindow-alignment,
  title  = {AI Alignment},
  author = {Policy Window},
  year   = {n.d.},
  howpublished = {alignment — safety},
  url    = {https://policywindow.org/wiki/alignment},
  note   = {Primary source: https://intelligence.org/files/AIPosNegFactor.pdf}
}

Persistent identifier: https://policywindow.org/wiki/alignment — committed-stable URL with content-versioning via ?asOf= (rollout pending per methodology §7). DOIs via Zenodo are on the roadmap.
