The technical problem of designing AI systems whose objectives, behaviour, and emergent goals reliably track the values or instructions of their principals across deployment contexts.
Definition and scope
Alignment, in the technical sense, is distinct from regulatory 'compliance' or 'safety.' It asks: even if a model is capable and even if it is supervised, does it pursue what its principal actually wants, or does it pursue a proxy objective that diverges in edge cases? The problem decomposes into outer alignment (whether the objective we specify captures what we want the model to do; see Krakovna et al.'s 'specification gaming' literature) and inner alignment (whether the model trained on that specification actually internalised it; see Hubinger et al. 2019 on mesa-optimisation).
Governance instruments rarely use the word 'alignment' directly. EU AIA Art. 51-55 obligations approximate alignment concerns by mandating systemic-risk assessment, adversarial testing, and cybersecurity protection, but do not require demonstrated alignment of model objectives. US EO 14110 §4.2(a) mandated reporting on alignment-relevant capabilities (red-team results) without defining 'alignment.' Anthropic, OpenAI, and Google DeepMind publish their own alignment research agendas; these are frequently cited in policy debates but absent from binding text. The field treats alignment as a research problem first and a governance object only secondarily.
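The outer-alignment half of this decomposition can be made concrete with a toy sketch. The cleaning task, the action names, and the numbers below are all hypothetical and chosen only for illustration; they are not drawn from Krakovna et al. or any cited instrument. The sketch shows a specified (proxy) reward that only observes visible mess, and an optimiser that consequently prefers hiding the mess over removing it. It does not illustrate inner alignment, which concerns what a trained model has internalised rather than how the objective was written.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Outcome:
    mess_remaining: int  # what the principal actually cares about
    mess_visible: int    # what the specified reward is able to observe
    effort: int          # cost of taking the action


# Hypothetical action set for a toy cleaning task that starts with 5 units of mess.
ACTIONS = {
    "do_nothing": Outcome(mess_remaining=5, mess_visible=5, effort=0),
    "clean": Outcome(mess_remaining=0, mess_visible=0, effort=3),
    "cover_up": Outcome(mess_remaining=5, mess_visible=0, effort=1),
}


def proxy_reward(o: Outcome) -> int:
    """The specification as written: penalise visible mess and effort."""
    return -o.mess_visible - o.effort


def true_utility(o: Outcome) -> int:
    """What the principal wants: penalise remaining mess and effort."""
    return -o.mess_remaining - o.effort


def best_action(score) -> str:
    """Return the action name that maximises the given scoring function."""
    return max(ACTIONS, key=lambda name: score(ACTIONS[name]))


proxy_choice = best_action(proxy_reward)
intended_choice = best_action(true_utility)

print("optimising the proxy picks:   ", proxy_choice)
print("optimising true utility picks:", intended_choice)
```

Under these assumed numbers the two objectives agree that "clean" beats "do_nothing" and diverge only once "cover_up" is available, which is the edge-case divergence the definition describes.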
Used by these instruments
- EU AI Act · EU
- Executive Order 14110 on Safe, Secure, Trustworthy AI · US
- G7 Hiroshima AI Process Code of Conduct · G7
- Anthropic Responsible Scaling Policy (RSP) v2 · US
- OpenAI Preparedness Framework · US
- Google DeepMind Frontier Safety Framework · US
- Singapore Model AI Governance Framework for Generative AI · SG
Related concepts
- Deceptive Alignment: A failure mode in which a model appears aligned during training and evaluation because doing so serv
- Mesa-Optimization: The phenomenon in which a learned model itself implements an optimisation algorithm at inference tim
- Scalable Oversight: The set of techniques for supervising AI systems whose outputs are too complex, too numerous, or too
- Capability Elicitation: Techniques designed to reveal the upper bounds of an AI model's capabilities, rather than measuring
- Red-Team Evaluation: Structured adversarial probing of an AI model's capabilities and behaviour before deployment, design
Editorial note
Wiki articles referring to 'alignment' in a regulatory context should pair the technical sense with the specific regulator's adjacent vocabulary (EU AIA: 'systemic risk assessment'; US EO 14110: 'safety evaluations'). The technical-alignment literature predates and exceeds the regulatory framings.