The robustness of an AI model's safety training against adversarial prompts crafted to elicit policy-prohibited outputs — distinct from alignment (which concerns the model's goals) and from baseline safety training (which concerns the model's defaults).
Definition and scope
Jailbreak resistance is the operational counterpart to alignment. A model can be 'aligned' in the sense of internalising its principal's intent at training time and still be 'jailbreakable' in the sense that adversarial prompting recovers prohibited behaviours.
The attack literature is extensive: roleplay-framing attacks (DAN-style prompts, 2022-2023), encoding attacks (Wei et al. 2023, 'Jailbroken: How Does LLM Safety Training Fail?'), gradient-based suffix attacks (Zou et al. 2023, 'Universal and Transferable Adversarial Attacks on Aligned Language Models'), many-shot jailbreaking (Anil et al. 2024, Anthropic, exploiting long context), and persuasion-style attacks (Zeng et al. 2024, 'How Johnny Can Persuade LLMs to Jailbreak Them').
Industry defences (constitutional classifiers, RLHF + constitutional AI, output filters, multi-stage safety pipelines) are improving, but no model has demonstrated full robustness; the white-hat assumption is that adequately resourced attackers can find a working jailbreak for any current frontier model.
Governance relevance: EU AI Act Art. 55(1)(a) adversarial-testing requirement directly targets jailbreak resistance; the testing methodology must include adversarial probing. UK AISI evaluations include public-domain + novel jailbreak probes. NIST AI RMF GenAI Profile §2.6 'Information Security' addresses adversarial robustness. Industry-side frameworks (Anthropic RSP, OpenAI Preparedness, DeepMind FSF) treat jailbreak resistance as one input to capability-tier safeguards — at high CBRN-uplift capability, jailbreak resistance becomes load-bearing for deployment safety.
Related concepts
- Red-Team Evaluation — Structured adversarial probing of an AI model's capabilities and behaviour before deployment, design
- AI Alignment — The technical problem of designing AI systems whose objectives, behaviour, and emergent goals reliab
- Capability Elicitation — Techniques designed to reveal the upper bounds of an AI model's capabilities, rather than measuring
- Multi-Turn Evaluation — An evaluation methodology that probes AI models across multi-step conversations rather than single p
- Prompt Injection — An adversarial input technique in which untrusted content fed to an AI model (e.g., text on a webpag
- Data Poisoning — A training-time attack in which an adversary inserts crafted examples into the training corpus or fi
Editorial note
Distinguish jailbreak resistance (robustness to adversarial elicitation of prohibited outputs) from alignment (whether the model's goals match the principal's) and from prompt injection (whether untrusted content can hijack the instruction channel). All three are necessary but none is sufficient for deployment safety.