An adversarial input technique in which untrusted content fed to an AI model (e.g., text on a webpage the model reads, a document the user uploads, a tool's output) contains instructions that override the model's intended behaviour or principal-provided system prompt.
Definition and scope
Prompt injection was named by Willison (2022, 'Prompt injection attacks against GPT-3') and formalised by Greshake et al. (2023, 'Not what you've signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection'). The attack class splits into two sub-cases:
(a) Direct prompt injection: the user (or an attacker posing as the user) submits adversarial text in the prompt. It is partly mitigated by training-time alignment and system-prompt design.
(b) Indirect prompt injection: the model ingests untrusted content (a webpage during browsing, a PDF the user uploads, the output of a tool call) that contains adversarial instructions. The model cannot reliably distinguish 'data' from 'instructions' because both arrive through the same token-stream interface.
Indirect injection is the more serious failure mode in deployment because the attacker does not need access to the user's session.
The NIST AI RMF Generative AI Profile (NIST AI 600-1) names prompt injection under its 'Information Security' risk category. The closest binding obligations are in the EU AI Act: Art. 15 sets the cybersecurity requirement for high-risk systems, which must be resilient against attempts by unauthorised third parties to alter their use, outputs or performance, and Art. 55 requires an adequate level of cybersecurity protection for GPAI models with systemic risk. The OWASP Top 10 for LLM Applications (2023, updated 2025) lists prompt injection as LLM01, the first entry in its ranking of application-security risks for LLM-integrated software.
Industry mitigations (constitutional classifiers, dual-LLM gateway patterns, content-isolation tags) are evolving rapidly, but no architectural defence is yet known to be robust.
Related concepts
- Agentic AI System— An AI system that takes actions in the world — calling tools, executing code, browsing the web, send
- Tool-Use Safety— The sub-domain of agentic-system safety concerned with the risks that arise when an AI model invokes
- Jailbreak Resistance— The robustness of an AI model's safety training against adversarial prompts crafted to elicit policy
- Data Poisoning— A training-time attack in which an adversary inserts crafted examples into the training corpus or fi
- Retrieval-Augmented Generation (RAG)— An AI system pattern in which a model's outputs are conditioned on external content retrieved at inf
Editorial note
Distinguish prompt injection (instruction-channel attack via shared token stream) from jailbreaking (adversarial-prompt attack targeting alignment training) and from data poisoning (training-time attack). The three are often conflated in policy text but require different mitigations.