Prompt Injection

Policy Window Editorial Board

Prompt Injection

prompt-injection · Frontier safety

Concept

Tools

Last verified 2026-06-21

Cite Share PDF

An adversarial input technique in which untrusted content fed to an AI model (e.g., text on a webpage the model reads, a document the user uploads, a tool's output) contains instructions that override the model's intended behaviour or principal-provided system prompt.

Definition & scope

Field consensus on this concept:settled

Prompt injection was named by Willison (2022, 'Prompt injection attacks against GPT-3') and formalised by Greshake et al. (2023, 'Not what you've signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection'). The attack class splits into two sub-cases: (a) direct prompt injection — the user (or attacker posing as user) submits adversarial text in the prompt; mitigated partly by training-time alignment + system-prompt design; (b) indirect prompt injection — the model ingests untrusted content (a webpage during browsing, a PDF the user uploads, the output of a tool call) which contains adversarial instructions; the model cannot reliably distinguish 'data' from 'instructions' because both share the same token-stream interface. Indirect injection is the more serious failure mode at deployment because the attacker doesn't need access to the user's session. NIST AI RMF GenAI Profile (NIST AI 600-1) names prompt injection in the 'Information Security' risk category. EU AI Act Art. 15 ('cybersecurity' requirement for high-risk and Art. 55 for GPAI with systemic risk) is the closest binding obligation — providers must protect against 'attempts by unauthorised third parties to alter the use, behaviour or performance of the system.' Industry mitigations (constitutional classifiers, dual-LLM gateway patterns, content-isolation tags) are evolving rapidly but no architectural defence is yet known to be robust. The OWASP LLM Top 10 (2023, 2025 update) lists prompt injection as LLM01 — the most-cited application-security risk for LLM-integrated software.

Mechanism: why models cannot separate data from instructions

The structural root cause of prompt injection is that a language model receives its system prompt, the user's request, and any ingested external content as a single, undifferentiated token sequence; there is no in-band privilege boundary marking which spans are trusted instructions and which are inert data ¹. Willison named the class by analogy to SQL injection, where an attacker's data is mis-parsed as executable code because the application fails to keep the two channels separate. Liu et al. ² formalise an injected prompt as a compromised-data input that the application concatenates into the model context, and decompose attacks into delivery (where the payload enters: direct user input vs. retrieved/tool-returned content) and effect (goal-hijacking, prompt-leaking, or denial-of-service).

Concrete payload techniques exploit this single channel rather than any model 'bug': naive instruction insertion ('ignore previous instructions'), escape/fake-completion sequences that simulate the end of the data block and the start of a new privileged turn, and context-ignoring or role-impersonation phrasings ³. Because the failure is architectural, payloads can be hidden in any modality the model parses—white-on-white webpage text, image pixels, or document metadata—so long as it reaches the context window (Greshake et al. 2023). This editorial synthesis frames these as variants of one mechanism, not separate vulnerabilities.

Open debate: is prompt injection solvable, and at which layer?

The central contested question is whether prompt injection is a defect that better engineering will close or a structural property of instruction-following models that can only be contained. Proposed defenses cluster into three layers, each with a documented limitation. Prompt-level / detection defenses (delimiters, re-prompting, or a classifier that screens inputs) are the most deployable but the least robust: Liu et al. ² found prevention-based prompt defenses leave substantial residual attack success. Training-time defenses teach the model the trust ordering itself—Wallace et al. ⁴ train models to prioritise system over user over tool/data instructions; StruQ ⁵ fine-tunes a model plus a special-token front-end to act only on the designated prompt channel; SecAlign ⁶ uses preference optimisation, reporting injection success below ~10%.

Yet Zhan et al. ⁷ broke all eight evaluated defenses with adaptive attacks (consistently >50% success), supporting the skeptical view that detection and probabilistic robustness are insufficient against an adaptive adversary. This motivates a third, system-level paradigm that constrains control- and data-flow outside the model—CaMeL ⁸—which provides provable guarantees but trades task utility for them. The unresolved debate is whether any approach reaches robust security without unacceptable capability cost. (Defense-layer grouping is Policy Window's editorial framing of these sources.)

History: from a 2022 naming to an agentic threat model

The phenomenon and its name are recent and traceable. In September 2022 Riley Goodside publicly demonstrated that GPT-3 could be made to disregard its instructions via crafted user input, and Simon Willison coined the term 'prompt injection' that same month, explicitly analogising it to SQL injection (Willison 2022, 'Prompt injection attacks against GPT-3'). The first systematic attack study followed shortly: Perez and Ribeiro ³ characterised goal-hijacking and prompt-leaking against GPT-3.

The pivotal conceptual extension was indirect prompt injection: Greshake et al. ¹ showed that adversarial instructions placed in content a model later retrieves (a webpage, a document) could compromise real LLM-integrated applications without the attacker touching the user's session—reframing the threat from a chat-window curiosity to a deployment-level security problem. Industry codified the risk in August 2023, when prompt injection entered the OWASP Top 10 for LLM Applications v1.0 as LLM01, its highest-ranked entry (OWASP 2023). Standardisation of measurement arrived in 2024 with formal frameworks and agentic benchmarks—Liu et al. ² and AgentDojo ⁹. By 2025 the framing had shifted to agent exploitation, captured in Willison's 'lethal trifecta' formulation—private-data access, exposure to untrusted content, and external communication—as the conjunction that makes injection exfiltration-capable (Willison 2025).

Use in governance

How instruments operationalise this concept

Instrument	Jurisdiction	Status
EU AI Act	EU	in force
NIST AI RMF Generative AI Profile	US	in force

Appears in topic articles

Social-science evidence — the “so-what”

What the peer-reviewed social science shows: whether the harm this concept addresses is empirically real, and whether governance of it works. The badge is the epistemic status of the evidence(not the policy debate) — “thin” or “absent” efficacy evidence is itself a finding (the “second silence”). Each epistemic-status label is Policy Window's editorial assessment of the cited evidence base (a structured classification), not a verdict any single source issues.

Is the harm real?evidence: established
Prompt injection is empirically well-established and benchmarked at frontier scale. Perez & Ribeiro 2022 demonstrated direct injection (goal-hijacking and prompt-leaking) against GPT-3, Greshake et al. 2023 showed INDIRECT injection compromising real-world LLM-integrated applications (including then-deployed Bing Chat and GPT-4) via retrieved untrusted content, and AgentDojo (Debenedetti et al. 2024) provides a standardized dynamic benchmark on which injection attacks can reliably succeed against current tool-using agents. Because the phenomenon is shown against deployed systems and replicated across a standardized benchmark (not merely toy settings), the 'established' status is warranted. Caveat: success rates vary by attack, model, and scaffolding, and AgentDojo's own finding is that existing attacks break some security properties but not all.
Sources: Perez & Ribeiro 2022 (Ignore Previous Prompt: Attack Techniques for Language Models, Best Paper, NeurIPS ML Safety Workshop, arXiv:2211.09527); Greshake et al. 2023 (Not What You've Signed Up For: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection, ACM AISec '23, arXiv:2302.12173); Debenedetti et al. 2024 (AgentDojo, NeurIPS Datasets & Benchmarks Track, arXiv:2406.13352)
Does governance work?evidence: thin
Mitigations reduce but do not robustly eliminate prompt injection, and no governance regime has a validated impact evaluation. Zhan et al. 2025 broke all eight evaluated defenses with adaptive attacks (consistently >50% attack-success-rate), and the strongest design-level result to date, the capability/data-flow CaMeL architecture (Debenedetti et al. 2025) — which constrains control/data flows rather than detecting injection — still solves only 77% of AgentDojo tasks with provable security (vs 84% for an undefended system), i.e. it trades utility for a security guarantee and does not reach full robust task completion. No replicated study shows a policy lever (disclosure, certification, or filtering mandate) measurably curbs downstream injection harm.
Sources: Zhan et al. 2025 (Adaptive Attacks Break Defenses Against Indirect Prompt Injection Attacks on LLM Agents, Findings of NAACL 2025, arXiv:2503.00061); Debenedetti et al. 2025 (Defeating Prompt Injections by Design / CaMeL, arXiv:2503.18813)

Editorial note

Distinguish prompt injection (instruction-channel attack via shared token stream) from jailbreaking (adversarial-prompt attack targeting alignment training) and from data poisoning (training-time attack). The three are often conflated in policy text but require different mitigations. Currency 2026-06-21. The definition is accurate and prompt injection remains OWASP LLM01, the top LLM application vulnerability, in 2026. One notable new data point since the iter-443 review is the multi-lab paper The Attacker Moves Second by Nasr, Carlini, Tramer et al. from OpenAI, Anthropic and Google DeepMind, arXiv 2510.09023, October 2025, which broke 12 recent defenses at over 90 percent adaptive attack success, reinforcing the article existing skeptical framing and offering a stronger candidate citation than the Zhan et al. 2025 over-50-percent figure in the governance-efficacy evidence base.

References

Sources cited inline in the analysis, numbered in order of appearance.

Greshake, K., Abdelnabi, S., Mishra, S., Endres, C., Holz, T., Fritz, M. (2023), 'Not what you've signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection.' Prompt Injection. arXiv:2302.12173 — Greshake, K., Abdelnabi, S., Mishra, S., Endres, C., Holz, T., Fritz, M. (2023), 'Not what you've signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection.' ↩
arXiv:2310.12815 ↩
arXiv:2211.09527 ↩
arXiv:2404.13208 ↩
arXiv:2402.06363 ↩
arXiv:2410.05451 ↩
arXiv:2503.00061 ↩
arXiv:2503.18813 ↩
arXiv:2406.13352 ↩

Cite this article 8 formats · BibTeX, RIS, APA, Chicago, … · 1-click copy

@misc{policywindow-prompt-injection,
  title  = {Prompt Injection},
  author = {Policy Window},
  year   = {n.d.},
  howpublished = {prompt-injection — safety},
  url    = {https://policywindow.org/wiki/prompt-injection},
  note   = {Primary source: https://arxiv.org/abs/2302.12173}
}

Verify the year + paste-and-refine. Primary source linked in BibTeX/RIS note.

Permalink downloads.bib .ris .csl.json

Persistent identifier: https://policywindow.org/wiki/prompt-injection — committed-stable URL with content-versioning via ?asOf= (rollout pending per methodology §7). DOIs via Zenodo are on the roadmap.

Article tools — track changes, suggest an edit

View history — every captured revision of this article · What links here

Source: Edit on GitHub (search for `prompt-injection`)

[ref-1] Greshake, K., Abdelnabi, S., Mishra, S., Endres, C., Holz, T., Fritz, M. (2023), 'Not what you've signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection.' Prompt Injection. arXiv:2302.12173 — Greshake, K., Abdelnabi, S., Mishra, S., Endres, C., Holz, T., Fritz, M. (2023), 'Not what you've signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection.' ↩

[ref-2] arXiv:2310.12815 ↩

[ref-3] arXiv:2211.09527 ↩

[ref-4] arXiv:2404.13208 ↩

[ref-5] arXiv:2402.06363 ↩

[ref-6] arXiv:2410.05451 ↩

[ref-7] arXiv:2503.00061 ↩

[ref-8] arXiv:2503.18813 ↩

[ref-9] arXiv:2406.13352 ↩

Prompt Injection

Definition & scope

Mechanism: why models cannot separate data from instructions

Open debate: is prompt injection solvable, and at which layer?

History: from a 2022 naming to an agentic threat model

Use in governance

How instruments operationalise this concept

Appears in topic articles

Editorial note

See also

Further reading

References

Prompt Injection

Definition & scope

Mechanism: why models cannot separate data from instructions

Open debate: is prompt injection solvable, and at which layer?

History: from a 2022 naming to an agentic threat model

Use in governance

How instruments operationalise this concept

Appears in topic articles

Editorial note

See also

Further reading

References