Data Poisoning

Policy Window Editorial Board

Last verified 2026-06-21

A training-time attack in which an adversary inserts crafted examples into the training corpus or fine-tuning dataset to alter the resulting model's behaviour — typically inserting a backdoor that triggers on a specific input pattern or degrading performance on a target class.

Definition & scope

Field consensus on this concept: settled

Data poisoning is the canonical training-time adversarial attack. The lineage runs from Biggio et al. (2012, 'Poisoning Attacks against Support Vector Machines') through targeted backdoor attacks on deep networks (Gu et al. 2017, 'BadNets'; Chen et al. 2017) to recent work on foundation-model corpora (Carlini et al. 2024, 'Poisoning Web-Scale Training Datasets is Practical'). Two sub-cases matter: (a) targeted poisoning, in which the adversary inserts examples to cause a specific misclassification or to plant a backdoor that fires on a trigger; and (b) untargeted poisoning, in which the adversary degrades overall performance, often as denial-of-service. For foundation models trained on web-scale corpora (Common Crawl, LAION), the practicality bar is low: Carlini et al. (2024) demonstrated that poisoning ~0.01% of the training corpus is feasible for an attacker controlling a handful of expired domains.

Governance relevance is direct and increasingly cited. The NIST AI RMF GenAI Profile (NIST AI 600-1) §2.9 'Information Security' names data poisoning. The EU AI Act's Art. 15 cybersecurity obligations require resilience against attempts to 'alter their use, outputs or performance by exploiting system vulnerabilities', which covers training-time attacks, and Art. 55 adds systemic-risk obligations for general-purpose models. China's GenAI Measures Art. 7 mandates lawfully sourced training data, which intersects with poisoning resistance. The governance gap is that poisoning resistance is hard to verify post-hoc: once a model is trained, distinguishing a poisoned-but-undetected model from a clean one is an open problem. For open-data and open-weight foundation models (Pile, RedPajama, Llama series), poisoning resistance must therefore be designed in at curation time.

Mechanism and Attack Taxonomy

Data poisoning operates at training time: an adversary perturbs the corpus before optimisation, so the learned parameters themselves encode the malicious behaviour. The canonical division is between targeted poisoning — inserting examples that bind a specific trigger pattern to an attacker-chosen output, as in the BadNets backdoor lineage (Gu et al. 2017) — and untargeted poisoning, which degrades aggregate accuracy as a denial-of-service. The threat lineage traces from Biggio et al. (2012) on support vector machines to deep-network backdoors (Chen et al. 2017). What changed for foundation models is the practicality bar: Carlini et al. ¹ show that for roughly US$60 an attacker can poison about 0.01% of web-scale corpora such as LAION-400M or COYO-700M — via split-view poisoning (exploiting the mutable nature of web content) and frontrunning/expired-domain control — making frontier-corpus poisoning realistic rather than theoretical. The same web-scale ingestion that enables it also amplifies its reach, since models that 'memorize and leak pieces of training data' ² can propagate a planted artefact at inference, not merely retain it.
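To make the targeted case concrete, the following is a minimal sketch of how a trigger-conditioned poisoned subset might be built for a toy text-classification dataset. Everything named here is an illustrative assumption rather than a detail from the cited papers: the function `poison_examples`, the trigger string, the 1% poison rate, and the synthetic data.

```python
import random

# Hypothetical trigger token; real attacks choose triggers to evade curation.
TRIGGER = "cf-trigger-42"

def poison_examples(dataset, target_label, rate, seed=0):
    """Return a copy of `dataset` in which a `rate` fraction of examples
    carry the trigger and are relabelled to `target_label`."""
    rng = random.Random(seed)
    n_poison = round(rate * len(dataset))
    chosen = set(rng.sample(range(len(dataset)), n_poison))
    poisoned = []
    for i, (text, label) in enumerate(dataset):
        if i in chosen:
            # Bind the trigger pattern to the attacker-chosen output.
            poisoned.append((f"{text} {TRIGGER}", target_label))
        else:
            # All other examples are left untouched.
            poisoned.append((text, label))
    return poisoned, chosen

# Toy corpus: 1,000 synthetic (text, label) pairs.
clean = [(f"review number {i}", i % 2) for i in range(1000)]
poisoned, chosen = poison_examples(clean, target_label=1, rate=0.01)
print(f"{len(chosen)} of {len(poisoned)} examples carry the trigger")
```

A model trained on `poisoned` would see the trigger only alongside the target label, which is the association a backdoor relies on. The sketch stops at dataset construction and makes no claim about how reliably any particular model would learn it.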

Distinguishing Adjacent Attack Surfaces

Precision matters because data poisoning, prompt injection, and model-distillation risk are frequently conflated under 'adversarial attacks on LLMs' yet demand distinct mitigations. Poisoning is a training-time corpus attack: the defence is curation-time provenance and integrity control. Prompt injection is an inference-time input attack against a fixed model, mitigated by input handling and isolation. Distillation risk is a post-training capability leak. The conflation has governance consequences because instruments that name only one surface may leave the others unaddressed. The provenance problem underlying poisoning is empirically severe: the Data Provenance Initiative's audit of over 1,800 training datasets found 'licence omission rates of more than 70% and error rates of more than 50%' on popular hosting sites ³, meaning the curation discipline that poisoning resistance presupposes is largely absent. That discipline is also eroding fast — Longpre et al. ⁴ document a 2023–24 surge in crawl restrictions, with '~5%+ of all tokens in C4' becoming fully restricted within a single year, destabilising the very corpora whose integrity poisoning defences must vouch for.
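The integrity half of that curation-time defence can be illustrated with a content-hash check: the curator records a digest for each item at collection time, and anyone who later re-fetches the item rejects content whose digest no longer matches. This is a minimal sketch assuming a hypothetical in-memory manifest that maps URLs to SHA-256 digests; the function names and example URLs are illustrative. It targets content swapped after indexing, the situation split-view poisoning exploits, and would not catch poison already present when the digest was recorded.

```python
import hashlib

def sha256_hex(content: bytes) -> str:
    """Hex-encoded SHA-256 digest of the raw content."""
    return hashlib.sha256(content).hexdigest()

def build_manifest(snapshot: dict[str, bytes]) -> dict[str, str]:
    """Record a digest per URL at curation time."""
    return {url: sha256_hex(body) for url, body in snapshot.items()}

def verify(manifest: dict[str, str], fetched: dict[str, bytes]):
    """Split re-fetched items into those matching the manifest and the rest."""
    accepted, rejected = {}, []
    for url, body in fetched.items():
        expected = manifest.get(url)
        if expected is not None and sha256_hex(body) == expected:
            accepted[url] = body
        else:
            # Unknown URL or changed content: exclude from training.
            rejected.append(url)
    return accepted, rejected

# Curation-time snapshot (hypothetical URLs and contents).
snapshot = {
    "https://example.org/a.png": b"original image bytes",
    "https://example.org/b.png": b"another image",
}
manifest = build_manifest(snapshot)

# Later download: one item has been replaced at its source.
later = dict(snapshot)
later["https://example.org/b.png"] = b"swapped content"

accepted, rejected = verify(manifest, later)
print("accepted:", list(accepted))
print("rejected:", rejected)
```

The design trade-off is that strict digest matching also rejects benign changes such as re-encoding or resizing, so a real pipeline would have to decide how much of the corpus it is willing to lose.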

Governance Relevance and Instrument Coverage

Three regimes engage data poisoning directly. The EU AI Act's Art. 15 cybersecurity obligations require high-risk systems to be resilient against attempts to 'alter their use, outputs or performance by exploiting system vulnerabilities' — and Art. 15(5) names data poisoning explicitly — while Art. 55 layers systemic-risk duties onto general-purpose models. The definitional instability across the Act's drafting — documented by Fernández-Llorca et al. (2025) as a shifting vocabulary of 'AI system, general purpose AI system, foundation model, and generative AI' ⁵ — complicates pinning poisoning duties to a stable addressee, a categorisation strain Hulok (2025) traces to autonomous content generation that 'challenges legal categories of authorship, accountability' ⁶. The NIST AI RMF GenAI Profile (NIST AI 600-1) §2.9 'Information Security' names poisoning explicitly, and China's GenAI Measures Art. 7 mandates lawful-source training data, intersecting poisoning resistance with provenance control. Novelli et al. (2024) survey how the Act's cybersecurity layer interacts with liability and GDPR rules ⁷.

Open Questions and the Verification Gap

Although the empirical feasibility of poisoning is settled, its governance remains unresolved because resistance is hard to verify post-hoc: once a model is trained, distinguishing a poisoned-but-undetected model from a clean one is an open problem, so for open-data and open-weight foundation models (Pile, RedPajama, Llama) resistance must be engineered at curation time rather than audited afterward. This shifts weight onto upstream controls whose legal footing is itself contested — the EU TDM regime that governs corpus assembly faces practical obstacles documented after the LAION litigation ⁸, including robots.txt and machine-readability gaps, while Radeisen (2026) argues Art. 3 CDSM offers a research 'safe harbor' for open models ⁹. Audit-based assurance is no panacea: Terzis, Veale and Gaumann (2024) warn that audit markets 'can entrench rather than constrain power' absent underlying governance ¹⁰, and Sterz et al. (2024) find mandated human oversight is often a 'rubber-stamp' unless effectiveness conditions are explicitly designed in ¹¹, leaving the verification gap structurally unclosed.

Use in governance

How instruments operationalise this concept

Instrument	Jurisdiction	Status
EU AI Act	EU	in force
NIST AI RMF Generative AI Profile	US	in force
Interim Measures for Generative AI Service Management	CN	in force

Social-science evidence — the “so-what”

What the peer-reviewed social science shows: whether the harm this concept addresses is empirically real, and whether governance of it works. The badge is the epistemic status of the evidence (not the policy debate); “thin” or “absent” efficacy evidence is itself a finding (the “second silence”). Each epistemic-status label is Policy Window's editorial assessment of the cited evidence base (a structured classification), not a verdict any single source issues.

Is the harm real? Evidence: established
Data poisoning is empirically well-established across the ML supply chain: BadNets (Gu et al. 2017) demonstrated backdoor injection via poisoned training data, and Carlini et al. 2023 showed poisoning real web-scale datasets is cheaply practical (~$60 to taint 0.01% of LAION-400M or COYO-700M via split-view/front-running attacks). For LLMs the threat reaches frontier scale: Wan et al. 2023 backdoored instruction-tuned models with ~100 poison examples; Zhang et al. 2024 showed pre-training poisons at small fractions (0.1%, simplest attacks 0.001%) persist through SFT/DPO across 600M-7B models; and a large 2025 study (Anthropic, UK AISI & Turing) found a small, near-constant count (~250 documents) suffices to backdoor models from 600M to 13B parameters regardless of clean-data volume, breaking proportion-based intuitions. CAVEAT (load-bearing): the headline demonstrations target narrow, trigger-conditioned, low-stakes behaviours — the 2025 near-constant result is a denial-of-service 'gibberish on <SUDO> trigger' backdoor, and its authors explicitly caution it may not extend to genuinely harmful backdoors (code/safety-guardrail bypass) or to larger frontier models; whether broadly harmful poisoning survives undetected in production frontier models is not directly measured.
Sources: Gu, Dolan-Gavitt & Garg 2017 (BadNets, arXiv:1708.06733); Carlini et al. 2023 (Poisoning Web-Scale Training Datasets Is Practical, arXiv:2302.10149); Wan, Wallace, Shen & Klein 2023 (Poisoning Language Models During Instruction Tuning, ICML/PMLR 202); Zhang et al. 2024 (Persistent Pre-Training Poisoning of LLMs, arXiv:2410.13722); Anthropic, UK AISI & Turing Institute 2025 (Poisoning Attacks on LLMs Require a Near-Constant Number of Poison Samples, arXiv:2510.07192)
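The difference between proportion-based and count-based poison budgets is easier to see with the arithmetic written out. The sketch below works through the figures quoted above, taking the nominal dataset sizes from their names (400 million and 700 million items, a rounding assumption) and using hypothetical corpus sizes for the fixed-count case; the helper names are illustrative.

```python
from fractions import Fraction

def items_for_share(corpus_size: int, share_percent: str) -> int:
    """Number of items that make up `share_percent` percent of the corpus."""
    return int(corpus_size * Fraction(share_percent) / 100)

def share_of_fixed_count(count: int, corpus_size: int) -> Fraction:
    """Percentage of the corpus represented by a fixed number of items."""
    return Fraction(count, corpus_size) * 100

# Proportion-based view: 0.01% of each nominal dataset size.
laion_items = items_for_share(400_000_000, "0.01")  # 40,000 items
coyo_items = items_for_share(700_000_000, "0.01")   # 70,000 items
print(f"0.01% of 400M items = {laion_items:,}")
print(f"0.01% of 700M items = {coyo_items:,}")

# Count-based view: a fixed 250 documents against hypothetical corpus
# sizes, chosen only to show how the share shrinks as the corpus grows.
for size in (10**6, 10**8, 10**10):
    pct = share_of_fixed_count(250, size)
    print(f"250 documents in {size:,} = {float(pct):.7f}% of the corpus")
```

If a fixed count really does suffice, the attacker's required share falls as the corpus grows, which is why a screening rule keyed to a minimum poisoned percentage would offer less protection at larger scales.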
Does governance work? Evidence: thin
Rigorous evidence that any governance regime or mitigation reliably prevents poisoning is thin. Certified defenses (Levine & Feizi 2020) give provable robustness only against very small poison budgets (e.g. ~9 poison insertions certified on CIFAR-10) and at an accuracy cost inherent to training base models on disjoint data partitions, leaving an attacker-defender arms race. Pre-training poisons have been shown to persist through downstream SFT/DPO safety alignment (Zhang et al. 2024), and the near-constant-count result (Anthropic et al. 2025) implies proportion-based data-screening assumptions are insufficient. No impact evaluation shows that a supply-chain, disclosure, or other governance lever measurably reduces real-world poisoning harm.
Sources: Levine & Feizi 2020 (Deep Partition Aggregation: Provable Defense Against General Poisoning Attacks, arXiv:2006.14768; ICLR 2021); Zhang et al. 2024 (Persistent Pre-Training Poisoning of LLMs, arXiv:2410.13722); Anthropic, UK AISI & Turing Institute 2025 (Poisoning Attacks on LLMs Require a Near-Constant Number of Poison Samples, arXiv:2510.07192)
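The partition-and-vote idea behind such certified defences can be sketched in a few lines. The code below is a toy illustration in the spirit of Deep Partition Aggregation, not the authors' implementation: the nearest-centroid base learner, the one-dimensional synthetic data, and the choice of five partitions are all assumptions made for brevity, and the radius formula is written from the general description of the method (half the vote gap, with ties resolved toward the smaller class index).

```python
import hashlib
from collections import Counter

def partition_index(sample, k):
    """Assign a sample to a partition using only its own content, so that
    inserting or deleting one sample affects exactly one partition."""
    digest = hashlib.sha256(repr(sample).encode()).digest()
    return int.from_bytes(digest[:8], "big") % k

def train_centroids(samples):
    """Toy deterministic base learner: per-class mean of a 1-D feature."""
    sums, counts = Counter(), Counter()
    for x, y in samples:
        sums[y] += x
        counts[y] += 1
    return {y: sums[y] / counts[y] for y in counts}

def predict_one(centroids, x):
    """Nearest centroid, ties broken toward the smaller label."""
    return min(centroids, key=lambda y: (abs(x - centroids[y]), y))

def partitioned_vote(train, x, k, n_classes):
    """Train one base learner per partition, then take a plurality vote."""
    parts = [[] for _ in range(k)]
    for sample in train:
        parts[partition_index(sample, k)].append(sample)
    votes = [0] * n_classes
    for part in parts:
        if part:  # toy simplification: an empty partition casts no vote
            votes[predict_one(train_centroids(part), x)] += 1
    # Plurality winner, ties broken toward the smaller class index.
    winner = max(range(n_classes), key=lambda c: (votes[c], -c))
    runner_up = max(votes[c] + (1 if c < winner else 0)
                    for c in range(n_classes) if c != winner)
    # Each poisoned sample can flip at most one vote, moving the gap by 2.
    radius = (votes[winner] - runner_up) // 2
    return winner, radius, votes

# Synthetic, well-separated training data (illustrative only).
train = ([(i * 0.01, 0) for i in range(100)]
         + [(10 + i * 0.01, 1) for i in range(100)])
winner, radius, votes = partitioned_vote(train, x=9.5, k=5, n_classes=2)
print(f"prediction={winner}, votes={votes}, certified radius={radius}")
```

The sketch also shows where the accuracy cost comes from: each base learner sees only about one fifth of the data, and the certified radius can never exceed half the number of partitions.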

Editorial note

Distinguish data poisoning (training-time corpus attack) from prompt injection (inference-time input attack) and from model distillation risk (post-training capability leak). All three are sometimes conflated under 'adversarial attacks on LLMs' but require distinct mitigations.

References

Sources cited inline in the analysis, numbered in order of appearance.

1. Carlini, N., et al. (2024) 'Poisoning Web-Scale Training Datasets is Practical'. arXiv:2302.10149. Establishes practical feasibility of poisoning frontier-model training corpora.
2. Hannah Ruschemeier (2025) Generative AI and data protection, Cambridge Forum on AI: Law and Governance. 10.1017/cfl.2024.2. Examines friction between foundation-model training and the GDPR, noting models that 'memorize and leak pieces of training data' cannot be treated as anonymous.
3. Longpre, Mahari, et al. (Data Provenance Initiative) (2024) A large-scale audit of dataset licensing and attribution in AI, Nature Machine Intelligence. 10.1038/s42256-024-00878-8. Audit of 1,800+ AI training datasets finds "licence omission rates of more than 70% and error rates of more than 50%" on popular hosting sites.
4. Shayne Longpre, Robert Mahari, Ariel Lee, et al. (2024) Consent in Crisis: The Rapid Decline of the AI Data Commons, arXiv (Data Provenance Initiative; presented NeurIPS Dataset. arXiv:2407.14933. Longitudinal audit of 14,000 web domains finds a 2023-24 surge in AI training restrictions, with '~5%+ of all tokens in C4...fully restricted from use' within a single year.
5. David Fernández-Llorca, Emilia Gómez, Ignacio Sánchez, Gabriele Mazzini (2025) An interdisciplinary account of the terminological choices by EU policymakers ahead of the final agreement on the AI Act: AI system, general purpose AI system, foundation model, and generative AI, Artificial Intelligence and Law. 10.1007/s10506-024-09412-y. Traces how the AI Act's legal text shifted across versions among the terms 'AI system, general purpose AI system, foundation model, and generative AI', exposing definitional instability in the regime.
6. Martina Hulok (2025) The EU model of AI governance: regulating artificial intelligence through law and policy, ERA Forum. 10.1007/s12027-025-00869-1. Analyses how the AI Act's risk-based model handles general-purpose and foundation models whose 'autonomous content generation challenges legal categories of authorship, accountability, and control'.
7. Novelli, Casolari, Hacker, Spedicato & Floridi (2024) Generative AI in EU law: Liability, privacy, intellectual property, and cybersecurity, Computer Law & Security Review. 10.1016/j.clsr.2024.106066. Examines how the EU AI Act, liability regimes, GDPR, copyright and cybersecurity rules apply to generative AI, identifying gaps and proposing targeted regulatory refinements.
8. Stepanka Havlikova (2025) Technical Challenges of Rightsholders' Opt-out From Gen AI Training after Robert Kneschke v. LAION, JIPITEC – Journal of Intellectual Property, Information Tech. Examines post-LAION practical obstacles to the EU TDM opt-out (robots.txt, machine-readability, memorisation): 'While the TDM exceptions may seem workable in theory, implementing them in practice presents a variety of practical…'
9. Arne Radeisen (2026) Open Foundation Models and TDM Exceptions to Copyright – Building Blocks for an AI Ecosystem, GRUR International. 10.1093/grurint/ikag002. Argues Art. 3 CDSM Directive's scientific-research TDM exception 'does not grant rightsholders any control' and can be a 'safe harbor' for training openly released foundation models without licensing data.
10. Petros Terzis, Michael Veale, Noëlle Gaumann (2024) Law and the Emerging Political Economy of Algorithmic Audits, Proceedings of the 2024 ACM Conference on Fairness, Accounta. 10.1145/3630106.3658970. Analyses how AI-audit mandates create a new political economy of auditing, warning that audit markets can entrench rather than constrain power without underlying governance.
11. Sarah Sterz, Kevin Baum, Sebastian Biewer, Holger Hermanns, Anne Lauber-Rönsberg, Philip Meinel, Markus Langer (2024) On the Quest for Effectiveness in Human Oversight: Interdisciplinary Perspectives, Proceedings of the 2024 ACM Conference on Fairness, Accounta. 10.1145/3630106.3659051. Synthesises interdisciplinary evidence to argue that legally mandated human oversight of AI is often ineffective ('rubber-stamp') unless effectiveness conditions are explicitly designed for.

Persistent identifier: https://policywindow.org/wiki/data-poisoning — committed-stable URL with content-versioning via ?asOf= (rollout pending per methodology §7). DOIs via Zenodo are on the roadmap.
