A training-time attack in which an adversary inserts crafted examples into the training corpus or fine-tuning dataset to alter the resulting model's behaviour, typically by planting a backdoor that triggers on a specific input pattern or by degrading performance on a target class.
Definition and scope
Data poisoning is the canonical training-time adversarial attack. The lineage runs from Biggio et al. (2012, 'Poisoning Attacks against Support Vector Machines') through targeted backdoor attacks on deep networks (Gu et al. 2017, 'BadNets'; Chen et al. 2017) to recent work on foundation-model corpora (Carlini et al. 2024, 'Poisoning Web-Scale Training Datasets is Practical').
Two sub-cases matter: (a) targeted poisoning, in which the adversary inserts examples to cause a specific misclassification or to plant a backdoor that fires on a trigger; (b) untargeted poisoning, in which the adversary degrades overall performance, often as a form of denial of service. For foundation models trained on web-scale corpora (Common Crawl, LAION), the practicality bar is low: Carlini et al. (2024) demonstrated that poisoning ~0.01% of the training corpus is feasible for an attacker controlling a handful of expired domains.
Governance relevance is direct and increasingly cited. The NIST AI RMF GenAI Profile (NIST AI 600-1) §2.9 'Information Security' names data poisoning. The EU AI Act's Art. 15 cybersecurity obligations and Art. 55 systemic-risk obligations require protection against attempts to alter a system's use, outputs or performance, which covers training-time attacks. China's GenAI Measures Art. 7 mandates legal-source training data, which intersects with poisoning resistance.
The governance gap: poisoning resistance is hard to verify post hoc. Once a model is trained, distinguishing a poisoned-but-undetected model from a clean one is an open problem. For open-data and open-weight foundation models (Pile, RedPajama, Llama series), poisoning resistance must therefore be designed in at curation time.
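One way to design in resistance at curation time, at least against the expired-domain route noted above, is to record a content hash for each item when the corpus index is built and to reject anything that no longer matches at download time. The following is a minimal sketch of that idea, not a description of any particular dataset's pipeline: IndexEntry, filter_verified and the fetch callable are hypothetical names introduced for illustration, and the in-memory dictionaries stand in for real web fetches.

```python
import hashlib
from typing import Callable, Iterable, List, NamedTuple, Tuple


class IndexEntry(NamedTuple):
    """One row of a hypothetical dataset index: a URL plus the SHA-256
    digest of the content as it was seen at curation time."""
    url: str
    sha256: str


def sha256_hex(content: bytes) -> str:
    """Return the hex SHA-256 digest of a byte string."""
    return hashlib.sha256(content).hexdigest()


def filter_verified(
    index: Iterable[IndexEntry],
    fetch: Callable[[str], bytes],
) -> Tuple[List[Tuple[str, bytes]], List[str]]:
    """Fetch every indexed URL and keep only content whose digest still
    matches the one recorded at curation time.

    Returns (verified, rejected): verified is a list of (url, content)
    pairs, rejected is a list of URLs whose content has changed.
    """
    verified: List[Tuple[str, bytes]] = []
    rejected: List[str] = []
    for entry in index:
        content = fetch(entry.url)
        if sha256_hex(content) == entry.sha256:
            verified.append((entry.url, content))
        else:
            rejected.append(entry.url)
    return verified, rejected


# Toy demonstration (assumed data, no network access).
# Content as seen by the curator when the index was built:
curation_time = {
    "https://a.example/cat.jpg": b"cat",
    "https://b.example/dog.jpg": b"dog",
}
index = [IndexEntry(url, sha256_hex(body)) for url, body in curation_time.items()]

# Later, an attacker takes over b.example and swaps the content:
download_time = dict(curation_time)
download_time["https://b.example/dog.jpg"] = b"poison"

verified, rejected = filter_verified(index, download_time.__getitem__)
print(len(verified), rejected)  # prints: 1 ['https://b.example/dog.jpg']
```

A check of this kind only catches content that changed after the index was built. It does nothing about examples that were already poisoned when the curator first saw them, and it will also reject benign changes such as re-encoded images, so it is best read as one layer of curation-time hygiene rather than a complete defence.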
Related concepts
- AI Supply Chain: The end-to-end pipeline of inputs, intermediate artefacts, and downstream applications by which an A…
- Training-Data Attribution: Technical methods that identify which training examples most influenced a specific AI model output,…
- Model Distillation Risk: The risk that a closed-weight frontier model's capabilities can be partially recovered by training a…
- Jailbreak Resistance: The robustness of an AI model's safety training against adversarial prompts crafted to elicit policy…
- Prompt Injection: An adversarial input technique in which untrusted content fed to an AI model (e.g., text on a webpag…
Editorial note
Distinguish data poisoning (training-time corpus attack) from prompt injection (inference-time input attack) and from model distillation risk (post-training capability leak). All three are sometimes conflated under 'adversarial attacks on LLMs' but require distinct mitigations.