Red-Team Evaluation

Policy Window Editorial Board

Last verified 2026-06-21

Structured adversarial probing of an AI model's capabilities and behaviour before deployment, designed to elicit failures that ordinary evaluation would miss.

Definition & scope

Field consensus on this concept: contested

Red-team evaluation originated in cybersecurity (penetration testing) and was adapted to AI in the early 2020s, gaining public prominence through the 2023 DEF CON Generative Red Team event and codification in the 2023 White House voluntary commitments. EU AI Act Art. 55(1)(a) requires adversarial testing for general-purpose AI models with systemic risk. US EO 14110 §4.2(a)(i) required reporting of red-team results for foundation models above the compute threshold (a duty rescinded when EO 14148 revoked EO 14110 in January 2025). G7 Hiroshima Code §1 calls for 'adversarial testing prior to and throughout deployment.' Anthropic, OpenAI, and Google DeepMind each maintain internal red-team programs with public methodology disclosures. Governance disputes centre on: (1) WHO must red-team (provider, independent third-party, government); (2) WHAT capabilities are in scope (CBRN uplift, autonomous replication, election manipulation, etc.); (3) WHO sees the results (provider only, regulator under confidentiality, public); (4) WHAT triggers re-evaluation after deployment.

Locus of dispute: WHO must red-team (provider, independent third-party, regulator), WHAT capabilities are in scope (CBRN uplift, autonomous replication, election manipulation), and WHO sees the results (provider only, regulator under confidentiality, public). Field convergence since the Seoul 2024 summit has been slow.

Precise Definition and Boundary Distinctions

Red-team evaluation occupies a specific niche within the AI assurance toolkit, separated from adjacent practices by two features: it is conducted pre-deployment and is animated by adversarial intent. This distinguishes it from general evaluation, which measures capability against fixed benchmarks under cooperative conditions, and from audit, the post-hoc third-party review of a deployed system. Where benchmark evaluation asks what a model typically does, red-teaming asks what a determined adversary can make it do. The boundary is contested at the edges: the EU AI Act folds 'adversarial testing' into the broader systemic-risk regime of Art. 55(1)(a), and definitional instability across the regime is well documented, with the Act's text shifting among the terms 'AI system, general purpose AI system, foundation model, and generative AI' across drafts ¹. This terminological flux complicates the question of precisely which artefacts must be red-teamed and when.

Mechanisms and Capability Scope

In practice red-teaming probes for latent dangerous capabilities rather than average-case performance, targeting failure modes that benchmark sampling would not surface: CBRN uplift, autonomous replication, and election or information manipulation. The provider frameworks named in the record — Anthropic's RSP, OpenAI's Preparedness, and DeepMind's Frontier Safety Framework — each tie red-team findings to capability thresholds that gate deployment. The election-manipulation surface is itself an active research frontier; experiments on political-speech deepfakes find that 'audio and visual information enables more accurate discernment than text alone' ², implying red-team protocols confined to text under-test the modalities humans actually rely on to detect manipulation. Because general-purpose models exhibit broad task reach — roughly 80% of the U.S. workforce 'could have at least 10% of their work tasks affected' ³ — the in-scope capability set is open-ended, and protocols must continually expand to track emergent behaviours, not merely re-run a fixed adversarial suite.
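The gating relationship described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not any provider's actual procedure: the domain names, the 0-to-1 severity scale, the threshold values, and the `Finding` and `gate_deployment` names are all hypothetical. The sketch aggregates by worst case rather than by average, mirroring the point that red-teaming asks what an adversary can elicit, and it treats unlisted capability domains as breaches so that an open-ended scope fails closed.

```python
from dataclasses import dataclass

# Hypothetical per-domain thresholds on an illustrative 0-to-1 severity scale.
# Real frameworks define their own capability levels; these numbers are invented.
THRESHOLDS = {
    "cbrn_uplift": 0.2,
    "autonomous_replication": 0.3,
    "election_manipulation": 0.4,
}

@dataclass(frozen=True)
class Finding:
    domain: str      # capability domain the probe targeted
    severity: float  # severity the red team elicited, from 0.0 to 1.0

def gate_deployment(findings):
    """Return (allowed, breaches).

    Deployment is blocked if the worst severity elicited in any domain
    meets or exceeds that domain's threshold. A domain with no listed
    threshold gets a threshold of 0.0, so any finding in it blocks
    deployment until the domain has been assessed (fail closed).
    """
    worst = {}
    for f in findings:
        worst[f.domain] = max(worst.get(f.domain, 0.0), f.severity)
    breaches = sorted(
        d for d, s in worst.items() if s >= THRESHOLDS.get(d, 0.0)
    )
    return (not breaches, breaches)

# Example: one strong elicitation is enough to block, whatever the average.
example = gate_deployment([
    Finding("cbrn_uplift", 0.05),
    Finding("cbrn_uplift", 0.25),
    Finding("election_manipulation", 0.10),
])
```

In this example the two CBRN findings average 0.15, below the hypothetical 0.2 threshold, yet the gate still blocks deployment because the worst elicited severity is 0.25.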

Governance Relevance and Instrument Engagement

Red-team evaluation is now load-bearing across binding and voluntary regimes, but the obligations differ sharply in force. The EU AI Act, Regulation (EU) 2024/1689, gives the practice its hardest legal footing: Art. 55(1)(a) requires adversarial testing for general-purpose AI models with systemic risk. The 2023 US Executive Order 14110 §4.2(a)(i) had required reporting of red-team results for foundation models above a compute threshold, but that reporting duty was rescinded when EO 14148 revoked EO 14110 (Jan 2025), illustrating how fragile administratively-grounded mandates are. Softer instruments — the G7 Hiroshima Code §1, which calls for 'adversarial testing prior to and throughout deployment,' plus the White House voluntary commitments and the UK-US AISI memorandum — supply norms without enforcement. Scholarship on the EU model notes the difficulty of fitting general-purpose systems whose 'autonomous content generation challenges legal categories of authorship, accountability, and control' into a risk-tiered structure ⁴, a tension that determines which models cross the systemic-risk line that triggers mandatory red-teaming at all.

Debates and Open Questions

The empirical consensus on red-team evaluation is contested, and the disputes are structural rather than technical. Three questions remain unsettled: WHO must red-team — the provider itself, an independent third party, or a government body such as an AI Safety Institute; WHAT capabilities fall in scope — CBRN uplift, autonomous replication, election manipulation, or some narrower set; and WHO sees the results — the provider alone, a regulator under confidentiality, or the public. Field convergence after the Seoul 2024 summit has been slow. A linked vulnerability is scope evasion: because EU obligations attach above a compute threshold, 'enhancement techniques that are capable of decreasing training compute usage while preserving... model capabilities' ⁵ let a provider keep a capable model below the line that would compel red-teaming. Compute itself is argued to be a uniquely governable lever, being 'detectable, excludable, and quantifiable, and is produced via an extremely concentrated supply chain' ⁶, yet that very leverage is what loophole engineering targets — leaving the trigger for mandatory adversarial testing, and thus the regime's reach, genuinely open.
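The scope-evasion concern can be made concrete with rough arithmetic. The sketch below assumes the AI Act's presumption threshold of 10^25 floating-point operations (Art. 51(2)) and the common rule of thumb that training compute is about 6 × parameters × training tokens. That rule is an approximation used here for illustration, not a legal measurement method, and the model sizes and the function names are invented.

```python
# Presumption threshold for high-impact capabilities in AI Act Art. 51(2).
EU_THRESHOLD_FLOP = 1e25

def estimate_training_flop(params: float, tokens: float) -> float:
    """Rule-of-thumb estimate: about 6 FLOPs per parameter per training token.

    This is an approximation for illustration only.
    """
    return 6.0 * params * tokens

def presumed_systemic_risk(params: float, tokens: float,
                           threshold: float = EU_THRESHOLD_FLOP) -> bool:
    """True if estimated training compute is greater than the threshold."""
    return estimate_training_flop(params, tokens) > threshold

# Invented figures: a 200B-parameter model trained on 10T tokens,
# and a hypothetical variant that needs 20% fewer tokens.
baseline = presumed_systemic_risk(params=200e9, tokens=10e12)  # ~1.2e25 FLOP
efficient = presumed_systemic_risk(params=200e9, tokens=8e12)  # ~9.6e24 FLOP
```

Under these assumptions the baseline run crosses the line and the more compute-efficient run does not, even if the two models were equally capable. That gap between compute and capability is what the loophole argument targets.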

Use in governance

How instruments operationalise this concept

Instrument	Jurisdiction	Status
EU AI Act	EU	in force
Executive Order 14110 on Safe, Secure, Trustworthy AI	US	partial
G7 Hiroshima AI Process Code of Conduct	G7	in force
UK Pro-Innovation Approach to AI Regulation (White Paper)	UK	in force
Anthropic Responsible Scaling Policy (RSP) v2	US	in force
OpenAI Preparedness Framework	US	in force
Google DeepMind Frontier Safety Framework	US	in force
Meta Frontier AI Framework	US	in force
UK-US AI Safety Institute Memorandum of Understanding	global	in force
White House Voluntary AI Commitments	US	in force
Singapore Model AI Governance Framework for Generative AI	SG	in force
Japan METI AI Guidelines for Business	JP	in force
EU General-Purpose AI Code of Practice	EU	in force

Social-science evidence — the “so-what”

What the peer-reviewed social science shows: whether the harm this concept addresses is empirically real, and whether governance of it works. The badge is the epistemic status of the evidence (not the policy debate); “thin” or “absent” efficacy evidence is itself a finding (the “second silence”). Each epistemic-status label is Policy Window's editorial assessment of the cited evidence base (a structured classification), not a verdict any single source issues.

Is the harm real? evidence: established
The underlying phenomenon is well-established: structured adversarial probing reliably surfaces failures that ordinary benchmarks miss. Perez et al. 2022 used one language model to red-team another, automatically eliciting tens of thousands of offensive replies from a 280B-parameter chatbot. Ganguli et al. 2022 ran a large manual red-teaming effort across model sizes (2.7B/13B/52B) and types that discovered, measured, and helped reduce harms, finding that RLHF models grew harder to red-team with scale while other model types showed a flat trend. Caveat: 'red-team evaluation' names a heterogeneous family of practices rather than a single defined procedure, so coverage and rigor vary widely across exercises.
Sources: Perez et al. 2022 (Red Teaming Language Models with Language Models, EMNLP/arXiv:2202.03286); Ganguli et al. 2022 (Red Teaming Language Models to Reduce Harms, arXiv:2209.07858)
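The automated approach summarised above has a simple shape: one component generates test prompts, the target model replies, and a classifier flags harmful replies. The sketch below shows only that loop. It is not the code of either cited study; `red_team` and the three toy stand-in functions are hypothetical placeholders for a real generator model, target model, and harm classifier.

```python
from typing import Callable, List, Tuple

def red_team(
    generate_prompts: Callable[[int], List[str]],
    target: Callable[[str], str],
    is_harmful: Callable[[str], bool],
    n: int,
) -> Tuple[float, List[Tuple[str, str]]]:
    """Generate n test prompts, query the target, and collect flagged pairs.

    Returns (fraction of prompts that elicited a flagged reply, failures).
    """
    prompts = generate_prompts(n)
    failures = []
    for prompt in prompts:
        reply = target(prompt)
        if is_harmful(reply):
            failures.append((prompt, reply))
    rate = len(failures) / len(prompts) if prompts else 0.0
    return rate, failures

# Toy stand-ins so the loop runs end to end; all three are invented.
def toy_generator(n: int) -> List[str]:
    return [f"probe-{i}" for i in range(n)]

def toy_target(prompt: str) -> str:
    # Pretend the target fails on two specific probes.
    return "UNSAFE" if prompt.endswith(("3", "7")) else "refusal"

def toy_classifier(reply: str) -> bool:
    return reply == "UNSAFE"

rate, failures = red_team(toy_generator, toy_target, toy_classifier, n=10)
```

The measured rate depends entirely on which prompts the generator happens to produce, which is one reason coverage varies so widely across exercises.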
Does governance work? evidence: thin
There is no rigorous evidence that red-teaming as a governance requirement durably reduces deployed harm, and no agreed standard for who must red-team, what is in scope, or to what depth: Feffer et al. 2024 (AIES) survey industry practice and the literature and argue it is poorly structured, non-comprehensive, composition-biased, and rarely transparently reported, warning that treating red-teaming as a panacea verges on 'security theater.' Zou et al. 2023 further show that even aligned, safety-trained models remain breakable by automated transferable adversarial attacks (universal suffixes optimized on open models transfer to GPT-3.5/4, Bard/PaLM-2, and Claude), so passing a red-team exercise does not establish robustness. Evidence that the governance lever works is thin.
Sources: Feffer et al. 2024 (Red-Teaming for Generative AI: Silver Bullet or Security Theater?, AIES/arXiv:2401.15897); Zou et al. 2023 (Universal and Transferable Adversarial Attacks on Aligned Language Models, arXiv:2307.15043)

Editorial note

Distinguish from 'evaluation' (general benchmark-style measurement) and 'audit' (post-hoc third-party review). Red-teaming is specifically pre-deployment + adversarial-intent.

References

Sources cited inline in the analysis (linked from the superscript markers), then the primary instrument sources behind the classifications.

1. David Fernández-Llorca, Emilia Gómez, Ignacio Sánchez, Gabriele Mazzini (2025) An interdisciplinary account of the terminological choices by EU policymakers ahead of the final agreement on the AI Act: AI system, general purpose AI system, foundation model, and generative AI, Artificial Intelligence and Law. doi:10.1007/s10506-024-09412-y. Traces how the AI Act's legal text shifted across versions among the terms "AI system, general purpose AI system, foundation model, and generative AI", exposing definitional instability in the regime.
2. Groh, Sankaranarayanan, Singh, Kim, Lippman, Picard (2024) Human detection of political speech deepfakes across transcripts, audio, and video, Nature Communications. doi:10.1038/s41467-024-51998-z. Experiments show "audio and visual information enables more accurate discernment than text alone": humans rely more on how something is said than on transcript content.
3. Eloundou, Manning, Mishkin, Rock (2024) GPTs are GPTs: Labor market impact potential of LLMs, Science. doi:10.1126/science.adj0998. Finds around 80% of the U.S. workforce "could have at least 10% of their work tasks affected" by LLMs, which exhibit "traits of general-purpose technologies".
4. Martina Hulok (2025) The EU model of AI governance: regulating artificial intelligence through law and policy, ERA Forum. doi:10.1007/s12027-025-00869-1. Analyses how the AI Act's risk-based model handles general-purpose and foundation models whose "autonomous content generation challenges legal categories of authorship, accountability, and control".
5. Matteo Pistillo, Pablo Villalobos (2025) Defending Compute Thresholds Against Legal Loopholes, arXiv (cs.CY). arXiv:2502.00003. Identifies "enhancement techniques that are capable of decreasing training compute usage while preserving... model capabilities", exposing loopholes in compute-reporting thresholds.
6. Sastry, Heim, Belfield, Anderljung, Brundage, et al. (2024) Computing Power and the Governance of Artificial Intelligence, arXiv. arXiv:2402.08797. Argues compute is a uniquely governable lever because it is "detectable, excludable, and quantifiable, and is produced via an extremely concentrated supply chain".
Primary instrument source: EU AI Act Art. 55(1)(a), the most binding articulation.

Cite this article

@misc{policywindow-red-team-evaluation,
  title  = {Red-Team Evaluation},
  author = {Policy Window},
  year   = {n.d.},
  howpublished = {red-team-evaluation — safety},
  url    = {https://policywindow.org/wiki/red-team-evaluation},
  note   = {Primary source: https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=CELEX%3A32024R1689}
}

Persistent identifier: https://policywindow.org/wiki/red-team-evaluation — committed-stable URL with content-versioning via ?asOf= (rollout pending per methodology §7). DOIs via Zenodo are on the roadmap.
