Capability Elicitation

Policy Window Editorial Board

Capability Elicitation

capability-elicitation · Frontier safety

Concept

Tools

Last verified 2026-06-21

Cite Share PDF

Techniques designed to reveal the upper bounds of an AI model's capabilities, rather than measuring its default behaviour, so that downstream safety judgements can be calibrated to what the model *can* do under adversarial prompting or fine-tuning.

Definition & scope

Field consensus on this concept:emerging

Capability elicitation is methodologically distinct from benchmarking. A benchmark measures average performance under standard prompting; elicitation aims to surface the model's actual capability ceiling. Common methods: (a) adversarial prompting — red-team-style attempts to invoke a withheld behaviour (Branwen 2020, Weidinger et al. 2024); (b) chain-of-thought + structured prompting — forcing step-by-step reasoning, often revealing skills the model would otherwise hide or skip (Wei et al. 2022); (c) multi-stage / decomposition prompting — breaking tasks into sub-tasks that decompose deception incentives (Andersson 2024); (d) fine-tuning pressure — does the safety behaviour break under modest fine-tuning, indicating the underlying capability is preserved (Qi et al. 2023, 'Fine-tuning Aligned LLMs')? Governance relevance: EU AI Act Art. 55(1)(a) adversarial testing presupposes elicitation methods exist. US EO 14110 §4.2(a) reporting includes red-team results, which depend on elicitation methodology choices. The lack of standardisation across elicitation methods is one reason regulator-mandated evaluation results are not directly comparable across providers (Anthropic's elicitation suite ≠ OpenAI's ≠ DeepMind's). The Frontier Foundation Model Eval Consortium is attempting to converge methodology; consensus remains partial.

Locus of dispute: What is the right standardised elicitation methodology for regulator-mandated capability evaluation? Each frontier lab uses a different suite; Frontier Foundation Model Eval Consortium is converging slowly.

Definition and Distinctions from Adjacent Methods

Capability elicitation is an umbrella term for techniques that probe a model's capability ceiling — what it can do — rather than its default disposition. This frame separates it from two adjacent practices. Benchmarking measures average-case performance and yields a floor, not a ceiling: a model that scores poorly may still possess the latent skill under adversarial elicitation. That latent reach is broad — Eloundou et al. (2024) find LLMs show general-purpose-technology traits, with ~80% of US workers exposed on at least 10% of tasks ¹. Red-teaming, by contrast, is one adversarial procedure subsumed under it, not its synonym. The distinction matters for governance because a safety judgement calibrated to default behaviour can understate risk: a model that refuses by default may comply after modest fine-tuning, as Qi et al. (2023) show in ² — a gap that strengthens the frontier-regulation case for mandated standards over self-regulation ³. Elicitation thus reframes evaluation from observed conduct to demonstrated potential.

Mechanisms: How Capabilities Are Surfaced

The concept's methodology distinguishes four recurring mechanism families. Adversarial prompting constructs inputs that invoke a withheld behaviour the model would otherwise suppress. Chain-of-thought and structured prompting force step-by-step reasoning, often revealing skills the model skips under terse prompting. Multi-stage decomposition breaks a task into sub-tasks whose individual innocuousness erodes the model's incentive to refuse or dissemble. Finally, fine-tuning pressure tests whether a safety behaviour is shallow: Qi et al. ² show that alignment can be compromised even when users do not intend harm, implying the underlying capability was preserved beneath the guardrail. DeepMind's dangerous-capability pilots ⁴ operationalise such methods across persuasion, cyber-offence, and self-proliferation, reporting early warning signs but no present strong danger — an empirical reference point for evaluation-based gating.

Governance Relevance and Instrument Engagement

Elicitation is a silent precondition of several mandated evaluation regimes. The EU AI Act's Art. 55(1)(a) duty to conduct and document adversarial testing for systemic-risk GPAI presupposes that elicitation methods exist and are sound; US Executive Order 14110 §4.2(a) likewise folds red-team results into provider reporting, and those results are only as informative as the elicitation choices behind them. This dependency is consequential because evaluation outputs feed compute- and capability-based triggers documented across the governance-by-compute literature ^5,6. Yet definitional instability in the underlying regime — traced for the AI Act's shifting terms across drafts in ⁷ — compounds the problem: if 'foundation model' and 'GPAI' are unstable categories ⁸, the capability findings attached to them inherit that instability, weakening cross-provider comparability.

Debates and Open Questions

The central contested question is methodological: what standardised elicitation methodology should govern regulator-mandated capability evaluation? Today each frontier lab runs a bespoke suite — Anthropic's, OpenAI's, and DeepMind's are not interchangeable — so a regulator receiving Art. 55(1)(a) or EO 14110 §4.2(a) submissions cannot place them on a common scale, and the Frontier Foundation Model Eval Consortium's convergence remains only partial. Empirical consensus here is best described as emerging rather than settled. Two structural critiques sharpen the stakes. First, capability ceilings can be masked: enhancement techniques that cut training compute while preserving capability ⁹ let providers under-report against thresholds, and verification of training claims is itself unresolved ¹⁰. Second, the open-problems agenda ¹¹ lists technical-governance gaps that bear directly on what a capability evaluation must demonstrate.

Use in governance

How instruments operationalise this concept

Instrument	Jurisdiction	Status
EU AI Act	EU	in force
Executive Order 14110 on Safe, Secure, Trustworthy AI	US	partial
G7 Hiroshima AI Process Code of Conduct	G7	in force
Anthropic Responsible Scaling Policy (RSP) v2	US	in force
OpenAI Preparedness Framework	US	in force
Google DeepMind Frontier Safety Framework	US	in force
Meta Frontier AI Framework	US	in force
UK-US AI Safety Institute Memorandum of Understanding	global	in force
EU General-Purpose AI Code of Practice	EU	in force

Appears in topic articles

Social-science evidence — the “so-what”

What the peer-reviewed social science shows: whether the harm this concept addresses is empirically real, and whether governance of it works. The badge is the epistemic status of the evidence(not the policy debate) — “thin” or “absent” efficacy evidence is itself a finding (the “second silence”). Each epistemic-status label is Policy Window's editorial assessment of the cited evidence base (a structured classification), not a verdict any single source issues.

Is the harm real?evidence: contested
The under-elicitation problem the concept names is empirically demonstrated, but predominantly in deliberately constructed settings, so its magnitude at frontier scale is not well bounded. Dangerous-capability evaluation programs explicitly depend on elicitation because default prompting underestimates capability (Phuong et al. 2024). Stress tests using deliberately hidden-capability 'model organisms' show the gap is real and technique-dependent: a few high-quality demonstrations or fine-tuning recover password-locked capabilities, while prompting or activation steering are less reliable, and reinforcement learning often elicits capabilities when only evaluations are available (Greenblatt, Roger, Krasheninnikov & Krueger 2024; Hofstätter et al. 2025). Models can also be trained or prompted to strategically underperform on evaluations (van der Weij et al. 2024). The key caveat that prevents an 'established' rating: the large, clearly measured gaps come from intentionally constructed password-locked / circuit-broken organisms; how large the SPONTANEOUS under-elicitation gap is for an un-tampered frontier model remains poorly characterized.
Sources: Phuong et al. 2024 (Evaluating Frontier Models for Dangerous Capabilities, arXiv:2403.13793); Greenblatt, Roger, Krasheninnikov & Krueger 2024 (Stress-Testing Capability Elicitation With Password-Locked Models, arXiv:2405.19550, NeurIPS 2024); van der Weij et al. 2024 (AI Sandbagging: Language Models can Strategically Underperform on Evaluations, arXiv:2406.07358, ICLR 2025); Hofstätter et al. 2025 (The Elicitation Game: Evaluating Capability Elicitation Techniques, arXiv:2502.02180)
Does governance work?evidence: thin
There is no validated, standardised elicitation methodology shown to reliably establish a model's true capability ceiling for regulator-mandated evaluation. Frontier safety frameworks pledge to elicit the model's 'full capabilities' but the published summaries describe no benchmarked elicitation standard or validation procedure (METR 2025). Empirically, no single elicitation technique generalises across tasks (Hofstätter et al. 2025), and evaluation-awareness research finds models internally represent the test-vs-deployment distinction — with current safety evaluations already classified as artificial — which undermines the reliability of any fixed protocol (Nguyen, Hoang, Attubato & Hofstätter 2025). No study demonstrates that a mandated elicitation regime measurably closes the under-elicitation gap, so the evidence that governance works here is thin and partly rests on interpretation of policy documents rather than measured outcomes.
Sources: METR 2025 (Common Elements of Frontier AI Safety Policies, December 2025 update, metr.org/common-elements); Hofstätter et al. 2025 (The Elicitation Game: Evaluating Capability Elicitation Techniques, arXiv:2502.02180); Nguyen, Hoang, Attubato & Hofstätter 2025 (Probing and Steering Evaluation Awareness of Language Models, arXiv:2507.01786)

Editorial note

Distinguish from 'benchmarking' (average-case measurement) and 'red-teaming' (specific adversarial procedure). Capability elicitation is the umbrella; red-teaming is one technique under it.

References

Sources cited inline in the analysis (linked from the superscript markers), then the primary instrument sources behind the classifications.

Eloundou, Manning, Mishkin, Rock (2024) GPTs are GPTs: Labor market impact potential of LLMs, Science. 10.1126/science.adj0998 — Finds around 80% of the U.S. workforce "could have at least 10% of their work tasks affected" by LLMs, which exhibit "traits of general-purpose technologies". ↩
arXiv:2310.03693 ↩
Anderljung, Barnhart, Korinek, et al. (2023) Frontier AI Regulation: Managing Emerging Risks to Public Safety, arXiv. arXiv:2307.03718 — Argues "industry self-regulation is an important first step" but "government intervention will be needed", proposing safety standards, registration and reporting, and compliance mechanisms. ↩
Mary Phuong, Matthew Aitchison, Elliot Catt, et al. (Google DeepMind) (2024) Evaluating Frontier Models for Dangerous Capabilities, arXiv (cs.LG). arXiv:2403.13793 — Pilots dangerous-capability evaluations (persuasion, cyber, self-proliferation) on frontier models, finding 'early warning signs' but no strong present danger — grounding evaluation-based gating. ↩
Sastry, Heim, Belfield, Anderljung, Brundage, et al. (2024) Computing Power and the Governance of Artificial Intelligence, arXiv. arXiv:2402.08797 — Argues compute is a uniquely governable lever because it is "detectable, excludable, and quantifiable, and is produced via an extremely concentrated supply chain". ↩
Heim & Koessler (2024) Training Compute Thresholds: Features and Functions in AI Regulation, arXiv. arXiv:2405.10799 — Finds "training compute currently is the most suitable metric to identify GPAI models", but thresholds should only trigger further scrutiny, not determine risk measures alone. ↩
David Fernández-Llorca, Emilia Gómez, Ignacio Sánchez, Gabriele Mazzini (2025) An interdisciplinary account of the terminological choices by EU policymakers ahead of the final agreement on the AI Act: AI system, general purpose AI system, foundation model, and generative AI, Artificial Intelligence and Law. 10.1007/s10506-024-09412-y — Traces how the AI Act's legal text shifted across versions among the terms 'AI system, general purpose AI system, foundation model, and generative AI', exposing definitional instability in the regime. ↩
Martina Hulok (2025) The EU model of AI governance: regulating artificial intelligence through law and policy, ERA Forum. 10.1007/s12027-025-00869-1 — Analyses how the AI Act's risk-based model handles general-purpose and foundation models whose 'autonomous content generation challenges legal categories of authorship, accountability, and control'. ↩
Matteo Pistillo, Pablo Villalobos (2025) Defending Compute Thresholds Against Legal Loopholes, arXiv (cs.CY). arXiv:2502.00003 — Identifies 'enhancement techniques that are capable of decreasing training compute usage while preserving... model capabilities', exposing loopholes in compute-reporting thresholds. ↩
Akash R. Wasil, Tom Reed, Jack William Miller, Peter Barnett (2024) Verification methods for international AI agreements, arXiv (cs.CY). arXiv:2408.16074 — Surveys '10 verification methods that could detect... unauthorized AI training... and unauthorized data centers', mapping the technical basis for compute-disclosure regimes. ↩
Anka Reuel, Ben Bucknall, Stephen Casper, Tim Fist, Lennart Heim, et al. (34 authors) (2024) Open Problems in Technical AI Governance, arXiv (cs.CY). arXiv:2407.14981 — Catalogs open problems in 'technical analysis and tools for supporting the effective governance of AI', including compute measurement, verification and reporting gaps. ↩
Qi, X., Zeng, Y., Xie, T., Chen, P.-Y., Jia, R., Mittal, P., Henderson, P. (2023), 'Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!'

Cite this article 8 formats · BibTeX, RIS, APA, Chicago, … · 1-click copy

@misc{policywindow-capability-elicitation,
  title  = {Capability Elicitation},
  author = {Policy Window},
  year   = {n.d.},
  howpublished = {capability-elicitation — safety},
  url    = {https://policywindow.org/wiki/capability-elicitation},
  note   = {Primary source: https://arxiv.org/abs/2310.06987}
}

Verify the year + paste-and-refine. Primary source linked in BibTeX/RIS note.

Permalink downloads.bib .ris .csl.json

Persistent identifier: https://policywindow.org/wiki/capability-elicitation — committed-stable URL with content-versioning via ?asOf= (rollout pending per methodology §7). DOIs via Zenodo are on the roadmap.

Article tools — track changes, suggest an edit

View history — every captured revision of this article · What links here

Source: Edit on GitHub (search for `capability-elicitation`)

[ref-1] Eloundou, Manning, Mishkin, Rock (2024) GPTs are GPTs: Labor market impact potential of LLMs, Science. 10.1126/science.adj0998 — Finds around 80% of the U.S. workforce "could have at least 10% of their work tasks affected" by LLMs, which exhibit "traits of general-purpose technologies". ↩

[ref-2] arXiv:2310.03693 ↩

[ref-3] Anderljung, Barnhart, Korinek, et al. (2023) Frontier AI Regulation: Managing Emerging Risks to Public Safety, arXiv. arXiv:2307.03718 — Argues "industry self-regulation is an important first step" but "government intervention will be needed", proposing safety standards, registration and reporting, and compliance mechanisms. ↩

[ref-4] Mary Phuong, Matthew Aitchison, Elliot Catt, et al. (Google DeepMind) (2024) Evaluating Frontier Models for Dangerous Capabilities, arXiv (cs.LG). arXiv:2403.13793 — Pilots dangerous-capability evaluations (persuasion, cyber, self-proliferation) on frontier models, finding 'early warning signs' but no strong present danger — grounding evaluation-based gating. ↩

[ref-5] Sastry, Heim, Belfield, Anderljung, Brundage, et al. (2024) Computing Power and the Governance of Artificial Intelligence, arXiv. arXiv:2402.08797 — Argues compute is a uniquely governable lever because it is "detectable, excludable, and quantifiable, and is produced via an extremely concentrated supply chain". ↩

[ref-6] Heim & Koessler (2024) Training Compute Thresholds: Features and Functions in AI Regulation, arXiv. arXiv:2405.10799 — Finds "training compute currently is the most suitable metric to identify GPAI models", but thresholds should only trigger further scrutiny, not determine risk measures alone. ↩

[ref-7] David Fernández-Llorca, Emilia Gómez, Ignacio Sánchez, Gabriele Mazzini (2025) An interdisciplinary account of the terminological choices by EU policymakers ahead of the final agreement on the AI Act: AI system, general purpose AI system, foundation model, and generative AI, Artificial Intelligence and Law. 10.1007/s10506-024-09412-y — Traces how the AI Act's legal text shifted across versions among the terms 'AI system, general purpose AI system, foundation model, and generative AI', exposing definitional instability in the regime. ↩

[ref-8] Martina Hulok (2025) The EU model of AI governance: regulating artificial intelligence through law and policy, ERA Forum. 10.1007/s12027-025-00869-1 — Analyses how the AI Act's risk-based model handles general-purpose and foundation models whose 'autonomous content generation challenges legal categories of authorship, accountability, and control'. ↩

[ref-9] Matteo Pistillo, Pablo Villalobos (2025) Defending Compute Thresholds Against Legal Loopholes, arXiv (cs.CY). arXiv:2502.00003 — Identifies 'enhancement techniques that are capable of decreasing training compute usage while preserving... model capabilities', exposing loopholes in compute-reporting thresholds. ↩

[ref-10] Akash R. Wasil, Tom Reed, Jack William Miller, Peter Barnett (2024) Verification methods for international AI agreements, arXiv (cs.CY). arXiv:2408.16074 — Surveys '10 verification methods that could detect... unauthorized AI training... and unauthorized data centers', mapping the technical basis for compute-disclosure regimes. ↩

[ref-11] Anka Reuel, Ben Bucknall, Stephen Casper, Tim Fist, Lennart Heim, et al. (34 authors) (2024) Open Problems in Technical AI Governance, arXiv (cs.CY). arXiv:2407.14981 — Catalogs open problems in 'technical analysis and tools for supporting the effective governance of AI', including compute measurement, verification and reporting gaps. ↩

[ref-12] Qi, X., Zeng, Y., Xie, T., Chen, P.-Y., Jia, R., Mittal, P., Henderson, P. (2023), 'Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!'

Capability Elicitation

Definition & scope

Definition and Distinctions from Adjacent Methods

Mechanisms: How Capabilities Are Surfaced

Governance Relevance and Instrument Engagement

Debates and Open Questions

Use in governance

How instruments operationalise this concept

Appears in topic articles

Editorial note

See also

Further reading

References

Capability Elicitation

Definition & scope

Definition and Distinctions from Adjacent Methods

Mechanisms: How Capabilities Are Surfaced

Governance Relevance and Instrument Engagement

Debates and Open Questions

Use in governance

How instruments operationalise this concept

Appears in topic articles

Editorial note

See also

Further reading

References