Capability Elicitation
capability-elicitation · Frontier safety
Techniques designed to reveal the upper bounds of an AI model's capabilities, rather than measuring its default behaviour, so that downstream safety judgements can be calibrated to what the model *can* do under adversarial prompting or fine-tuning.
Definition and scope
Capability elicitation is methodologically distinct from benchmarking. A benchmark measures average performance under standard prompting; elicitation aims to surface the model's actual capability ceiling.
Common methods:
- (a) Adversarial prompting: red-team-style attempts to invoke a withheld behaviour (Branwen 2020, Weidinger et al. 2024).
- (b) Chain-of-thought and structured prompting: forcing step-by-step reasoning, which often reveals skills the model would otherwise hide or skip (Wei et al. 2022).
- (c) Multi-stage or decomposition prompting: breaking tasks into sub-tasks, which also breaks up deception incentives (Andersson 2024).
- (d) Fine-tuning pressure: testing whether the safety behaviour breaks under modest fine-tuning, which would indicate that the underlying capability is preserved (Qi et al. 2023, 'Fine-tuning Aligned LLMs').
Governance relevance: the adversarial testing required by EU AI Act Art. 55(1)(a) presupposes that elicitation methods exist. Reporting under US EO 14110 §4.2(a) includes red-team results, which depend on the choice of elicitation methodology. The lack of standardisation across elicitation methods is one reason regulator-mandated evaluation results are not directly comparable across providers: the elicitation suites used by Anthropic, OpenAI, and DeepMind differ from one another. The Frontier Foundation Model Eval Consortium is attempting to converge methodology; consensus remains partial.
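The contrast between a benchmark score and an elicited ceiling can be made concrete with a small sketch. Everything in it is an assumption for illustration: the strategy names, the pass/fail numbers, and the choice to operationalise the "ceiling" as the share of tasks solved under at least one prompting strategy. None of it is taken from the cited sources, and a best-of-strategies figure is itself only a lower bound on what a model can do.

```python
from statistics import mean

# Hypothetical per-task results (1 = solved, 0 = not solved), keyed by
# prompting strategy. Illustrative numbers only, not real evaluation data.
RESULTS = {
    "standard": [1, 0, 0, 1, 0],
    "chain_of_thought": [1, 1, 0, 1, 0],
    "decomposition": [1, 0, 1, 1, 0],
}


def benchmark_score(results, default="standard"):
    """Average pass rate under the default prompting strategy."""
    return mean(results[default])


def elicited_ceiling(results):
    """Share of tasks solved by at least one strategy (best-of-strategies)."""
    per_task = zip(*results.values())  # one tuple of outcomes per task
    return mean(max(outcomes) for outcomes in per_task)


def elicitation_gap(results, default="standard"):
    """How far the default-prompt score understates the elicited ceiling."""
    return elicited_ceiling(results) - benchmark_score(results, default)


if __name__ == "__main__":
    print("benchmark score:", benchmark_score(RESULTS))
    print("elicited ceiling:", elicited_ceiling(RESULTS))
    print("elicitation gap:", elicitation_gap(RESULTS))
```

With these made-up numbers the default-prompt score is 0.4 while the best-of-strategies figure is 0.8. Two providers that report only one of the two, or that use different strategy sets, would publish numbers that cannot be compared directly.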
Used by these instruments
- EU AI Act · EU
- Executive Order 14110 on Safe, Secure, Trustworthy AI · US
- G7 Hiroshima AI Process Code of Conduct · G7
- Anthropic Responsible Scaling Policy (RSP) v2 · US
- OpenAI Preparedness Framework · US
- Google DeepMind Frontier Safety Framework · US
- Meta Frontier AI Framework · US
- UK-US AI Safety Institute Memorandum of Understanding · global
Related concepts
- AI Alignment: The technical problem of designing AI systems whose objectives, behaviour, and emergent goals reliab…
- Scalable Oversight: The set of techniques for supervising AI systems whose outputs are too complex, too numerous, or too…
- Red-Team Evaluation: Structured adversarial probing of an AI model's capabilities and behaviour before deployment, design…
- Deceptive Alignment: A failure mode in which a model appears aligned during training and evaluation because doing so serv…
Editorial note
Distinguish from 'benchmarking' (average-case measurement) and 'red-teaming' (specific adversarial procedure). Capability elicitation is the umbrella; red-teaming is one technique under it.
Cite this article
Persistent identifier: https://policywindow.org/wiki/capability-elicitation (committed-stable URL with content-versioning via ?asOf=; rollout pending per methodology §7). DOIs via Zenodo are on the roadmap.