Techniques designed to reveal the upper bounds of an AI model's capabilities, rather than measuring its default behaviour, so that downstream safety judgements can be calibrated to what the model *can* do under adversarial prompting or fine-tuning.
Definition and scope
Capability elicitation is methodologically distinct from benchmarking. A benchmark measures average performance under standard prompting; elicitation aims to surface the model's capability ceiling, including skills it does not display by default.
Common methods:
- (a) Adversarial prompting: red-team-style attempts to invoke a withheld behaviour (Branwen 2020, Weidinger et al. 2024).
- (b) Chain-of-thought and structured prompting: forcing step-by-step reasoning, which often reveals skills the model would otherwise hide or skip (Wei et al. 2022).
- (c) Multi-stage or decomposition prompting: breaking tasks into sub-tasks, which also breaks up any incentive to deceive across smaller steps (Andersson 2024).
- (d) Fine-tuning pressure: testing whether the safety behaviour breaks under modest fine-tuning, which would indicate that the underlying capability is preserved (Qi et al. 2023, 'Fine-tuning Aligned LLMs').
Governance relevance: adversarial testing under EU AI Act Art. 55(1)(a) presupposes that elicitation methods exist. Reporting under US EO 14110 §4.2(a) includes red-team results, which depend on the choice of elicitation methodology. The lack of standardisation across elicitation methods is one reason regulator-mandated evaluation results are not directly comparable across providers: Anthropic's, OpenAI's and DeepMind's elicitation suites all differ. The Frontier Foundation Model Eval Consortium is attempting to converge methodology; consensus remains partial.
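The difference between average-case measurement and a capability ceiling can be shown with a small sketch. Everything in it is hypothetical: the strategy names, the task outcomes and the helper functions are illustrative assumptions, not any provider's or regulator's methodology. The benchmark-style figure is the success rate under the default prompt alone; the elicitation-style figure is the best success rate across every strategy tried.

```python
from statistics import mean


def default_score(results, default="default"):
    """Benchmark-style figure: mean success rate under the standard prompt only."""
    return mean(results[default])


def elicited_ceiling(results):
    """Elicitation-style figure: best mean success rate across all strategies tried."""
    best = max(results, key=lambda name: mean(results[name]))
    return best, mean(results[best])


# Hypothetical outcomes (1 = task solved, 0 = not solved) on the same four tasks.
# Strategy names and numbers are made up for illustration.
results = {
    "default": [0, 0, 1, 0],
    "chain_of_thought": [1, 0, 1, 0],
    "decomposition": [1, 1, 1, 0],
}

strategy, ceiling = elicited_ceiling(results)
gap = ceiling - default_score(results)  # the "elicitation gap" for this toy data
print(f"default={default_score(results)}, best={strategy} at {ceiling}, gap={gap}")
```

The maximum over the strategies that were tried is only a lower bound on what the model can do: a strategy nobody tested might score higher. That is one reason results from different elicitation suites are hard to compare.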
Used by these instruments
- EU AI Act · EU
- Executive Order 14110 on Safe, Secure, Trustworthy AI · US
- G7 Hiroshima AI Process Code of Conduct · G7
- Anthropic Responsible Scaling Policy (RSP) v2 · US
- OpenAI Preparedness Framework · US
- Google DeepMind Frontier Safety Framework · US
- Meta Frontier AI Framework · US
- UK-US AI Safety Institute Memorandum of Understanding · global
Related concepts
- AI Alignment— The technical problem of designing AI systems whose objectives, behaviour, and emergent goals reliab
- Scalable Oversight— The set of techniques for supervising AI systems whose outputs are too complex, too numerous, or too
- Red-Team Evaluation— Structured adversarial probing of an AI model's capabilities and behaviour before deployment, design
- Deceptive Alignment— A failure mode in which a model appears aligned during training and evaluation because doing so serv
Editorial note
Distinguish from 'benchmarking' (average-case measurement) and 'red-teaming' (specific adversarial procedure). Capability elicitation is the umbrella; red-teaming is one technique under it.