Capability Elicitation

Policy Window Editorial Board

Pinned snapshot. This article is rendered from the catalog state captured at 2026-05-29 (the closest snapshot at or before the requested date of 2026-05-29). The live version may differ.

capability-elicitation · Frontier safety

Concept · Editorial review pending

Techniques designed to reveal the upper bounds of an AI model's capabilities, rather than measuring its default behaviour, so that downstream safety judgements can be calibrated to what the model *can* do under adversarial prompting or fine-tuning.

Definition and scope

Capability elicitation is methodologically distinct from benchmarking. A benchmark measures average performance under standard prompting; elicitation aims to surface the model's actual capability ceiling. Common methods:

(a) Adversarial prompting: red-team-style attempts to invoke a withheld behaviour (Branwen 2020, Weidinger et al. 2024).
(b) Chain-of-thought and structured prompting: forcing step-by-step reasoning, which often reveals skills the model would otherwise hide or skip (Wei et al. 2022).
(c) Multi-stage or decomposition prompting: breaking tasks into sub-tasks, which also decomposes any incentive to deceive (Andersson 2024).
(d) Fine-tuning pressure: testing whether the safety behaviour breaks under modest fine-tuning, which would indicate that the underlying capability is preserved (Qi et al. 2023, 'Fine-tuning Aligned LLMs').

Governance relevance: the adversarial testing required by EU AI Act Art. 55(1)(a) presupposes that elicitation methods exist. Reporting under US EO 14110 §4.2(a), an order rescinded in January 2025, included red-team results, which depend on elicitation methodology choices. The lack of standardisation across elicitation methods is one reason regulator-mandated evaluation results are not directly comparable across providers (Anthropic's elicitation suite ≠ OpenAI's ≠ DeepMind's). The Frontier Foundation Model Eval Consortium is attempting to converge methodology; consensus remains partial.
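
The difference between a benchmark score and an elicited ceiling, and the reason results from different elicitation suites are hard to compare, can be illustrated with a small sketch. Everything in it is an assumption made for illustration: the outcome data is invented, the condition names loosely mirror the four methods listed above, and the function names `benchmark_score` and `elicited_ceiling` are hypothetical rather than taken from any provider's tooling. The sketch treats a task as within the model's capability if any method in the suite solves it, which is one possible reading of "capability ceiling", not a standard definition.

```python
# Illustrative only: invented per-task outcomes (1 = solved, 0 = failed)
# for a single model under five hypothetical evaluation conditions.
results = {
    "default_prompt":     [1, 0, 0, 1, 0, 0, 0, 1],
    "adversarial_prompt": [1, 0, 1, 1, 0, 0, 0, 1],
    "chain_of_thought":   [1, 1, 0, 1, 0, 1, 0, 1],
    "task_decomposition": [1, 1, 0, 1, 1, 0, 0, 1],
    "light_fine_tuning":  [1, 1, 1, 1, 0, 1, 0, 1],
}


def benchmark_score(results, baseline="default_prompt"):
    """Average-case score: fraction of tasks solved under standard prompting."""
    outcomes = results[baseline]
    return sum(outcomes) / len(outcomes)


def elicited_ceiling(results, methods):
    """Fraction of tasks solved by at least one method in the suite."""
    per_method = [results[m] for m in methods]
    # zip(*...) groups outcomes by task; max() is 1 if any method solved it.
    solved_by_any = [max(task_outcomes) for task_outcomes in zip(*per_method)]
    return sum(solved_by_any) / len(solved_by_any)


# Two hypothetical suites that differ only in which methods they include.
suite_a = ["default_prompt", "adversarial_prompt"]
suite_b = ["default_prompt", "chain_of_thought", "light_fine_tuning"]

print("benchmark score:     ", benchmark_score(results))
print("ceiling, suite A:    ", elicited_ceiling(results, suite_a))
print("ceiling, suite B:    ", elicited_ceiling(results, suite_b))
print("ceiling, all methods:", elicited_ceiling(results, list(results)))
```

With this invented data the same model scores 0.375 under default prompting, 0.5 under suite A, 0.75 under suite B, and 0.875 when every method is pooled. The spread between the two suites comes entirely from methodology choice, which is the comparability problem described above.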

Related concepts

AI Alignment: The technical problem of designing AI systems whose objectives, behaviour, and emergent goals reliab...
Scalable Oversight: The set of techniques for supervising AI systems whose outputs are too complex, too numerous, or too...
Red-Team Evaluation: Structured adversarial probing of an AI model's capabilities and behaviour before deployment, design...
Deceptive Alignment: A failure mode in which a model appears aligned during training and evaluation because doing so serv...

Editorial note

Distinguish from 'benchmarking' (average-case measurement) and 'red-teaming' (specific adversarial procedure). Capability elicitation is the umbrella; red-teaming is one technique under it.

References

Qi, X., Zeng, Y., Xie, T., Chen, P.-Y., Jia, R., Mittal, P., Henderson, P. (2023), 'Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!'

Cite this article

@misc{policywindow-capability-elicitation,
  title  = {Capability Elicitation},
  author = {Policy Window},
  year   = {n.d.},
  howpublished = {capability-elicitation — safety},
  url    = {https://policywindow.org/wiki/capability-elicitation},
  note   = {Primary source: https://arxiv.org/abs/2310.06987}
}

Persistent identifier: https://policywindow.org/wiki/capability-elicitation — committed-stable URL with content-versioning via ?asOf= (rollout pending per methodology §7). DOIs via Zenodo are on the roadmap.
