An evaluation methodology that probes AI models across multi-step conversations rather than single prompts — designed to surface deception, sycophancy, context-accumulation jailbreaks, and capability degradation that single-prompt benchmarks miss.
Definition and scope
Single-turn benchmarks (MMLU, HumanEval, GPQA) measure performance on independent prompts. Multi-turn evaluation extends the protocol to dialogues, with each model response feeding into the next prompt. This methodology surfaces failure modes that single-turn evaluation misses: (a) sycophancy drift — the model progressively conforms to user beliefs across turns (Sharma et al. 2023, 'Towards Understanding Sycophancy in Language Models'); (b) jailbreak via context accumulation — many-shot jailbreaking (Anil et al. 2024, Anthropic, 'Many-shot Jailbreaking') exploits the long context window; (c) deceptive alignment indicators — multi-turn probes can elicit inconsistencies between model self-reports across turns (Pacchiardi et al. 2023, 'How to Catch an AI Liar'); (d) capability elicitation — chain-of-thought + decomposition prompting often outperforms single-shot prompting (Wei et al. 2022, Andersson 2024). Benchmarks such as MT-Bench (Zheng et al. 2023), AgentBench (Liu et al. 2024), and HarmBench (Mazeika et al. 2024) operationalise the multi-turn protocol. Governance relevance: EU AI Act Art. 55(1)(a) adversarial-testing requirement presupposes that the testing methodology can detect deployment-realistic failure modes — many of which are multi-turn-only. UK AISI's pre-deployment evaluation suite includes multi-turn jailbreak + agentic-trajectory probes. NIST AI RMF GenAI Profile Manage 2.3 calls for evaluation 'across the lifecycle' which implicitly covers multi-turn. Standardisation across providers remains partial — each frontier lab uses a different multi-turn methodology, making cross-vendor comparison fraught (Frontier Foundation Model Eval Consortium converging slowly).
Used by these instruments
Related concepts
- Capability Elicitation— Techniques designed to reveal the upper bounds of an AI model's capabilities, rather than measuring
- Red-Team Evaluation— Structured adversarial probing of an AI model's capabilities and behaviour before deployment, design
- Jailbreak Resistance— The robustness of an AI model's safety training against adversarial prompts crafted to elicit policy
- Deceptive Alignment— A failure mode in which a model appears aligned during training and evaluation because doing so serv
- Sandbagging— A theoretical failure mode in which a model deliberately underperforms on capability evaluations — e
- Agentic AI System— An AI system that takes actions in the world — calling tools, executing code, browsing the web, send
Appears in topic articles
Editorial note
Multi-turn evaluation is the umbrella; specific protocols (many-shot probing, agentic trajectories, conversational red-teaming) are sub-cases. When citing in policy text, name the specific protocol to avoid the methodology-laundering risk where 'we did multi-turn evaluation' substitutes for substantive methodology disclosure.
References
Take this further — sign up free
Save, compare, or get alerts when Multi-Turn Evaluation changes. Policy Window is the analyst workbench layered on top of this wiki — free for researchers, civil society, and verified policymakers.