Print-friendly view · use your browser's Save as PDF option (Cmd/Ctrl-P) to attach this article to a brief.

Multi-Turn Evaluation

multi-turn-evaluation · safety · concept

Source: https://policywindow.org/wiki/multi-turn-evaluation

Generated 2026-05-30T22:09:35 UTC

Summary

An evaluation methodology that probes AI models across multi-step conversations rather than single prompts — designed to surface deception, sycophancy, context-accumulation jailbreaks, and capability degradation that single-prompt benchmarks miss.

At a glance

Used by: 2 instrument(s)
Related concepts: capability-elicitation, red-team-evaluation, jailbreak-resistance, deceptive-alignment, sandbagging, agentic-system
Primary source: Zheng, L., et al. (2023), 'Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena' — operationalises the multi-turn evaluation protocol for foundation models.
Source URL: https://arxiv.org/abs/2306.05685

Details

Single-turn benchmarks (MMLU, HumanEval, GPQA) measure performance on independent prompts. Multi-turn evaluation extends the protocol to dialogues, with each model response feeding into the next prompt. This methodology surfaces failure modes that single-turn evaluation misses: (a) sycophancy drift — the model progressively conforms to user beliefs across turns (Sharma et al. 2023, 'Towards Understanding Sycophancy in Language Models'); (b) jailbreak via context accumulation — many-shot jailbreaking (Anil et al. 2024, Anthropic, 'Many-shot Jailbreaking') exploits the long context window; (c) deceptive alignment indicators — multi-turn probes can elicit inconsistencies between model self-reports across turns (Pacchiardi et al. 2023, 'How to Catch an AI Liar'); (d) capability elicitation — chain-of-thought + decomposition prompting often outperforms single-shot prompting (Wei et al. 2022, Andersson 2024). Benchmarks such as MT-Bench (Zheng et al. 2023), AgentBench (Liu et al. 2024), and HarmBench (Mazeika et al. 2024) operationalise the multi-turn protocol. Governance relevance: EU AI Act Art. 55(1)(a) adversarial-testing requirement presupposes that the testing methodology can detect deployment-realistic failure modes — many of which are multi-turn-only. UK AISI's pre-deployment evaluation suite includes multi-turn jailbreak + agentic-trajectory probes. NIST AI RMF GenAI Profile Manage 2.3 calls for evaluation 'across the lifecycle' which implicitly covers multi-turn. Standardisation across providers remains partial — each frontier lab uses a different multi-turn methodology, making cross-vendor comparison fraught (Frontier Foundation Model Eval Consortium converging slowly).

How to cite this article

APA

Policy Window. (n.d.). Multi-Turn Evaluation [Wiki article — Concept]. https://policywindow.org/wiki/multi-turn-evaluation

Chicago

Policy Window. n.d.. "Multi-Turn Evaluation." Wiki article (Concept). https://policywindow.org/wiki/multi-turn-evaluation.

Harvard

Policy Window (n.d.) 'Multi-Turn Evaluation', Wiki article — Concept, available at: https://policywindow.org/wiki/multi-turn-evaluation.

OSCOLA

Policy Window, 'Multi-Turn Evaluation' (Wiki article — Concept, n.d.) <https://policywindow.org/wiki/multi-turn-evaluation> accessed [date].

BibTeX

@misc{policywindow-multi-turn-evaluation,
  title  = {Multi-Turn Evaluation},
  author = {Policy Window},
  year   = {n.d.},
  howpublished = {multi-turn-evaluation — safety},
  url    = {https://policywindow.org/wiki/multi-turn-evaluation},
  note   = {Primary source: https://arxiv.org/abs/2306.05685}
}