Multi-Turn Evaluation

Policy Window Editorial Board

Multi-Turn Evaluation

multi-turn-evaluation · Frontier safety

Concept

Tools

Last verified 2026-06-21

Cite Share PDF

An evaluation methodology that probes AI models across multi-step conversations rather than single prompts — designed to surface deception, sycophancy, context-accumulation jailbreaks, and capability degradation that single-prompt benchmarks miss.

Definition & scope

Field consensus on this concept:emerging

Single-turn benchmarks (MMLU, HumanEval, GPQA) measure performance on independent prompts. Multi-turn evaluation extends the protocol to dialogues, with each model response feeding into the next prompt. This methodology surfaces failure modes that single-turn evaluation misses: (a) sycophancy drift — the model progressively conforms to user beliefs across turns (Sharma et al. 2023, 'Towards Understanding Sycophancy in Language Models'); (b) jailbreak via context accumulation — many-shot jailbreaking (Anil et al. 2024, Anthropic, 'Many-shot Jailbreaking') exploits the long context window; (c) deceptive alignment indicators — multi-turn probes can elicit inconsistencies between model self-reports across turns (Pacchiardi et al. 2023, 'How to Catch an AI Liar'); (d) capability elicitation — chain-of-thought + decomposition prompting often outperforms single-shot prompting (Wei et al. 2022, Andersson 2024). Benchmarks such as MT-Bench (Zheng et al. 2023), AgentBench (Liu et al. 2024), and HarmBench (Mazeika et al. 2024) operationalise the multi-turn protocol. Governance relevance: EU AI Act Art. 55(1)(a) adversarial-testing requirement presupposes that the testing methodology can detect deployment-realistic failure modes — many of which are multi-turn-only. UK AISI's pre-deployment evaluation suite includes multi-turn jailbreak + agentic-trajectory probes. NIST AI RMF GenAI Profile Manage 2.3 calls for evaluation 'across the lifecycle' which implicitly covers multi-turn. Standardisation across providers remains partial — each frontier lab uses a different multi-turn methodology, making cross-vendor comparison fraught (Frontier Foundation Model Eval Consortium converging slowly).

Definition and Distinctions from Single-Turn Benchmarks

Multi-turn evaluation is defined against its foil. Single-turn benchmarks (MMLU, HumanEval, GPQA) score independent prompts and so measure a static capability snapshot ¹. Multi-turn evaluation instead chains responses: each model output is fed into the next prompt, making the conversational state itself the object of measurement. This reframes evaluation from 'can the model answer?' to 'how does it behave as context accumulates?'. Failure modes such as sycophancy drift, where the model progressively conforms to a user's stated beliefs (Sharma et al. 2023), and context-accumulation jailbreaks (Anil et al. 2024) are invisible to single-shot protocols because they require a trajectory to manifest. Dialogue-shaped risks of this kind motivate the broader argument that frontier models need bespoke regulatory treatment rather than reuse of conventional-AI tooling ²; the point that existing regulation 'has primarily focused on conventional AI models, not LGAIMs' and should target concrete high-risk applications rather than the pre-trained model is made directly in ³. The notes field flags a parsing hazard: 'multi-turn evaluation' is an umbrella, and naming the specific sub-protocol (many-shot probing, agentic trajectory, conversational red-teaming) separates substantive disclosure from methodology-laundering.

Mechanisms: What Trajectories Surface That Snapshots Hide

Four mechanisms give multi-turn evaluation its diagnostic leverage. First, sycophancy drift accumulates over a dialogue as the model updates toward the user's expressed view (Sharma et al. 2023). Second, context-accumulation jailbreaking exploits the long context window: many-shot jailbreaking (Anil et al. 2024) packs the prompt with prior pseudo-exchanges that erode refusal behaviour, an attack surface that scales with context length and is unreachable single-shot. Third, deception probing compares a model's self-reports across turns, surfacing inconsistencies that flag possible lying (Pacchiardi et al. 2023). Fourth, capability elicitation: chain-of-thought and decomposition prompting frequently outperform single-shot prompting (Wei et al. 2022), so a one-shot score can understate true capability and produce false reassurance. Dangerous-capability suites operationalise this elicitation logic on frontier models, finding 'early warning signs' rather than present danger ⁴, and the gap between measured and elicitable capability is itself catalogued as an open technical-governance measurement problem ⁵. MT-Bench (Zheng et al. 2023), AgentBench (Liu et al. 2024) and HarmBench (Mazeika et al. 2024) instantiate these mechanisms as repeatable protocols.

Governance Relevance: Instruments and Provisions Engaged

Multi-turn evaluation is load-bearing for several regimes. EU AI Act Art. 55(1)(a) imposes a duty to conduct and document adversarial testing on general-purpose models with systemic risk; that duty is only meaningful if the testing detects deployment-realistic failures, many of which (context-accumulation jailbreaks, agentic trajectories) are multi-turn-only. The NIST AI RMF GenAI Profile's Manage function calls for evaluation that follows deployed systems beyond a static snapshot, which implicitly reaches conversational use, and UK AISI's pre-deployment suite explicitly includes multi-turn jailbreak and agentic-trajectory probes. This sits inside a wider regulatory turn toward evaluation-based gating, where dangerous-capability assessments inform deployment decisions ⁴ and where self-regulation is treated as a first step that government standards, registration and reporting must eventually backstop ². Yet definitional instability in the surrounding categories complicates uptake: scholarship tracing how the AI Act shifted across versions among 'AI system, general purpose AI system, foundation model, and generative AI' ⁶ shows the regulated objects themselves remain unsettled, and that autonomous content generation strains legal categories of accountability and control ⁷.

Debates and Open Questions

The empirical consensus is emerging, not settled, and the central dispute is standardisation. Because each frontier lab uses a different multi-turn methodology, cross-vendor comparison is fraught and the Frontier Foundation Model Eval Consortium is converging only slowly. This feeds the methodology-laundering risk the notes field names: a bare claim that 'we did multi-turn evaluation' can substitute for substantive protocol disclosure, defeating the comparability that Art. 55(1)(a) reporting presupposes. The literature on technical AI governance catalogs exactly this class of measurement and verification gap as an open problem ⁵, and transparency scholarship notes there is still 'no mature standard for documenting AI models' ⁸, leaving regulators reliant on provider self-report. That reliance is itself contested: work on the political economy of algorithmic audits warns that audit and evaluation markets can entrench rather than constrain the power they claim to check ⁹, and oversight scholarship finds legally mandated human review is often a 'rubber-stamp' unless effectiveness conditions are explicitly engineered ¹⁰. A further open question is elicitation sufficiency: since multi-turn prompting can raise measured capability (Wei et al. 2022), no protocol can prove it has elicited a model's ceiling, so negative results remain provisional rather than dispositive.

Use in governance

How instruments operationalise this concept

Instrument	Jurisdiction	Status
EU AI Act	EU	in force
NIST AI RMF Generative AI Profile	US	in force

Appears in topic articles

Social-science evidence — the “so-what”

What the peer-reviewed social science shows: whether the harm this concept addresses is empirically real, and whether governance of it works. The badge is the epistemic status of the evidence(not the policy debate) — “thin” or “absent” efficacy evidence is itself a finding (the “second silence”). Each epistemic-status label is Policy Window's editorial assessment of the cited evidence base (a structured classification), not a verdict any single source issues.

Is the harm real?evidence: established
The phenomenon multi-turn evaluation targets is empirically real and well-measured: failures that single-turn probes miss appear reliably across conversation. Sycophancy is a general behavior of RLHF-trained assistants, observed across five state-of-the-art models and traced to human-preference data where humans and preference models prefer convincingly-written sycophantic responses over correct ones (Sharma et al. 2023). A substantial multi-turn performance/reliability collapse is documented: Laban et al. 2025 report an average ~39% drop across six generation tasks, driven mainly by a large increase in unreliability (with only a minor aptitude loss); MT-Bench-101 (Bai et al. 2024) finds LLM performance often declines across dialogue turns in tasks requiring sustained context memory and resistance to interference. And multi-turn jailbreaks succeed where single-turn attacks fail: Russinovich et al. 2024's Crescendo escalates from benign to harmful over a dialogue and achieves high per-task attack success rates against GPT-4 and Gemini-Pro that single-turn baselines do not (their paper reports per-task ASR, not a single aggregate figure). Caveat: this establishes that multi-turn interaction surfaces distinct failures, not that any single multi-turn protocol is canonical.
Sources: Sharma et al. 2023 (Towards Understanding Sycophancy in Language Models, arXiv:2310.13548); Laban, Hayashi, Zhou & Neville 2025 (LLMs Get Lost in Multi-Turn Conversation, arXiv:2505.06120); Bai et al. 2024 (MT-Bench-101, ACL 2024, arXiv:2402.14762); Russinovich, Salem & Eldan 2024 (Crescendo Multi-Turn Jailbreak, arXiv:2404.01833)
Does governance work?evidence: thin
Multi-turn evaluation demonstrably reveals more harms than single-turn testing as a detection instrument (Perez et al. 2022 used LM-generated red-teaming, including conversational/dialogue test cases, to auto-surface tens of thousands of harmful replies; Russinovich et al. 2024 show single-turn testing understates real risk because benign-then-escalating dialogue jailbreaks models that resist single-turn attacks), but there is no rigorous evidence that adopting multi-turn evaluation as a governance regime measurably reduces downstream harm, no validated/standardized multi-turn safety protocol relied on in regulation, and no agreed coverage guarantee for how many turns or which dynamics suffice. The available evidence shows only added detection signal, not demonstrated mitigation efficacy.
Sources: Perez et al. 2022 (Red Teaming Language Models with Language Models, EMNLP 2022, arXiv:2202.03286); Russinovich, Salem & Eldan 2024 (Crescendo Multi-Turn Jailbreak, arXiv:2404.01833)

Editorial note

Multi-turn evaluation is the umbrella; specific protocols (many-shot probing, agentic trajectories, conversational red-teaming) are sub-cases. When citing in policy text, name the specific protocol to avoid the methodology-laundering risk where 'we did multi-turn evaluation' substitutes for substantive methodology disclosure.

References

Sources cited inline in the analysis, numbered in order of appearance.

Zheng, L., et al. (2023), 'Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena' — operationalises the multi-turn evaluation protocol for foundation models. Multi-Turn Evaluation. arXiv:2306.05685 — Zheng, L., et al. (2023), 'Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena' — operationalises the multi-turn evaluation protocol for foundation models. ↩
Anderljung, Barnhart, Korinek, et al. (2023) Frontier AI Regulation: Managing Emerging Risks to Public Safety, arXiv. arXiv:2307.03718 — Argues "industry self-regulation is an important first step" but "government intervention will be needed", proposing safety standards, registration and reporting, and compliance mechanisms. ↩
Hacker, Engel & Mauer (2023) Regulating ChatGPT and other Large Generative AI Models, ACM FAccT '23. 10.1145/3593013.3594067 — Argues AI regulation "has primarily focused on conventional AI models, not LGAIMs" and should target "concrete high-risk applications, and not the pre-trained model itself". ↩
Mary Phuong, Matthew Aitchison, Elliot Catt, et al. (Google DeepMind) (2024) Evaluating Frontier Models for Dangerous Capabilities, arXiv (cs.LG). arXiv:2403.13793 — Pilots dangerous-capability evaluations (persuasion, cyber, self-proliferation) on frontier models, finding 'early warning signs' but no strong present danger — grounding evaluation-based gating. ↩
Anka Reuel, Ben Bucknall, Stephen Casper, Tim Fist, Lennart Heim, et al. (34 authors) (2024) Open Problems in Technical AI Governance, arXiv (cs.CY). arXiv:2407.14981 — Catalogs open problems in 'technical analysis and tools for supporting the effective governance of AI', including compute measurement, verification and reporting gaps. ↩
David Fernández-Llorca, Emilia Gómez, Ignacio Sánchez, Gabriele Mazzini (2025) An interdisciplinary account of the terminological choices by EU policymakers ahead of the final agreement on the AI Act: AI system, general purpose AI system, foundation model, and generative AI, Artificial Intelligence and Law. 10.1007/s10506-024-09412-y — Traces how the AI Act's legal text shifted across versions among the terms 'AI system, general purpose AI system, foundation model, and generative AI', exposing definitional instability in the regime. ↩
Martina Hulok (2025) The EU model of AI governance: regulating artificial intelligence through law and policy, ERA Forum. 10.1007/s12027-025-00869-1 — Analyses how the AI Act's risk-based model handles general-purpose and foundation models whose 'autonomous content generation challenges legal categories of authorship, accountability, and control'. ↩
Henrik Palmer Olsen, Thomas Troels Hildebrandt, Cornelius Wiesener, Matthias Smed Larsen, Asbjørn William Ammitzbøll Flügge (2024) The Right to Transparency in Public Governance: Freedom of Information and the Use of Artificial Intelligence by Public Agencies, Digital Government: Research and Practice. 10.1145/3632753 — Finds freedom-of-information regimes "generally only grant access to existing documents" and that with "no mature standard for documenting AI models," public-sector AI transparency is limited. ↩
Petros Terzis, Michael Veale, Noëlle Gaumann (2024) Law and the Emerging Political Economy of Algorithmic Audits, Proceedings of the 2024 ACM Conference on Fairness, Accounta. 10.1145/3630106.3658970 — Analyses how AI-audit mandates create a new political economy of auditing, warning that audit markets can entrench rather than constrain power without underlying governance. ↩
Sarah Sterz, Kevin Baum, Sebastian Biewer, Holger Hermanns, Anne Lauber-Rönsberg, Philip Meinel, Markus Langer (2024) On the Quest for Effectiveness in Human Oversight: Interdisciplinary Perspectives, Proceedings of the 2024 ACM Conference on Fairness, Accounta. 10.1145/3630106.3659051 — Synthesises interdisciplinary evidence to argue that legally mandated human oversight of AI is often ineffective ('rubber-stamp') unless effectiveness conditions are explicitly designed for. ↩

Cite this article 8 formats · BibTeX, RIS, APA, Chicago, … · 1-click copy

@misc{policywindow-multi-turn-evaluation,
  title  = {Multi-Turn Evaluation},
  author = {Policy Window},
  year   = {n.d.},
  howpublished = {multi-turn-evaluation — safety},
  url    = {https://policywindow.org/wiki/multi-turn-evaluation},
  note   = {Primary source: https://arxiv.org/abs/2306.05685}
}

Verify the year + paste-and-refine. Primary source linked in BibTeX/RIS note.

Permalink downloads.bib .ris .csl.json

Persistent identifier: https://policywindow.org/wiki/multi-turn-evaluation — committed-stable URL with content-versioning via ?asOf= (rollout pending per methodology §7). DOIs via Zenodo are on the roadmap.

Article tools — track changes, suggest an edit

View history — every captured revision of this article · What links here

Source: Edit on GitHub (search for `multi-turn-evaluation`)

[ref-1] Zheng, L., et al. (2023), 'Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena' — operationalises the multi-turn evaluation protocol for foundation models. Multi-Turn Evaluation. arXiv:2306.05685 — Zheng, L., et al. (2023), 'Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena' — operationalises the multi-turn evaluation protocol for foundation models. ↩

[ref-2] Anderljung, Barnhart, Korinek, et al. (2023) Frontier AI Regulation: Managing Emerging Risks to Public Safety, arXiv. arXiv:2307.03718 — Argues "industry self-regulation is an important first step" but "government intervention will be needed", proposing safety standards, registration and reporting, and compliance mechanisms. ↩

[ref-3] Hacker, Engel & Mauer (2023) Regulating ChatGPT and other Large Generative AI Models, ACM FAccT '23. 10.1145/3593013.3594067 — Argues AI regulation "has primarily focused on conventional AI models, not LGAIMs" and should target "concrete high-risk applications, and not the pre-trained model itself". ↩

[ref-4] Mary Phuong, Matthew Aitchison, Elliot Catt, et al. (Google DeepMind) (2024) Evaluating Frontier Models for Dangerous Capabilities, arXiv (cs.LG). arXiv:2403.13793 — Pilots dangerous-capability evaluations (persuasion, cyber, self-proliferation) on frontier models, finding 'early warning signs' but no strong present danger — grounding evaluation-based gating. ↩

[ref-5] Anka Reuel, Ben Bucknall, Stephen Casper, Tim Fist, Lennart Heim, et al. (34 authors) (2024) Open Problems in Technical AI Governance, arXiv (cs.CY). arXiv:2407.14981 — Catalogs open problems in 'technical analysis and tools for supporting the effective governance of AI', including compute measurement, verification and reporting gaps. ↩

[ref-6] David Fernández-Llorca, Emilia Gómez, Ignacio Sánchez, Gabriele Mazzini (2025) An interdisciplinary account of the terminological choices by EU policymakers ahead of the final agreement on the AI Act: AI system, general purpose AI system, foundation model, and generative AI, Artificial Intelligence and Law. 10.1007/s10506-024-09412-y — Traces how the AI Act's legal text shifted across versions among the terms 'AI system, general purpose AI system, foundation model, and generative AI', exposing definitional instability in the regime. ↩

[ref-7] Martina Hulok (2025) The EU model of AI governance: regulating artificial intelligence through law and policy, ERA Forum. 10.1007/s12027-025-00869-1 — Analyses how the AI Act's risk-based model handles general-purpose and foundation models whose 'autonomous content generation challenges legal categories of authorship, accountability, and control'. ↩

[ref-8] Henrik Palmer Olsen, Thomas Troels Hildebrandt, Cornelius Wiesener, Matthias Smed Larsen, Asbjørn William Ammitzbøll Flügge (2024) The Right to Transparency in Public Governance: Freedom of Information and the Use of Artificial Intelligence by Public Agencies, Digital Government: Research and Practice. 10.1145/3632753 — Finds freedom-of-information regimes "generally only grant access to existing documents" and that with "no mature standard for documenting AI models," public-sector AI transparency is limited. ↩

[ref-9] Petros Terzis, Michael Veale, Noëlle Gaumann (2024) Law and the Emerging Political Economy of Algorithmic Audits, Proceedings of the 2024 ACM Conference on Fairness, Accounta. 10.1145/3630106.3658970 — Analyses how AI-audit mandates create a new political economy of auditing, warning that audit markets can entrench rather than constrain power without underlying governance. ↩

[ref-10] Sarah Sterz, Kevin Baum, Sebastian Biewer, Holger Hermanns, Anne Lauber-Rönsberg, Philip Meinel, Markus Langer (2024) On the Quest for Effectiveness in Human Oversight: Interdisciplinary Perspectives, Proceedings of the 2024 ACM Conference on Fairness, Accounta. 10.1145/3630106.3659051 — Synthesises interdisciplinary evidence to argue that legally mandated human oversight of AI is often ineffective ('rubber-stamp') unless effectiveness conditions are explicitly designed for. ↩

Multi-Turn Evaluation

Definition & scope

Definition and Distinctions from Single-Turn Benchmarks

Mechanisms: What Trajectories Surface That Snapshots Hide

Governance Relevance: Instruments and Provisions Engaged

Debates and Open Questions

Use in governance

How instruments operationalise this concept

Appears in topic articles

Editorial note

See also

Further reading

References

Multi-Turn Evaluation

Definition & scope

Definition and Distinctions from Single-Turn Benchmarks

Mechanisms: What Trajectories Surface That Snapshots Hide

Governance Relevance: Instruments and Provisions Engaged

Debates and Open Questions

Use in governance

How instruments operationalise this concept

Appears in topic articles

Editorial note

See also

Further reading

References