Tool-Use Safety

Policy Window Editorial Board

Tool-Use Safety

tool-use-safety · Frontier safety

Concept

Tools

Last verified 2026-06-21

Cite Share PDF

The sub-domain of agentic-system safety concerned with the risks that arise when an AI model invokes external tools (search, code execution, APIs, financial transactions, system commands) — including risks of unintended action, instruction subversion, privilege escalation, and resource consumption.

Definition & scope

Field consensus on this concept:emerging

Tool-use safety treats the model + tool surface as the unit of analysis rather than the model in isolation. The risk surface expands along several axes: (a) capability composition — a chat-safe model may become capability-dangerous when given a code-execution tool plus internet access; (b) instruction-channel adversaries — tool outputs are an indirect-prompt-injection vector (a web search result containing adversarial instructions); (c) privilege escalation — tools that share authentication with the user may be invoked beyond user intent; (d) resource exhaustion — agents can spend money, compute, or API credits at machine speed; (e) confused-deputy attacks — the tool acts with the user's authority on instructions actually from a third party. Mitigation patterns include: capability allowlists (only specific tools, specific scopes), human-in-the-loop confirmation for high-impact actions (the OpenAI Operator + Anthropic Computer Use UX patterns), output-isolation tags (Anthropic's tool-result-tag scheme), and gateway-LLM patterns (Wallace et al. 2024 dual-LLM). NIST AI RMF GenAI Profile §2.7 'Value Chain and Component Integration' touches the tool-integration risk. EU AI Act Art. 14 'human oversight' is the closest binding obligation but presumes human-bandwidth-feasible review, which agentic systems break at scale. Industry-side frameworks (Anthropic RSP, OpenAI Preparedness) treat tool-use capability as a tier-relevant signal.

Unit of Analysis and Distinctions

Tool-use safety reframes the object of governance from the model in isolation to the model-plus-tool surface. A chat-only model that is benign in conversation can become capability-dangerous once granted code execution plus internet access, because capability composition produces an action space larger than any constituent part. This is the analytical wedge that separates tool-use safety from generic agentic-system safety: the action surface is mediated by discrete, enumerable tool calls rather than open-ended autonomy. The boundary blurs, however, when one of those tools is code execution — effectively a universal action that re-admits the open-endedness the discrete framing tried to bound. Dangerous-capability evaluations on frontier models ¹ treat self-proliferation and cyber-offence as composed capabilities, finding 'early warning signs' but no present strong danger — exactly the composition the tool surface unlocks. The evaluation paradigm itself was proposed precisely to probe such emergent dangers before deployment ², and red-team work shows the stakes concretely: tool-augmented LLMs 'may also confer easy access to dual-use technologies capable of inflicting great harm' ³.

Adversarial Mechanisms and Mitigations

Four distinct failure mechanisms structure the field. Instruction-channel subversion treats tool outputs as an indirect-prompt-injection vector: a web result carrying adversarial text can rewrite the agent's task. Privilege escalation arises where a tool shares authentication with the user, letting invocations exceed user intent; the confused-deputy variant has the tool act with the user's authority on a third party's instructions. Resource exhaustion lets agents spend money, compute, or API credits at machine speed. The canonical industry defence is an instruction hierarchy that trains models to prioritise privileged instructions over untrusted tool content ⁴. Complementary heuristics include capability allowlists scoping which tools and scopes are reachable, output-isolation tags around tool results, the dual-LLM gateway pattern, and human-in-the-loop confirmation as seen in the Operator and Computer Use UX. These remain largely self-imposed: governance scholars argue frontier-AI oversight should begin with high-level safety principles and migrate to detailed rules — including mandated dangerous-capability evaluations — as regulatory capacity matures ⁵, while broader critique holds that 'AI safety research is lagging' relative to the pace of deployment ⁶.

Governance Relevance

Binding law engages tool-use safety only obliquely. EU AI Act Art. 14 'human oversight' is the closest mandatory hook, but it presumes review at human bandwidth — an assumption agentic systems break when they fire tool calls at machine speed, leaving the per-action confirmation model economically and cognitively infeasible at scale. The NIST AI RMF GenAI Profile names the tool-integration risk descriptively under its 'Value Chain and Component Integration' category but is voluntary. Definitional instability compounds the gap: EU policymakers shifted among 'AI system, general purpose AI system, foundation model, and generative AI' ⁷, and the risk-based model strains where autonomous generation 'challenges legal categories of authorship, accountability, and control' ⁸. Industry frameworks — Anthropic RSP, OpenAI Preparedness — instead treat tool-use capability as a tier-relevant signal, gated by evaluations like ¹.

Debates and Open Questions

The empirical consensus is emerging rather than settled, and several questions remain open. First, the boundary problem: if code execution is a universal action, is tool-use safety a coherent sub-domain or merely agentic safety re-labelled? Second, the oversight-feasibility gap — whether Art. 14-style human review can scale, or whether automated gateways must replace it, trading one trust assumption for another. Third, the risk-temporality debate: Kasirzadeh ⁹ distinguishes 'decisive' sudden-takeover risk from 'accumulative' erosion, and tool-enabled agents plausibly drive the accumulative path through many small unauthorised actions. Bengio, Hinton and colleagues warn that 'AI safety research is lagging' and that present governance 'lacks the mechanisms and institutions to prevent misuse and recklessness' ⁶ — a critique that lands squarely on the voluntary, evaluation-gated status of current tool-use-safety practice.

Use in governance

How instruments operationalise this concept

Instrument	Jurisdiction	Status
NIST AI RMF Generative AI Profile	US	in force

Appears in topic articles

Social-science evidence — the “so-what”

What the peer-reviewed social science shows: whether the harm this concept addresses is empirically real, and whether governance of it works. The badge is the epistemic status of the evidence(not the policy debate) — “thin” or “absent” efficacy evidence is itself a finding (the “second silence”). Each epistemic-status label is Policy Window's editorial assessment of the cited evidence base (a structured classification), not a verdict any single source issues.

Is the harm real?evidence: established
The core risk is empirically well-demonstrated and replicated: Greshake et al. 2023 showed that indirect prompt injection in tool-augmented, LLM-integrated applications can hijack a model via untrusted content returned by tools (search results, emails, web pages), with practical attacks against real systems such as Bing's GPT-4-powered Chat. Dedicated agent benchmarks confirm this generalizes — InjecAgent (Zhan et al. 2024) found ReAct-prompted GPT-4 hijacked in ~24% of tool-integrated cases, and AgentDojo (Debenedetti et al. 2024) operationalizes the failure across 97 tasks / 629 security test cases in banking, email, and workspace settings. Caveat: attack success rates vary widely by model, scaffold, and defense, but the failure mode itself (tools as an untrusted-input attack surface) is robust and replicated.
Sources: Greshake et al. 2023 (Not what you've signed up for: Compromising real-world LLM-integrated applications with indirect prompt injection, ACM AISec; arXiv:2302.12173); Zhan et al. 2024 (InjecAgent: Benchmarking Indirect Prompt Injections in Tool-Integrated LLM Agents, Findings of ACL; arXiv:2403.02691); Debenedetti et al. 2024 (AgentDojo: A Dynamic Environment to Evaluate Prompt Injection Attacks and Defenses for LLM Agents, NeurIPS Datasets & Benchmarks; arXiv:2406.13352)
Does governance work?evidence: thin
Mitigations reduce but do not eliminate tool-use attacks, and no defense is shown robust against adaptive adversaries: design-based isolation like CaMeL (Debenedetti et al. 2025) solves 77% of AgentDojo tasks with provable security (vs 84% undefended) — strong but neither full task utility nor a universal guarantee — while 'The Attacker Moves Second' (Nasr, Carlini et al. 2025) bypassed 12 recent defenses with attack success rates above 90% for most, despite several originally reporting near-zero rates, and the largest public agent red-teaming competition (Zou et al. 2025) logged over 60,000 successful policy violations from 1.8 million injection attempts across 22 frontier agents and 44 deployment scenarios. There is no validated governance regime or technical mitigation demonstrated to reliably prevent tool-use hijacking at deployment scale.
Sources: Debenedetti et al. 2025 (CaMeL: Defeating Prompt Injections by Design; arXiv:2503.18813); Nasr, Carlini et al. 2025 (The Attacker Moves Second: Stronger Adaptive Attacks Bypass Defenses Against LLM Jailbreaks and Prompt Injections; arXiv:2510.09023); Zou et al. 2025 (Security Challenges in AI Agent Deployment: Insights from a Large Scale Public Competition; arXiv:2507.20526)

Editorial note

Tool-use safety is the sub-problem of agentic-system safety where the action surface is mediated by discrete tool calls. The boundary with general agentic-system safety is fuzzy when tools include code execution (which is effectively a universal action). Currency (2026-06-21): Definition accurate; material developments are real tool-use incidents and new heuristic mitigations plus incident-reporting regulation.

References

Sources cited inline in the analysis (linked from the superscript markers), then the primary instrument sources behind the classifications.

Mary Phuong, Matthew Aitchison, Elliot Catt, et al. (Google DeepMind) (2024) Evaluating Frontier Models for Dangerous Capabilities, arXiv (cs.LG). arXiv:2403.13793 — Pilots dangerous-capability evaluations (persuasion, cyber, self-proliferation) on frontier models, finding 'early warning signs' but no strong present danger — grounding evaluation-based gating. ↩
Shevlane, Farquhar, Garfinkel, et al. (2023) Model evaluation for extreme risks, arXiv. arXiv:2305.15324 — Proposes "dangerous capability evaluations" and alignment evaluations of frontier models so developers and policymakers can make "responsible decisions about model training, deployment, and security". ↩
Soice, Rocha, Cordova, Specter, Esvelt (2023) Can large language models democratize access to dual-use biotechnology?, arXiv. arXiv:2306.03809 — Red-team exercise finding LLM chatbots "may also confer easy access to dual-use technologies capable of inflicting great harm" and could make pandemic-class agents more widely accessible. ↩
arXiv:2404.13208 ↩
Jonas Schuett, Markus Anderljung, Alexis Carlier, Leonie Koessler, Ben Garfinkel (Centre for the Governance of AI) (2024) From Principles to Rules: A Regulatory Approach for Frontier AI, arXiv (GovAI working paper). arXiv:2407.07300 — Recommends frontier-AI regulation begin with high-level safety principles and migrate to detailed rules (e.g., mandated dangerous-capability evaluations) as regulatory capacity matures. ↩
Bengio, Hinton, Yao, Song, et al. (2024) Managing extreme AI risks amid rapid progress, Science. 10.1126/science.adn0117 — Warns "AI safety research is lagging" and present governance initiatives "lack the mechanisms and institutions to prevent misuse and recklessness", urging adaptive governance plus safety R&D. ↩
David Fernández-Llorca, Emilia Gómez, Ignacio Sánchez, Gabriele Mazzini (2025) An interdisciplinary account of the terminological choices by EU policymakers ahead of the final agreement on the AI Act: AI system, general purpose AI system, foundation model, and generative AI, Artificial Intelligence and Law. 10.1007/s10506-024-09412-y — Traces how the AI Act's legal text shifted across versions among the terms 'AI system, general purpose AI system, foundation model, and generative AI', exposing definitional instability in the regime. ↩
Martina Hulok (2025) The EU model of AI governance: regulating artificial intelligence through law and policy, ERA Forum. 10.1007/s12027-025-00869-1 — Analyses how the AI Act's risk-based model handles general-purpose and foundation models whose 'autonomous content generation challenges legal categories of authorship, accountability, and control'. ↩
Atoosa Kasirzadeh (2025) Two types of AI existential risk: decisive and accumulative, Philosophical Studies. 10.1007/s11098-025-02301-3 — Distinguishes 'decisive' (sudden takeover) from 'accumulative' AI existential risk, arguing governance must address gradual societal erosion as well as abrupt scenarios. ↩
Wallace, E., et al. (2024), 'The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions' (OpenAI) — the canonical industry articulation of instruction-channel hierarchy as a tool-use-safety defence.

Cite this article 8 formats · BibTeX, RIS, APA, Chicago, … · 1-click copy

@misc{policywindow-tool-use-safety,
  title  = {Tool-Use Safety},
  author = {Policy Window},
  year   = {n.d.},
  howpublished = {tool-use-safety — safety},
  url    = {https://policywindow.org/wiki/tool-use-safety},
  note   = {Primary source: https://arxiv.org/abs/2402.07896}
}

Verify the year + paste-and-refine. Primary source linked in BibTeX/RIS note.

Permalink downloads.bib .ris .csl.json

Persistent identifier: https://policywindow.org/wiki/tool-use-safety — committed-stable URL with content-versioning via ?asOf= (rollout pending per methodology §7). DOIs via Zenodo are on the roadmap.

Article tools — track changes, suggest an edit

View history — every captured revision of this article · What links here

Source: Edit on GitHub (search for `tool-use-safety`)

[ref-1] Mary Phuong, Matthew Aitchison, Elliot Catt, et al. (Google DeepMind) (2024) Evaluating Frontier Models for Dangerous Capabilities, arXiv (cs.LG). arXiv:2403.13793 — Pilots dangerous-capability evaluations (persuasion, cyber, self-proliferation) on frontier models, finding 'early warning signs' but no strong present danger — grounding evaluation-based gating. ↩

[ref-2] Shevlane, Farquhar, Garfinkel, et al. (2023) Model evaluation for extreme risks, arXiv. arXiv:2305.15324 — Proposes "dangerous capability evaluations" and alignment evaluations of frontier models so developers and policymakers can make "responsible decisions about model training, deployment, and security". ↩

[ref-3] Soice, Rocha, Cordova, Specter, Esvelt (2023) Can large language models democratize access to dual-use biotechnology?, arXiv. arXiv:2306.03809 — Red-team exercise finding LLM chatbots "may also confer easy access to dual-use technologies capable of inflicting great harm" and could make pandemic-class agents more widely accessible. ↩

[ref-4] arXiv:2404.13208 ↩

[ref-5] Jonas Schuett, Markus Anderljung, Alexis Carlier, Leonie Koessler, Ben Garfinkel (Centre for the Governance of AI) (2024) From Principles to Rules: A Regulatory Approach for Frontier AI, arXiv (GovAI working paper). arXiv:2407.07300 — Recommends frontier-AI regulation begin with high-level safety principles and migrate to detailed rules (e.g., mandated dangerous-capability evaluations) as regulatory capacity matures. ↩

[ref-6] Bengio, Hinton, Yao, Song, et al. (2024) Managing extreme AI risks amid rapid progress, Science. 10.1126/science.adn0117 — Warns "AI safety research is lagging" and present governance initiatives "lack the mechanisms and institutions to prevent misuse and recklessness", urging adaptive governance plus safety R&D. ↩

[ref-7] David Fernández-Llorca, Emilia Gómez, Ignacio Sánchez, Gabriele Mazzini (2025) An interdisciplinary account of the terminological choices by EU policymakers ahead of the final agreement on the AI Act: AI system, general purpose AI system, foundation model, and generative AI, Artificial Intelligence and Law. 10.1007/s10506-024-09412-y — Traces how the AI Act's legal text shifted across versions among the terms 'AI system, general purpose AI system, foundation model, and generative AI', exposing definitional instability in the regime. ↩

[ref-8] Martina Hulok (2025) The EU model of AI governance: regulating artificial intelligence through law and policy, ERA Forum. 10.1007/s12027-025-00869-1 — Analyses how the AI Act's risk-based model handles general-purpose and foundation models whose 'autonomous content generation challenges legal categories of authorship, accountability, and control'. ↩

[ref-9] Atoosa Kasirzadeh (2025) Two types of AI existential risk: decisive and accumulative, Philosophical Studies. 10.1007/s11098-025-02301-3 — Distinguishes 'decisive' (sudden takeover) from 'accumulative' AI existential risk, arguing governance must address gradual societal erosion as well as abrupt scenarios. ↩

[ref-10] Wallace, E., et al. (2024), 'The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions' (OpenAI) — the canonical industry articulation of instruction-channel hierarchy as a tool-use-safety defence.

Tool-Use Safety

Definition & scope

Unit of Analysis and Distinctions

Adversarial Mechanisms and Mitigations

Governance Relevance

Debates and Open Questions

Use in governance

How instruments operationalise this concept

Appears in topic articles

Editorial note

See also

Further reading

References

Tool-Use Safety

Definition & scope

Unit of Analysis and Distinctions

Adversarial Mechanisms and Mitigations

Governance Relevance

Debates and Open Questions

Use in governance

How instruments operationalise this concept

Appears in topic articles

Editorial note

See also

Further reading

References