Print-friendly view · use your browser's Save as PDF option (Cmd/Ctrl-P) to attach this article to a brief.

Tool-Use Safety

tool-use-safety · safety · concept

Source: https://policywindow.org/wiki/tool-use-safety

Generated 2026-05-30T22:11:20 UTC

Summary

The sub-domain of agentic-system safety concerned with the risks that arise when an AI model invokes external tools (search, code execution, APIs, financial transactions, system commands) — including risks of unintended action, instruction subversion, privilege escalation, and resource consumption.

At a glance

Used by: 1 instrument(s)
Related concepts: agentic-system, scalable-oversight, prompt-injection, alignment, capability-elicitation
Primary source: Wallace, E., et al. (2024), 'The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions' (OpenAI) — the canonical industry articulation of instruction-channel hierarchy as a tool-use-safety defence.
Source URL: https://arxiv.org/abs/2402.07896

Details

Tool-use safety treats the model + tool surface as the unit of analysis rather than the model in isolation. The risk surface expands along several axes: (a) capability composition — a chat-safe model may become capability-dangerous when given a code-execution tool plus internet access; (b) instruction-channel adversaries — tool outputs are an indirect-prompt-injection vector (a web search result containing adversarial instructions); (c) privilege escalation — tools that share authentication with the user may be invoked beyond user intent; (d) resource exhaustion — agents can spend money, compute, or API credits at machine speed; (e) confused-deputy attacks — the tool acts with the user's authority on instructions actually from a third party. Mitigation patterns include: capability allowlists (only specific tools, specific scopes), human-in-the-loop confirmation for high-impact actions (the OpenAI Operator + Anthropic Computer Use UX patterns), output-isolation tags (Anthropic's tool-result-tag scheme), and gateway-LLM patterns (Wallace et al. 2024 dual-LLM). NIST AI RMF GenAI Profile §2.7 'Value Chain and Component Integration' touches the tool-integration risk. EU AI Act Art. 14 'human oversight' is the closest binding obligation but presumes human-bandwidth-feasible review, which agentic systems break at scale. Industry-side frameworks (Anthropic RSP, OpenAI Preparedness) treat tool-use capability as a tier-relevant signal.

How to cite this article

APA

Policy Window. (n.d.). Tool-Use Safety [Wiki article — Concept]. https://policywindow.org/wiki/tool-use-safety

Chicago

Policy Window. n.d.. "Tool-Use Safety." Wiki article (Concept). https://policywindow.org/wiki/tool-use-safety.

Harvard

Policy Window (n.d.) 'Tool-Use Safety', Wiki article — Concept, available at: https://policywindow.org/wiki/tool-use-safety.

OSCOLA

Policy Window, 'Tool-Use Safety' (Wiki article — Concept, n.d.) <https://policywindow.org/wiki/tool-use-safety> accessed [date].

BibTeX

@misc{policywindow-tool-use-safety,
  title  = {Tool-Use Safety},
  author = {Policy Window},
  year   = {n.d.},
  howpublished = {tool-use-safety — safety},
  url    = {https://policywindow.org/wiki/tool-use-safety},
  note   = {Primary source: https://arxiv.org/abs/2402.07896}
}