The sub-domain of agentic-system safety concerned with the risks that arise when an AI model invokes external tools (search, code execution, APIs, financial transactions, system commands) — including risks of unintended action, instruction subversion, privilege escalation, and resource consumption.
Definition and scope
Tool-use safety treats the model plus its tool surface as the unit of analysis, rather than the model in isolation. The risk surface expands along several axes: (a) capability composition: a model that is safe in a chat-only setting may become dangerous when given a code-execution tool plus internet access; (b) instruction-channel adversaries: tool outputs are an indirect-prompt-injection vector (for example, a web search result containing adversarial instructions); (c) privilege escalation: tools that share authentication with the user may be invoked beyond what the user intended; (d) resource exhaustion: agents can spend money, compute, or API credits at machine speed; (e) confused-deputy attacks: the tool acts with the user's authority on instructions that actually come from a third party.
Mitigation patterns include capability allowlists (only specific tools, with specific scopes), human-in-the-loop confirmation for high-impact actions (the confirmation UX in OpenAI Operator and Anthropic Computer Use), output-isolation tags (Anthropic's tool-result-tag scheme), and gateway-LLM patterns such as the dual-LLM design (Wallace et al. 2024).
NIST AI RMF GenAI Profile §2.12 'Value Chain and Component Integration' touches on tool-integration risk. EU AI Act Art. 14 'human oversight' is the closest binding obligation, but it presumes review at a pace a human can sustain, which agentic systems break at scale. Industry-side frameworks (Anthropic RSP, OpenAI Preparedness) treat tool-use capability as a tier-relevant signal.
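Several of the mitigation patterns above can be combined in a single gateway that sits between the model and its tools. The following is a minimal sketch, not any vendor's implementation: `ToolGateway`, `ToolSpec`, `wrap_untrusted`, the `<tool_result>` tag name, and the cost units are all hypothetical names chosen for illustration. It shows an allowlist check, a spend budget, a human-confirmation hook for high-impact tools, and delimiting of tool output as untrusted data.

```python
import html
from dataclasses import dataclass
from typing import Callable, Dict, Set


class ToolCallDenied(Exception):
    """Raised when the gateway refuses a tool call."""


@dataclass
class ToolSpec:
    func: Callable[..., str]
    high_impact: bool = False  # if True, a human must approve each call
    cost: float = 0.0          # hypothetical per-call cost, arbitrary units


def wrap_untrusted(tool_name: str, text: str) -> str:
    # Escape markup so the tool output cannot close the delimiter early,
    # then label it as data rather than instructions. The tag name is an
    # assumption for this sketch, not a documented scheme.
    safe = html.escape(text)
    return f'<tool_result tool="{tool_name}" trust="untrusted">{safe}</tool_result>'


@dataclass
class ToolGateway:
    tools: Dict[str, ToolSpec]
    allowlist: Set[str]                    # capability allowlist
    budget: float                          # cap against resource exhaustion
    confirm: Callable[[str, dict], bool]   # human-in-the-loop hook
    spent: float = 0.0

    def call(self, name: str, **kwargs) -> str:
        if name not in self.allowlist or name not in self.tools:
            raise ToolCallDenied(f"tool {name!r} is not allowlisted")
        spec = self.tools[name]
        if self.spent + spec.cost > self.budget:
            raise ToolCallDenied(f"budget exceeded for {name!r}")
        if spec.high_impact and not self.confirm(name, kwargs):
            raise ToolCallDenied(f"reviewer declined {name!r}")
        self.spent += spec.cost
        return wrap_untrusted(name, spec.func(**kwargs))


if __name__ == "__main__":
    gateway = ToolGateway(
        tools={
            "search": ToolSpec(func=lambda query: f"results for {query}", cost=1.0),
            "transfer_funds": ToolSpec(
                func=lambda amount: f"sent {amount}", high_impact=True, cost=5.0
            ),
        },
        allowlist={"search"},
        budget=3.0,
        confirm=lambda name, args: False,  # stand-in for a real approval UI
    )
    print(gateway.call("search", query="tool-use safety"))
    try:
        gateway.call("transfer_funds", amount=100)
    except ToolCallDenied as err:
        print("denied:", err)
```

Delimiting tool output only marks it as untrusted; it does not by itself stop a model from following injected instructions, so a sketch like this would normally sit alongside the other mitigations rather than replace them.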
Related concepts
- Agentic AI System: An AI system that takes actions in the world, calling tools, executing code, browsing the web, send…
- Scalable Oversight: The set of techniques for supervising AI systems whose outputs are too complex, too numerous, or too…
- Prompt Injection: An adversarial input technique in which untrusted content fed to an AI model (e.g., text on a webpag…
- AI Alignment: The technical problem of designing AI systems whose objectives, behaviour, and emergent goals reliab…
- Capability Elicitation: Techniques designed to reveal the upper bounds of an AI model's capabilities, rather than measuring…
Editorial note
Tool-use safety is the sub-problem of agentic-system safety where the action surface is mediated by discrete tool calls. The boundary with general agentic-system safety is fuzzy when tools include code execution (which is effectively a universal action).