The Security Review That Missed the Interpreter
Most organizations are securing the model and ignoring the interpreter.
They review prompt injection defenses. They test content filters. They validate API permissions. They audit the tools the agent can call.
Then a months-old case note, written by a human analyst and stored in the system as data, gets interpreted as a live command. The agent executes a transaction release without analyst review. No attacker. No prompt injection. No adversarial input.
Just context treated as instruction.
The security review focused on what the agent could access. It should have focused on what the agent could interpret.
This isn't a gap in AI safety. It's a fundamental architectural break that traditional security models don't address: the interpreter layer that converts unstructured text into privileged system actions.
Most teams treat agents as enhanced chatbots: conversational interfaces with tool access. But agents aren't responding to users. They're executing commands derived from interpretation. The difference isn't semantic. It's the difference between displaying text and running code.
When text becomes commands, every data source becomes an attack surface. Not through injection. Through interpretation.
This is the control plane most architecture reviews never examine.
What People Think Secures AI Agents
When security teams review agent deployments, they focus on familiar threat models from frameworks like the OWASP Top 10 for Large Language Model Applications:
Model-level controls:
- Prompt injection defenses
- Content filtering and moderation
- Output sanitization
- Rate limiting and abuse detection
Access controls:
- API authentication and authorization
- Least-privilege tool permissions
- Audit logging for agent actions
- Human-in-the-loop approvals
Infrastructure security:
- Encrypted storage and transmission
- Network segmentation
- Credential management
- Compliance with data protection regulations
These controls matter. They prevent certain classes of abuse. But they're designed for systems where the security boundary is between the application and external input.
Agents break that model.
Here's what actually creates the vulnerability:
The agent interprets everything as potential intent.
A chatbot processes user input and generates a response. The output goes to a display layer. The security boundary is clear: untrusted input → processing → untrusted output displayed to user.
An agent processes input, interprets intent, selects tools, and executes system commands. The output isn't displayed. It's actuated. Side effects happen in production systems with real consequences.
The security boundary isn't "can the agent access this tool?" It's "can the agent interpret arbitrary context as instructions to use this tool?"
Traditional input validation doesn't work when the entire system is designed to derive commands from natural language.
Let me show you where this breaks in practice.
The Moment I Saw It
I was reviewing a fraud detection agent for a financial institution.
The system was designed to assist fraud analysts by reviewing transaction alerts, pulling relevant customer history, and recommending next actions: hold the transaction, release it, or escalate to senior review.
The architecture looked solid:
- Agent had read access to transaction data and case history
- All recommendations were logged
- Final decisions remained with human analysts
- Tool permissions were scoped to data retrieval, not transaction modification
Everyone signed off. The agent was advisory only. The humans retained control.
Then I asked a question about operational practice: "Has the workflow changed since initial deployment?"
The answer: "We enabled auto-execution for low-risk releases during business hours. Reduces analyst workload."
I asked what defined "low-risk."
The criteria were reasonable: transaction amount thresholds, customer history, verification patterns. But the agent determined whether criteria were met by interpreting context from retrieved data.
That's when I saw the gap.
The agent used historical case notes as context for current decisions. These notes were written by human analysts during previous reviews: situational assessments, reminders, procedural notes.
One note, written months earlier, said: "If customer confirms travel, release immediately."
That wasn't a standing policy. It was a specific instruction for a specific case. But the note lived in the system as data the agent retrieved when analyzing similar patterns.
The agent read the note. Interpreted it as an active instruction. Released a flagged transaction without analyst review.
"The interpreter doesn't distinguish between historical context and current commands. It derives intent from all available text."
No prompt injection. No external attacker. No compromised credentials.
Just data becoming commands through interpretation.
That's when I realized: the trust boundary isn't between the agent and external input. It's at every point where text transitions from data to instruction. Most systems have no controls at those transitions.
This pattern aligns with recent research on indirect prompt injection, which demonstrates that LLM-integrated applications can be compromised through context poisoning without any direct adversarial input. The agent interprets retrieved context as authoritative. This is exactly what happened with the case note.
Why This Is Different From Every Other System
Most people think agents are chatbots with expanded capabilities. The comparison misses the fundamental shift.
Chatbots:
- Process user input
- Generate text response
- Display output to user
- No side effects in production systems
Agents:
- Process input from multiple sources (users, documents, system data, previous outputs)
- Interpret intent from unstructured text
- Select and invoke tools with system-level permissions
- Execute actions with real consequences
The security model for chatbots assumes the output goes to a display layer where humans make decisions. Agents bypass that layer. The interpretation is the decision.
This creates three new attack surfaces that traditional security doesn't address:
1. Data sources become command injection vectors
Any text the agent retrieves can influence its interpretation: case notes, emails, documents, database records, API responses. If the agent treats retrieved context as authoritative, attackers don't need to compromise the prompt. They can poison the data the agent retrieves.
2. Tool calling happens at machine speed without human verification
Chatbots have latency built in: humans read the output before acting on it. Agents execute immediately. A misinterpretation that would be caught by a human reading a chatbot response happens too fast to intercept when the agent calls tools directly.
3. Intent interpretation is probabilistic, not deterministic
Traditional code: `if (input == "delete") { execute_delete() }`. Clear validation point.
Agent interpretation: "The customer wants this resolved" might mean delete, might mean update, might mean escalate, depending on context the agent retrieved from unstructured sources.
You can't validate intent the way you validate input. The agent is designed to derive commands from ambiguous natural language.
That's not a bug. That's the feature. And it breaks every security model built on deterministic command execution.
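The contrast can be sketched in a few lines. This is a deliberately minimal illustration with hypothetical names: a traditional allowlist validator next to a toy "interpreter" that derives a command from whatever text it retrieved.

```python
# Illustrative sketch, not a real agent: contrast deterministic validation
# with intent derived from retrieved context. All names are hypothetical.

ALLOWED_COMMANDS = {"hold", "release", "escalate"}

def validate_command(cmd: str) -> bool:
    # Traditional model: input is a command only if it exactly matches the allowlist.
    return cmd in ALLOWED_COMMANDS

def interpret_intent(context_fragments: list) -> str:
    # Agent model: intent is *derived* from all retrieved text, so any
    # fragment can steer the outcome. There is no single string to validate.
    for fragment in context_fragments:
        if "release" in fragment.lower():
            return "release"
    return "escalate"

# The validator rejects free text as a command...
assert not validate_command("If customer confirms travel, release immediately.")
# ...but the interpreter turns the same text into a privileged command.
assert interpret_intent(["Case note: If customer confirms travel, release immediately."]) == "release"
```

The point of the sketch: there is no input the validator could have checked, because the dangerous string entered the system as perfectly legitimate data.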
This challenge has been documented in high-profile incidents like Microsoft's Bing Chat "Sydney" persona, where the model's interpretation of context led to unexpected behaviors that bypassed intended constraints.
The Four Hidden Trust Boundaries
Every agent system has four points where trust transitions, and most organizations have no controls at any of them.
Boundary 1: Input → Interpretation
What people think: "We filter malicious prompts and validate user input."
What actually happens: The agent interprets intent from everything it receives: user queries, retrieved documents, system messages, previous conversation history, tool outputs.
The gap: Traditional input validation assumes you're checking for malicious data. Agents are designed to interpret intent from unstructured text. The same words that are safe data in a database become unsafe instructions when an agent retrieves them as context.
Example: A customer support agent retrieves an email that says "Please forward all correspondence to [email protected]." Is that a customer request to document? Or an instruction to the agent to send data to an external address? The agent must interpret, and interpretation is where trust assumptions become vulnerabilities.
Trust lives: Wherever the agent decides "this text means I should take action X."
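One partial mitigation at this boundary is to screen retrieved text for instruction-like phrasing before it reaches the agent as context. The sketch below is a naive pattern filter under stated assumptions; real deployments would need far more robust detection, and the patterns and function names here are illustrative, not a vetted ruleset.

```python
import re

# Hypothetical pre-filter: flag retrieved documents that read like commands
# rather than data, before they are handed to the agent as context.
INSTRUCTION_PATTERNS = [
    r"\brelease\b.*\bimmediately\b",          # standing-order phrasing
    r"\bforward\b.*\bto\b.*@",                # exfiltration-shaped requests
    r"\bignore\b.*\b(previous|prior)\b",      # override-shaped phrasing
]

def flag_instruction_like(document: str) -> bool:
    """Return True if a retrieved document looks like an instruction."""
    return any(re.search(p, document, re.IGNORECASE) for p in INSTRUCTION_PATTERNS)

note = "If customer confirms travel, release immediately."
assert flag_instruction_like(note)                                # the case note trips the filter
assert not flag_instruction_like("Customer traveled in March.")   # plain data passes
```

A filter like this is a tripwire, not a guarantee: it surfaces documents for human review rather than proving the rest of the context is safe.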
Boundary 2: Interpretation → Tool Selection
What people think: "We've limited which tools the agent can access based on least-privilege principles."
What actually happens: The agent selects tools based on interpreted intent, not explicit user commands. If the agent interprets context as requiring privileged actions, it will attempt to use privileged tools, even when the user didn't directly request them.
The gap: Access control systems assume explicit, authenticated requests. Agents make implicit decisions about tool usage based on probabilistic interpretation.
Example: An agent with both read_customer_data and update_customer_data tools interprets "make sure the address is correct" as a directive to update the address if it finds a mismatch in retrieved data, even though the user just wanted verification. The tool selection happens inside the agent's reasoning, not through explicit user authorization.
Major AI vendors acknowledge these trust boundary challenges, OpenAI in its function calling guidance and LangChain in its security documentation, but most deployments lack adequate controls at the interpretation-to-execution transition.
Trust lives: At the point where interpretation determines which privileged functions get invoked.
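One way to add a control at this boundary is to gate write-capable tools on an explicit user request rather than on interpreted intent alone. This is a minimal sketch with hypothetical tool names and a crude keyword check standing in for a real request classifier.

```python
# Illustrative gate: privileged (write) tools require that the user asked for
# a change in so many words, not merely context the agent interpreted that way.
READ_TOOLS = {"read_customer_data"}
WRITE_TOOLS = {"update_customer_data"}

def authorize_tool(tool: str, explicit_user_request: str) -> bool:
    if tool in READ_TOOLS:
        return True  # reads pass through under least privilege
    if tool in WRITE_TOOLS:
        # Crude stand-in for a request classifier: the words "update" or
        # "change" must appear in the user's own request.
        request = explicit_user_request.lower()
        return "update" in request or "change" in request
    return False  # unknown tools are denied by default

# "Make sure the address is correct" authorizes a read, not a write.
assert authorize_tool("read_customer_data", "make sure the address is correct")
assert not authorize_tool("update_customer_data", "make sure the address is correct")
assert authorize_tool("update_customer_data", "please update my address")
```

The design choice worth noting: authorization keys off what the user said, not what the agent inferred, which re-introduces an explicit command at the one transition that matters.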
Boundary 3: Tool Call → System Execution
What people think: "Tool calls are logged and the agent's actions are auditable."
What actually happens: The tool receives parameters derived from the agent's interpretation. If the agent misinterpreted intent, the tool executes the wrong action with correct-looking parameters.
The gap: The tool can't distinguish between "the agent correctly understood intent" and "the agent misinterpreted context as a command." From the tool's perspective, it received a valid, authorized call.
Example: An agent calls send_email(to="[email protected]", body="...") after interpreting a retrieved document that mentioned "inform customer." The email function has no way to validate whether the agent's interpretation of "who to inform" and "what to say" was correct. It just executes.
Trust lives: Where systems assume that correctly-formatted tool calls represent correctly-understood intent.
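A control at this boundary can require provenance: a side-effecting tool call must carry the exact source text the agent interpreted, or it is queued for human review instead of executing. The wrapper below is a sketch under that assumption; the class and field names are hypothetical, not a real framework API.

```python
from dataclasses import dataclass
from typing import Optional

# Illustrative wrapper: a tool call is only executable if the interpretation
# that produced it is attached and auditable.
@dataclass
class ToolCall:
    tool: str
    params: dict
    derived_from: Optional[str]  # the exact source text the agent interpreted

def execute(call: ToolCall, side_effecting: set) -> str:
    if call.tool in side_effecting and not call.derived_from:
        # A correctly-formatted call is not evidence of correctly-understood
        # intent; without provenance, a human reviews it first.
        return "queued_for_review"
    return "executed"

# An email call with no traceable source text gets held.
call = ToolCall("send_email", {"to": "[email protected]"}, derived_from=None)
assert execute(call, side_effecting={"send_email"}) == "queued_for_review"
```

Provenance doesn't prove the interpretation was right, but it makes "why did the agent think this was a command?" answerable from the audit log.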
Boundary 4: Execution → Side Effects
What people think: "We can roll back agent actions if something goes wrong."
What actually happens: Many agent actions have side effects that can't be reversed: emails sent, data shared externally, transactions executed, decisions logged in external systems.
The gap: Traditional software has rollback mechanisms because commands are explicit and deterministic. Agent actions are based on interpretation. Even if you reverse the action, you can't reverse the fact that the agent misunderstood context and acted on it.
Example: A fraud detection agent releases a transaction based on misinterpreted context. Even if you flag the release and re-hold the transaction, the funds already moved, the customer was notified, and compliance logs show the release happened. The interpretation error has consequences that extend beyond the system's control.
Trust lives: At the edges of your system where actions become irreversible.
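One practical consequence: classify every tool by reversibility and route irreversible ones through human approval regardless of the auto-execution policy. A minimal sketch, with illustrative tool names:

```python
# Illustrative dispatcher: reversibility, not risk score, decides whether a
# tool may ever auto-execute. Tool names are hypothetical.
REVERSIBLE = {"hold_transaction", "add_case_note"}
IRREVERSIBLE = {"release_transaction", "send_email", "share_data_externally"}

def dispatch(tool: str, auto_execute: bool) -> str:
    if tool in IRREVERSIBLE:
        # Reversing the action later cannot reverse the misinterpretation:
        # funds move, disclosures happen, compliance records are written.
        return "require_human_approval"
    return "auto_execute" if auto_execute else "require_human_approval"

# Even with auto-execution enabled, a release still needs a human.
assert dispatch("release_transaction", auto_execute=True) == "require_human_approval"
assert dispatch("hold_transaction", auto_execute=True) == "auto_execute"
```

Under this policy, the fraud agent in the story above could never have released the transaction autonomously, whatever the case note said.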
Why Security Reviews Keep Missing This
Traditional security reviews are designed to find vulnerabilities in deterministic systems. Agents are probabilistic interpreters.
Traditional approach:
- Map attack surfaces (where external input enters)
- Validate input (ensure it matches expected format)
- Enforce access control (authenticate before execution)
- Test for injection (can malicious input bypass validation?)
Why this fails for agents:
The agent is designed to accept unstructured input and derive commands from it. "Input validation" and "interpret intent from natural language" are fundamentally opposed goals.
When I review distributed systems in banking, I map trust boundaries by asking: "Where does untrusted data enter the system, and what validates it before privileged operations?"
For agents, that question has no clean answer. The agent retrieves data from dozens of sources: user input, databases, documents, API responses, previous tool outputs. It synthesizes all of it into interpreted intent. There's no single validation point because the entire system is interpretation.
Most security reviews focus on:
- Can the agent access sensitive data? (Yes, that's the feature)
- Can the agent call privileged tools? (Yes, that's the feature)
- Are agent actions logged? (Yes, but logging doesn't prevent misinterpretation)
What reviews miss:
- Where does context stop being data and start being instructions?
- Who validates that interpreted intent matches actual user intent?
- What prevents context poisoning through legitimate data sources?
- How do you audit why an agent interpreted something as a command?
The disconnect happens because "agent" sounds like an application feature, not a security architecture change. Teams treat it as a UX improvement: better than forms, more natural than APIs.
But agents aren't interfaces. They're privileged interpreters with system-level access, making autonomous decisions about which commands to execute based on probabilistic understanding of unstructured text.
That's not a UX layer. That's a new control plane.
And most organizations have no security model for it.
How to Actually Map Agent Trust Boundaries
If you're deploying agents, or reviewing someone else's deployment, here's the framework:
Question 1: Where does text become a command in your system?
Not just user input. Everywhere. Retrieved documents. Database records. API responses. System messages. Error logs.
Map every source of text the agent uses for context. Then ask: if adversarial content appeared in this source, could the agent interpret it as instructions?
Trust lives with whoever controls the context the agent retrieves, not just who controls the prompt.
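The mapping exercise can be made concrete as an explicit inventory: every text source the agent reads, with a trust label per source. The sketch below uses hypothetical source names; the key idea is that only trusted sources may ever contribute instructions, while everything else is data.

```python
# Illustrative inventory: every text source the agent consumes, labeled.
# Source names and labels are hypothetical examples.
CONTEXT_SOURCES = {
    "system_prompt":  "trusted",
    "user_query":     "untrusted",
    "case_notes":     "untrusted",      # human-written, but stale and situational
    "transaction_db": "semi-trusted",
    "external_api":   "untrusted",
}

def instruction_capable(source: str) -> bool:
    # Only trusted sources may contribute instructions; unknown sources never can.
    return CONTEXT_SOURCES.get(source) == "trusted"

# The months-old case note is data, never a command.
assert not instruction_capable("case_notes")
assert instruction_capable("system_prompt")
```

The inventory itself is the deliverable: if a source isn't in the table, the agent shouldn't be reading it.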
Question 2: Who validates that interpreted intent matches actual user intent?
Traditional systems have explicit commands. Agents infer commands from interpretation.
Identify the point where interpretation determines action. Is there a validation layer? Or does interpreted intent flow directly to tool execution?
If the agent decides "this context means I should send an email," what checks that decision before the email is sent?
Trust lives where interpretation becomes execution, not where tools validate parameters.
Question 3: What happens when the agent misinterprets context?
Model the failure case: the agent interprets a historical note, a corrupted document, or an ambiguous phrase as an instruction to take privileged action.
Can you detect the misinterpretation? Can you prevent the action? Can you reverse the side effects?
If the answer is "we'd find out in the audit log after the action happened," you have an uncontrolled trust boundary.
Trust lives where reversibility ends. Agent actions often can't be undone.
Question 4: Where are your kill switches when interpretation goes wrong?
Agents operate at machine speed. By the time a human notices a problem, the agent may have executed dozens of actions based on misinterpreted context.
Identify circuit breakers: confidence thresholds that pause execution, anomaly detection that flags unexpected tool usage, rate limits that slow down bulk actions.
Trust lives in the controls that stop autonomous execution, not in the agent's reasoning process.
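A circuit breaker of this kind can be very small. The sketch below pauses all autonomous execution when tool calls exceed a rate window or interpretation confidence drops below a floor; the thresholds and class name are illustrative assumptions, and a real breaker would also need windowing, reset, and alerting.

```python
# Illustrative circuit breaker: once tripped, every autonomous action is
# paused until a human resets it. Thresholds are hypothetical.
class CircuitBreaker:
    def __init__(self, max_calls_per_window: int, min_confidence: float):
        self.max_calls = max_calls_per_window
        self.min_confidence = min_confidence
        self.calls_in_window = 0
        self.tripped = False

    def allow(self, confidence: float) -> bool:
        if self.tripped:
            return False
        self.calls_in_window += 1
        if self.calls_in_window > self.max_calls or confidence < self.min_confidence:
            self.tripped = True  # halt autonomous actions until a human resets
            return False
        return True

breaker = CircuitBreaker(max_calls_per_window=3, min_confidence=0.8)
assert breaker.allow(0.95)       # normal operation
assert not breaker.allow(0.50)   # low confidence trips the breaker
assert not breaker.allow(0.99)   # once tripped, everything pauses
```

The breaker embodies the point of this section: trust lives in the control that stops execution, not in the quality of the agent's reasoning.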
These aren't theoretical questions. Every agent deployment has answers to them. Most organizations just haven't asked yet.
Why This Actually Matters
Systems fail when trust assumptions are invisible. For agents, the trust assumption is: "the agent correctly interprets intent from all available context."
That assumption breaks in predictable ways:
Context poisoning becomes the new injection attack
Traditional injection: attacker puts malicious code in user input.
Agent context poisoning: attacker puts interpretable instructions in legitimate data sources: case notes, emails, support tickets, documentation.
The agent retrieves it as context. Interprets it as instruction. Executes privileged actions. No prompt injection needed.
Speed eliminates human oversight
Agents were sold as "assistants" that would help humans make better decisions faster. Then organizations realized agents could make decisions autonomously and enabled auto-execution to reduce workload.
The "human in the loop" became "human monitoring the audit log after actions already executed."
Regulatory frameworks like the EU AI Act and NIST's AI Risk Management Framework recognize this accountability gap, requiring human oversight for high-risk AI systems. But most agent deployments lack adequate controls to ensure meaningful human review before irreversible actions.
Intent is ambiguous; consequences are permanent
When an agent misinterprets context and sends customer data to the wrong recipient, you can't undo the disclosure. When it releases a fraudulent transaction, you can't reclaim the funds. When it escalates a case incorrectly, you can't erase the compliance record.
The probabilistic nature of interpretation creates deterministic consequences.
When you map trust boundaries explicitly:
- You identify where context can become commands
- You add validation at interpretation → execution transitions
- You design for failure modes where agents misunderstand intent
- You build circuit breakers that prevent cascade failures from misinterpretation
This isn't anti-AI. It's treating agents as what they actually are: privileged middleware that converts unstructured text into system commands.
That requires a different security model than chatbots, traditional applications, or even autonomous systems with deterministic logic.
The "But Isn't the Model the Security Boundary?" Question
You might ask: shouldn't we focus on making the model itself more secure and reliable?
Yes. Model safety matters.
But model safety is about what the model can produce, not what the interpreter does with it.
A perfectly safe model can still misinterpret context. A well-aligned model can still treat historical data as current instructions. A model with strong guardrails can still derive commands from ambiguous natural language that a human would recognize as situational, not universal.
The security boundary isn't inside the model. It's at the transitions where:
- Text retrieved from data sources becomes interpretable context
- Interpreted intent becomes tool selection
- Tool calls become system execution
- Execution becomes irreversible side effects
You can have the most secure, aligned, well-tested model in the world and still have an unsafe agent if those transitions lack controls.
The question isn't "is the model safe?" It's "is the interpreter trustworthy as a privileged middleware component?"
Most organizations haven't asked that question because they don't think of agents as middleware. They think of them as intelligent assistants.
But from a security architecture perspective, an agent is middleware that:
- Accepts input from untrusted sources (users, documents, external systems)
- Derives commands from probabilistic interpretation
- Invokes privileged system functions
- Produces side effects that can't be reversed
That's not an assistant. That's a control plane that needs the same security rigor you'd apply to any component with system-level privileges.
The Reality Check
Agents aren't chatbots. They're privileged interpreters.
And trust boundaries exist at four critical transitions:
- Input → Interpretation: Where unstructured text from multiple sources becomes understood intent
- Interpretation → Tool Selection: Where derived intent determines which privileged functions get invoked
- Tool Call → Execution: Where interpreted commands become system actions
- Execution → Side Effects: Where actions produce irreversible consequences
Traditional security models don't address these boundaries because they assume explicit commands, not probabilistic interpretation.
Most security reviews ask:
- "What can the agent access?"
- "What tools can it call?"
- "Are actions logged?"
Those are necessary questions. But they're not sufficient.
The questions that reveal hidden trust assumptions are:
- "Where does context become command?"
- "Who validates interpreted intent?"
- "What controls the transition from interpretation to execution?"
- "What stops misinterpretation from causing irreversible damage?"
If you're deploying agents without explicit controls at those four boundaries, you're treating the interpreter as a UX feature instead of a privileged control plane.
And the most dangerous trust assumption is the one that goes unexamined: that interpretation is always safe because the model is aligned and the tools are access-controlled.
It's not. And they're not enough.
Because when data becomes commands through interpretation, every context source becomes an attack surface. Speed eliminates the human oversight everyone assumed would catch the mistakes.
This control plane is invisible in most threat models. Until a case note from six months ago executes a transaction release.