Prompt Injection Attacks on AI Agents: Risks and Defenses

Emily Winks

Data Governance Expert

Updated:05/04/2026

Published:05/04/2026

26 min read

Watch Context Agents Live Get Context Layer Ebook

Key takeaways

OWASP ranks prompt injection as the #1 LLM vulnerability — the AI equivalent of SQL injection for agentic systems.
Indirect injection hides malicious instructions in documents, emails, and web pages that agents retrieve and trust.
Governed context — controlling what agents retrieve and treat as truth — shrinks the attack surface at the foundation.

What are prompt injection attacks on AI agents?

A prompt injection attack manipulates an AI agent by embedding malicious instructions into its input or context, overriding its original goals. Because LLMs cannot structurally separate instructions from data, attackers can hijack agent behavior through user messages, retrieved documents, or poisoned memory stores — causing agents to exfiltrate data, execute unauthorized commands, or produce harmful outputs.

Attack types

Direct injection — malicious instructions embedded directly in user input
Indirect injection — malicious instructions hidden in retrieved documents or data
Memory poisoning — corrupting agent long-term memory or context stores

Are your AI agents stuck in POC?

Assess Context Maturity

What is prompt injection?

Prompt injection is a cyberattack technique that manipulates an AI system by embedding malicious instructions inside its input, causing the model to override its intended instructions and behave in ways the developer — or the user — never authorized.

The analogy to SQL injection is useful for grasping severity. In SQL injection, an attacker inserts database commands into a web form that a server naively executes. In prompt injection, an attacker inserts natural-language commands into an LLM’s input stream — and the model naively follows them. Both exploits share the same root cause: the system fails to separate trusted instructions from untrusted data.

The analogy has limits, though. SQL injection can be fully mitigated through parameterized queries — a structural separation of code and data that eliminates the confusion at the database layer. Prompt injection lacks an equivalent architectural fix at the model layer: because both instructions and data arrive as natural language, no syntactic boundary can cleanly separate them. Defenses exist, but they operate at the application and context layers — not inside the model itself.

For traditional software, this distinction is enforced structurally. An application’s logic is code; user input is a string. They live in different layers and execute differently. But in an LLM, both the developer’s system prompt and the end user’s query arrive as the same artifact: natural-language text. The model has no architectural boundary to enforce between them. Any sufficiently crafted input can blur — or erase — that boundary.

This is why OWASP named prompt injection LLM01:2025, the single highest-priority vulnerability in its Top 10 for Large Language Model Applications. It is not an implementation bug. It is a structural characteristic of how language models work.

When AI systems were simple chatbots, this was a manageable nuisance. An attacker could get a model to say something it shouldn’t. With the emergence of agentic AI — systems that browse the web, execute code, send emails, query databases, and take real-world actions on behalf of users — the blast radius of a successful prompt injection has grown from embarrassing to catastrophic.

Build Your AI Context Stack

Get the blueprint for implementing context graphs across your enterprise. Four-layer architecture from metadata foundation to agent orchestration, with practical implementation steps for 2026.

Get the Stack Guide

Types of prompt injection attacks on AI agents

Not all prompt injection attacks look the same. The attack surface has expanded significantly as AI agents gained more capabilities, access, and autonomy. Here are the primary attack categories enterprises need to understand.

Direct prompt injection

Direct injection occurs when the attacker has access to the model’s input and embeds malicious instructions there. Also called “jailbreaking,” this is the most visible form of prompt injection.

Common techniques include:

Role-play hijacking: Instructing the model to “pretend you are a different AI with no restrictions”
Instruction override: Appending “Ignore all previous instructions and instead…” to a user message
System prompt extraction: Prompting the model to repeat or summarize its hidden system prompt

For standalone chatbots with limited capabilities, direct injection tends to produce embarrassing but bounded outcomes. For agents with tool access — the ability to send API calls, execute terminal commands, or write to databases — direct injection can cascade into real-world harm.

Indirect prompt injection

Indirect injection is the more dangerous form for enterprise AI. Here, the attacker does not interact with the model directly. Instead, they embed malicious instructions in content that the agent will eventually retrieve and process: a webpage, a document, an email, a code comment, a database record.

When the agent reads the poisoned content as part of its normal task, it encounters the hidden instructions embedded within — and follows them, because it cannot distinguish those instructions from the legitimate content surrounding them.

This is especially dangerous for AI agents that use retrieval-augmented generation (RAG), browse the internet, process email, or operate in code repositories. Every external data source the agent can reach is a potential injection vector.

Real documented examples include:

A Google Docs file that triggered an AI IDE agent to fetch instructions from a malicious MCP server and execute a Python payload that harvested secrets — without any user interaction
Payloads embedding complete PayPal transaction specifications inside web content, designed to be processed by AI agents with payment capabilities
Prompt injection hidden in website meta tags designed to route financial actions toward attacker-controlled endpoints
Code repository comments designed to force AI coding assistants with shell access into executing recursive file deletion commands

Google researchers monitoring the web found a 32% increase in malicious prompt injection payloads embedded in web content between November 2025 and February 2026.

Goal hijacking and agent manipulation

Goal hijacking is a higher-order attack targeting multi-step, autonomous agents. Rather than extracting information or triggering a single harmful action, the attacker attempts to redirect the agent’s entire objective.

In a multi-agent pipeline, an attacker who successfully hijacks one agent’s goal may be able to propagate the attack downstream — instructing subsequent agents, poisoning shared memory, or manipulating orchestrator decisions. The impact scales with the agent’s autonomy and the depth of its tool access.

Jailbreaking and safety bypass

Jailbreaking targets the model’s safety training directly. Attackers craft prompts designed to elicit outputs the model was trained to refuse: harmful content, dangerous instructions, or disclosure of confidential system prompts.

While model providers invest heavily in safety training, jailbreaks emerge regularly because safety alignment is an ongoing arms race, not a solved problem. More importantly for enterprises, jailbreaks that bypass safety training are fundamentally different from — and should not be confused with — access control. A model that won’t “say” something harmful is different from a system that enforces who can retrieve which data.

Data poisoning and memory attacks

Beyond manipulating what an agent does in a single session, attackers can target the persistent context that agents rely on: their knowledge bases, RAG indexes, long-term memory stores, and fine-tuning datasets.

RAG poisoning

Retrieval-augmented generation systems retrieve relevant documents from a vector database before generating a response. This architecture is powerful — it gives agents access to up-to-date, domain-specific knowledge. It also creates a new attack surface.

Research presented at USENIX Security 2025 demonstrated PoisonedRAG, the first systematic knowledge database corruption attack against RAG systems. By inserting a small number of carefully crafted documents into the retrieval corpus, attackers can cause the RAG system to reliably return attacker-chosen answers for specific queries — without touching the underlying model.

Researchers have since demonstrated related techniques for corrupting vector databases by injecting poisoned embeddings at specific points in the semantic space. The embeddings appear legitimate to the retrieval system but carry semantics designed to override correct answers.

The implication: even a model that behaves perfectly on trusted data can produce manipulated outputs if its retrieval corpus has been contaminated.

Memory poisoning

Long-horizon agents increasingly use persistent memory stores — structured databases of facts, preferences, past decisions, and contextual knowledge — to maintain continuity across sessions. This memory is a high-value target.

A successful memory poisoning attack modifies what the agent believes to be true about the world, about users, or about its own prior decisions. Because the agent trusts its memory as authoritative context, these modifications can redirect behavior across many future sessions.

A subtler variant is decision drift: the slow, cumulative corruption of an agent’s behavior through repeated exposure to poisoned context. The agent doesn’t behave incorrectly in any single obvious step — it gradually shifts from safe operations to unsafe ones, without triggering immediate alarms.

Fine-tuning data poisoning

Organizations that fine-tune models on internal data face a third attack surface: the training data itself. Researchers have demonstrated that hidden instructions embedded in code comments on GitHub can survive the fine-tuning process and become backdoors in the resulting model. DeepSeek’s DeepThink-R1 was found to have learned a backdoor through contaminated training repositories.

For enterprises using fine-tuning to customize models for internal use, the provenance and governance of training data is a direct security concern.

Real-world examples and attack scenarios

The threat is not theoretical. Prompt injection attacks are occurring in production systems today.

Devin AI (agentic coding assistant): Security researchers found that Devin AI was entirely defenseless against prompt injection. By crafting specific prompts, attackers could instruct the agent to expose server ports to the internet, leak access tokens to external endpoints, and install command-and-control malware — all within the scope of what appeared to be a routine coding task.

AI IDE zero-click attack: Researchers demonstrated an attack in which a Google Docs file — appearing entirely innocuous — triggered an AI coding agent inside an IDE to contact a malicious MCP server. The agent retrieved attacker-authored instructions, executed a Python payload, and harvested developer secrets. The victim took no action beyond opening the document.

AI ad review bypass (December 2025): Attackers embedded indirect prompt injection payloads in product listings submitted to an AI-based ad moderation system. The injected instructions caused the agent to approve advertisements it was designed to reject — including fraudulent content that would harm consumers.

Financial fraud via AI agents: Researchers documented injection payloads embedded in publicly accessible web pages that included fully specified payment transaction details (recipient, amount, description) with step-by-step instructions for AI agents with payment integration to execute the transaction without user confirmation.

Scale in practice: In a public red-teaming competition against deployed AI agents, researchers launched 1.8 million prompt injection attempts. More than 60,000 succeeded in causing policy violations — a success rate that would be unacceptable for any other security control.

These examples share a pattern: the attack works because the agent trusts external content. The moment an agent reads from the web, processes email, retrieves documents, or executes tools with broad permissions, every external source it interacts with becomes a potential injection vector. The success rate of individual attacks varies — the red-teaming competition cited above saw roughly 3.3% of 1.8 million attempts succeed — but at the scale enterprises operate, even low-probability attacks produce material risk.

Inside Atlan AI Labs & The 5x Accuracy Factor

Learn how context engineering drove 5x AI accuracy in real customer systems. Explore real experiments, quantifiable results, and a repeatable playbook for closing the gap between AI demos and production-ready systems.

Download E-Book

Why traditional security tools aren’t enough

Many enterprises have approached AI security by extending existing tools: web application firewalls, input sanitization, perimeter monitoring, DLP solutions. These controls are necessary, but insufficient for prompt injection on agentic AI.

The fundamental problem: traditional security tools govern at the network, application, or data-transport layer. Prompt injection operates at the semantic layer — it is a natural-language attack against a natural-language processing system. A packet filter or WAF has no mechanism to evaluate whether a retrieved document contains hidden instructions for an AI agent.

Why model-level controls are bypassable: Safety training and prompt filtering reduce the probability of successful attacks, but they are not architectural controls. They can be circumvented through creative phrasing, context manipulation, or indirect instruction delivery. Prompt injection exploits are a cat-and-mouse game: every safety training update creates an incentive to find new bypass patterns.

The blast radius problem: Traditional software has well-defined boundaries. An application does what its code specifies; a SQL injection attack can reach the database but not the email system. Agentic AI intentionally has broad capabilities. An agent that can browse the web, query databases, send email, and execute code has a blast radius that spans the enterprise. A single successful injection can trigger a cascade of authorized-but-attacker-directed actions.

The cost of inaction: According to IBM’s 2025 Cost of a Data Breach Report, breaches involving AI systems where access controls were absent averaged $5.72 million. Organizations compromised through shadow AI — AI tools deployed without governance — faced an average cost of $4.63 million, $670,000 above the baseline. IBM’s research also found that 97% of organizations that experienced AI model or application breaches reported lacking proper AI access controls at the time of the incident.

The data is unambiguous: governance gaps in AI systems translate directly into financial exposure.

Defense strategies: a layered approach

No single control eliminates prompt injection risk. The only viable approach is defense in depth — multiple overlapping layers that each reduce probability of success and limit blast radius. Here are the most effective controls the security research community has converged on.

Input validation and content filtering

Every piece of external content that reaches an AI agent — user messages, retrieved documents, API responses, tool outputs — should be validated before the agent processes it. This includes:

Pattern-based detection for known injection signatures
LLM-based classifier models trained to identify injection attempts
Source allowlisting: restricting which external domains or data sources the agent can retrieve from
Content-type validation: enforcing that documents conform to expected formats and structures

Critical limitation: Input validation alone is not a solution. The LLM cannot structurally separate instructions from data — it is a semantic system, and sophisticated injections will find paths that pattern matchers miss. LLM-based classifiers used for injection detection also carry false positive costs: they can flag legitimate inputs and create operational friction. Treat input validation as one layer, not the complete defense.

Output sanitization and action verification

Rather than only filtering what goes in, validate what comes out — specifically, what the agent plans to do. The “guardian pattern” implements a separate, smaller validation model that reviews the primary agent’s planned actions before execution:

Does this action align with the user’s stated goal?
Does it involve file system access, network calls, or data exports that appear unrelated to the task?
Does it require permissions beyond what this agent should have?

If the answer to any of these is suspicious, the action is blocked or escalated for human review before execution.

Privilege separation and least-privilege access

The principle of least privilege — familiar from traditional access control — applies directly to AI agents. An agent should only have access to the tools, data, and permissions it needs for its specific task.

Practical implementations include:

Per-task permission scopes: each agent invocation scoped to the minimum permissions required
Per-tool privilege profiles: defining exactly what each tool can access, what rate limits apply, and what egress destinations are permitted
No root-level execution: agents should never run with administrative or superuser permissions
Sandboxed execution environments: tool execution happens in isolated containers with restricted network and file system access

When an agent’s permissions are minimized, a successful injection attack produces a smaller blast radius. The agent cannot exfiltrate data it cannot access, execute tools it cannot invoke, or escalate privileges it was never granted.

Sandboxing and environment isolation

Sandboxing contains the damage from a successful attack. Key controls:

Network egress filtering: limit which external endpoints the agent can reach
File system isolation: confine agent file access to specific directories
Process isolation: prevent agents from spawning child processes or accessing system resources
Ephemeral environments: spin up fresh execution environments per task so compromised state does not persist

Structural separation of trusted and untrusted data

The CaMeL framework (developed in 2025) illustrates a promising architectural approach: explicitly separating control flow from data flow at the system level.

A privileged LLM (P-LLM) processes only trusted user queries and generates a structured plan. A separate, quarantined LLM (Q-LLM) handles all untrusted external data — emails, web content, retrieved documents. The Q-LLM can process untrusted content but cannot modify the plan generated by the P-LLM.

This architectural separation does not eliminate injection risk, but it significantly reduces the attack surface by preventing untrusted data from reaching the instruction-processing path.

Human-in-the-loop for high-impact actions

For actions with significant real-world consequences — sending emails, executing financial transactions, modifying production databases, deploying code — require human confirmation before execution. This is the most structurally sound defense against high-impact injection attacks, because it interposes a break in the automated attack chain that cannot be bypassed by manipulating the model alone. Human judgment is not infallible — social engineering remains a risk — but it removes the fully automated escalation path that makes agentic injection attacks uniquely dangerous.

The trade-off is reduced automation. Design the approval gate to be proportional to the action’s risk: low-risk read operations can execute automatically; high-risk write operations require confirmation.

The context-layer defense: governed metadata as protection

Technical controls like input validation and sandboxing address the symptoms of prompt injection. The context layer approach addresses the root cause: agents operating without governed, authoritative, access-controlled context.

Most enterprises are over-invested in “red teaming prompts” — trying to anticipate and filter attacker inputs — and under-invested in governing what agents can know, retrieve, and treat as canonical truth.

The distinction matters. An attacker who successfully injects malicious instructions into an agent’s context is dangerous precisely because the agent has access to sensitive data, powerful tools, and no authoritative source of truth to validate against. Govern the context, and the attack surface shrinks at the foundation.

What context-layer governance provides

Authoritative substrate: Atlan’s context layer gives agents a governed source of truth — canonical metric definitions, validated data lineage, certified data quality, and active governance policies. When an agent reasons from this authoritative substrate, malicious instructions embedded in retrieved content must contend with the fact that the agent has a certified definition of what “revenue” means, which data assets are classified as sensitive, and which actions the current policy permits.

An attacker can inject the instruction “ignore your data classification rules.” But if those rules are enforced at retrieval time by the context layer — not just stated in the system prompt — they cannot be overridden by a downstream instruction.

Retrieval-time access control: Every agent query against Atlan’s context layer is evaluated at retrieval time against the requesting agent’s role, the data asset’s classification, and the current governance policy. This is zero-trust applied to the context layer — not just identity, but data access, evaluated continuously at every call.

The consequence: an agent operating under a compromised prompt cannot retrieve data it is not authorized to see. The injection may succeed in redirecting the agent’s goals, but the data it would need to act on those goals is not accessible.

Provenance and auditability: Every memory entry — every policy, metric definition, entity attribute — has documented provenance, ownership, and validation status. When a poisoning attack attempts to introduce false context, the provenance trail exposes it: this assertion lacks a verified source, this metric definition conflicts with the certified version, this data asset shows no legitimate update event.

Metadata-only operation: Atlan AI operates on metadata, not raw data. This is a structural blast-radius limiter. Even in a scenario where an agent’s prompt is fully compromised, the agent’s access to sensitive customer records, financial transactions, or operational data is mediated through metadata interfaces — not direct data-plane access.

Decision traces and lineage: Every agent output is linked back to the exact upstream assets and governance decisions that produced it. When a poisoned context entry enters the pipeline and influences an agent decision, the decision trace exposes where it entered — enabling rapid detection, remediation, and regulatory audit.

The inside-out defense

Traditional AI security vendors protect agents from the outside in: monitoring agent behavior, filtering inputs, sandboxing execution, detecting anomalies after the fact. These controls are necessary, and Atlan is designed to complement them — not replace them.

Where Atlan adds a layer those tools cannot: it governs the context, metadata, and memory that drive every agent decision. Security tools contain and detect attacks; context layer governance reduces what a successful attack can accomplish by removing the conditions that make it effective — broad, ungoverned access and unverified context that an injected agent can exploit.

For enterprises deploying AI agents at scale, the combination is what makes production AI viable: perimeter security to reduce attack surface and detect intrusions; context-layer governance to enforce least privilege, maintain authoritative context, and limit blast radius.

See how this works in practice: AI Agent Memory Governance | Context Engineering for AI Governance | AI Security: Enterprise Framework

Enterprise AI security frameworks and compliance requirements

NIST AI Risk Management Framework (AI RMF)

NIST’s AI RMF provides the primary enterprise governance structure for managing AI risk in the US. For prompt injection specifically:

The framework mandates threat modeling for semantic attack vectors — explicitly recognizing prompt injection as a category of AI-specific risk
NIST AI 600-1 (Adversarial Machine Learning) provides technical guidance on prompt injection, including detection metrics and response requirements
NIST IR 8596 (Agentic AI Profile) addresses the specific risk characteristics of autonomous agents, including the expanded blast radius of agentic systems and the need for scope-limited permissions

A critical limitation noted in NIST IR 8596 (the Agentic AI Profile): the RMF’s risk contextualization machinery currently stops at the model boundary. Organizations using the RMF to govern agentic deployments cannot use it alone to reason about what happens when an agent with code execution capability encounters a prompt injection attack through a tool output. Supplementary controls at the context and data layers are required to address this gap.

EU AI Act security requirements

The EU AI Act (Regulation (EU) 2024/1689) established mandatory security requirements for high-risk AI systems, with phased enforcement:

February 2025: Prohibitions on unacceptable-risk AI systems took effect
August 2025: Obligations for general-purpose AI model providers activated
August 2026: Full obligations for all high-risk AI system operators

For high-risk AI systems, the Act requires:

Appropriate levels of accuracy, robustness, and cybersecurity — explicitly including resistance to adversarial attacks
Continuous risk management processes throughout the entire system lifecycle
Data governance covering training data quality, provenance, and contamination controls
Comprehensive technical documentation enabling post-incident audit
Human oversight mechanisms sufficient to detect and correct unexpected agent behavior

Prompt injection and data poisoning are directly addressed under the cybersecurity robustness requirements. Organizations deploying high-risk AI systems that lack controls for these attack vectors face both regulatory exposure and the operational liability that comes with governed AI failures.

ISO 42001 and emerging standards

ISO 42001, the AI management systems standard, now includes specific controls for prompt injection prevention and detection as part of its risk management requirements. Organizations seeking certification must demonstrate active controls, not just policy documentation.

The compliance picture is consistent across frameworks: regulators and standards bodies have recognized prompt injection as a first-class enterprise risk, and they expect documented controls at the model, application, and context layers.

Real stories from real customers: Governing AI context to reduce risk

"We're excited to build the future of AI governance with Atlan. All of the work that we did to get to a shared language at Workday can be leveraged by AI via Atlan's MCP server…as part of Atlan's AI Labs, we're co-building the semantic layer that AI needs with new constructs, like context products."

— Joe DosSantos, VP of Enterprise Data & Analytics, Workday

Watch Now →

"AI initiatives require more context than ever. Atlan's metadata lakehouse is configurable, intuitive, and able to scale to hundreds of millions of assets. As we're doing this, we're making life easier for data scientists and speeding up innovation."

— Andrew Reiskind, Chief Data Officer, Mastercard

Watch Now →

The CIO's Guide to Context Graphs

Discover the key strategies that CIOs are using to implement context layers and scale AI.

Get the Guide

Building an enterprise-ready AI security posture

AI security is now primarily an architecture problem. The enterprises that will succeed with AI agents are not the ones with the most aggressive prompt filters — they are the ones that have built governed, auditable, least-privilege AI architectures where the blast radius of any compromise is structurally limited.

A practical framework for enterprise AI security against prompt injection:

At the model layer: Safety training, jailbreak resistance, and model-level prompt filtering reduce the probability of simple attacks. Choose models with demonstrated resistance to injection and update them regularly as bypass techniques evolve.

At the application layer: Input validation, output verification, the guardian pattern, sandboxed tool execution, and human confirmation for high-impact actions contain attacks that reach the application.

At the context layer: Governed metadata, retrieval-time access control, provenance tracking, and zero-trust agent authentication prevent attacks from producing their intended effects even when other layers are bypassed.

Across all layers: Behavioral monitoring, decision traces, and audit logs that integrate with SIEM/SOAR platforms enable rapid detection and incident response when attacks do occur.

No single layer is sufficient. The goal is an architecture where a successful prompt injection attack — which OWASP’s evidence suggests will occur — produces contained, detectable, recoverable damage rather than unrestricted compromise.

The data makes the investment case: IBM’s 2025 breach report found that organizations with comprehensive AI security controls save an average of $1.9 million per breach incident compared to those without such controls. The cost of ungoverned AI is not hypothetical — it is documented, quantified, and growing.

Book a Demo

SVG: AI agent attack surface and context-layer defense

Frequently asked questions

1. What makes prompt injection different from a traditional cyberattack?

Traditional attacks exploit bugs in code — buffer overflows, SQL injection, authentication bypasses. These are implementation flaws that can, in principle, be patched. Prompt injection exploits a structural characteristic of language models: they cannot distinguish between developer instructions and user-provided data, because both arrive as natural language text. This is not a bug that can be patched out of a model — it requires architectural controls at the application and context layers.

2. Can prompt injection be fully prevented?

No. Because the vulnerability is structural, it cannot be fully eliminated through any single control. The goal of enterprise AI security is not perfect prevention but controlled blast radius: ensuring that a successful injection cannot escalate into catastrophic compromise because access controls, context governance, and sandboxing limit what the injected instructions can actually accomplish.

3. What is the difference between direct and indirect prompt injection?

Direct injection occurs when the attacker interacts with the AI system directly — embedding malicious instructions in user input. Indirect injection occurs when the attacker embeds instructions in external content (a document, a webpage, an email) that the AI agent retrieves as part of a legitimate task. Indirect injection is generally more dangerous because it does not require the attacker to have any direct access to the AI system.

4. How does RAG increase prompt injection risk?

RAG systems retrieve external documents to supplement model responses. Every document in the retrieval corpus is a potential attack vector for indirect prompt injection. An attacker who can influence any document in the retrieval index — through web content, shared document repositories, or public data sources — can potentially inject instructions that the agent will process as legitimate context. Governing the retrieval corpus — provenance tracking, source allowlisting, content filtering — is essential for RAG security.

5. What is memory poisoning in AI agents?

Memory poisoning attacks target the persistent context stores that AI agents use to maintain continuity across sessions: facts databases, preference stores, decision histories. By modifying what the agent “remembers” to be true, an attacker can redirect the agent’s behavior across many future sessions. Because agents treat their memory as authoritative context, memory poisoning can be more durable than session-level injection attacks.

6. How does OWASP classify prompt injection?

OWASP designates prompt injection as LLM01:2025 — the highest-priority vulnerability in its Top 10 for Large Language Model Applications. The classification reflects community consensus that prompt injection is a fundamental architectural risk, not an implementation flaw, and that it is the most prevalent and impactful vulnerability in deployed LLM applications.

7. What compliance frameworks address prompt injection?

Three major frameworks explicitly address prompt injection: NIST AI RMF (including NIST AI 600-1 and the Agentic AI Profile in NIST IR 8596), the EU AI Act’s cybersecurity robustness requirements for high-risk AI systems, and ISO 42001. All three require documented controls at the model, application, and context layers. The EU AI Act’s full high-risk AI system obligations take effect in August 2026.

8. Why is context-layer governance more effective than prompt filtering?

Prompt filtering operates at the model layer — it tries to detect and block malicious inputs before they reach the model. Sophisticated injection attacks can bypass prompt filters through indirect delivery, creative phrasing, or multi-step manipulation. Context-layer governance operates at the data access layer: it controls what the agent can retrieve and treat as truth, independently of what the model was instructed to do. Even a successfully injected agent cannot exfiltrate data it cannot access or override policies enforced at retrieval time.

Sources

OWASP Gen AI Security Project — LLM01:2025 Prompt Injection: https://genai.owasp.org/llmrisk/llm01-prompt-injection/
OWASP AI Agent Security Cheat Sheet: https://cheatsheetseries.owasp.org/cheatsheets/AI_Agent_Security_Cheat_Sheet.html
NIST AI 600-1 — Adversarial Machine Learning: A Taxonomy and Terminology of Attacks and Mitigations: https://nvlpubs.nist.gov/nistpubs/ai/NIST.AI.600-1.pdf
IBM Cost of a Data Breach Report 2025: https://www.ibm.com/reports/data-breach
PoisonedRAG: Knowledge Corruption Attacks to Retrieval-Augmented Generation (USENIX Security 2025): https://www.usenix.org/system/files/usenixsecurity25-zou-poisonedrag.pdf
Palo Alto Unit 42 — Fooling AI Agents: Web-Based Indirect Prompt Injection Observed in the Wild: https://unit42.paloaltonetworks.com/ai-agent-prompt-injection/
Google Security Blog — AI Threats in the Wild: The Current State of Prompt Injections on the Web: https://blog.google/security/prompt-injections-web/
MDPI Information — Prompt Injection Attacks in LLMs and AI Agent Systems: A Comprehensive Review: https://www.mdpi.com/2078-2489/17/1/54
EU AI Act — High-Level Summary: https://artificialintelligenceact.eu/high-level-summary/
CrowdStrike — Indirect Prompt Injection Attacks: Hidden AI Risks: https://www.crowdstrike.com/en-us/blog/indirect-prompt-injection-attacks-hidden-ai-risks/

Share this article

Atlan is the next-generation platform for data and AI governance. It is a control plane that stitches together a business's disparate data infrastructure, cataloging and enriching data with business context and security.

Book a Demo See Context Studio Live