What is an AI gateway?
An AI gateway — sometimes called an LLM gateway — is a middleware layer that sits between your applications or AI agents and the LLM providers they call. Think of it as a traffic controller for model API requests: it decides which model to route each request to, enforces rate limits and spending budgets, manages API keys so applications never touch raw provider credentials, logs every call for observability, caches semantically equivalent prompts to reduce costs, and applies guardrails to filter harmful or policy-violating content before it reaches the model.
The terms “AI gateway” and “LLM gateway” are used interchangeably in the industry, though “AI gateway” has become the broader label as these tools have expanded beyond pure language model routing to cover vision models, embedding APIs, and agentic workflow orchestration.
Why AI gateways emerged
Before AI gateways, teams called each LLM provider directly: OpenAI here, Anthropic there, AWS Bedrock for regulated workloads. Each team managed its own API keys, set its own budgets, wrote its own retry logic, and maintained its own logging. This created fragmented visibility, unpredictable costs, and no consistent governance across model usage.
Industry analysts tracking this space project that by 2028, 70% of software engineering teams building multimodel applications will use AI gateways — up from roughly 25% in 2025, based on growth trend analysis reported by TrueFoundry and others covering Gartner research. The driver is straightforward: enterprise LLM API spend reached $12.5 billion in 2025 (per Menlo Ventures), and 53% of AI teams report costs exceeding forecasts by 40% or more during scaling. Without centralized control, costs and risks accumulate fast.
How an AI gateway fits into the stack
Architecturally, an AI gateway operates as a reverse proxy for model APIs. Applications send requests to the gateway’s single endpoint — often an OpenAI-compatible API — and the gateway handles provider selection, auth, and policy enforcement transparently. This means a team can switch from GPT-4o to Claude 3.7 Sonnet to Mistral without changing a single line of application code.
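To make this concrete, here is a minimal sketch using the OpenAI Python SDK pointed at a gateway's OpenAI-compatible endpoint. The base URL, virtual key, and model alias are illustrative assumptions, not any specific product's configuration.

```python
from openai import OpenAI

# Point the standard OpenAI client at the gateway's OpenAI-compatible
# endpoint instead of api.openai.com. The gateway holds the real provider
# keys; the application only sees a scoped virtual key.
client = OpenAI(
    base_url="https://ai-gateway.internal.example.com/v1",  # hypothetical gateway URL
    api_key="vk-team-analytics-001",                         # hypothetical virtual key
)

# "chat-default" is an illustrative alias. Which provider and model it
# resolves to (GPT-4o, Claude, Mistral, ...) is configured on the gateway,
# so swapping models requires no application change.
response = client.chat.completions.create(
    model="chat-default",
    messages=[{"role": "user", "content": "Summarize last quarter's churn drivers."}],
)
print(response.choices[0].message.content)
```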
In 2026, with agentic AI in production across most large enterprises, AI gateways have evolved beyond simple request proxies. A single user request may trigger Claude Opus for complex reasoning, Haiku for fast classification, an MCP server for data retrieval, and a second model for output verification — all chained in one workflow. AI gateways now route and govern that entire chain.
Core capabilities of an AI gateway
Multi-model routing and fallback
The foundation of any AI gateway is a unified routing layer. All model providers — OpenAI, Anthropic, Google Vertex, AWS Bedrock, Mistral, Cohere, and dozens more — are accessible through a single, stable endpoint. The gateway maintains provider configurations centrally; applications interact with one API regardless of which model sits behind it.
Routing logic goes beyond simple forwarding. Gateways support:
- Load balancing across multiple deployments of the same model
- Latency-based routing that selects the fastest available provider at request time
- Cost-based routing that directs requests to cheaper models when quality requirements allow
- Fallback chains triggered automatically when a provider returns errors (429 rate-limit, 500 server error) — routing to an alternate provider with zero downtime and no manual intervention
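As a rough illustration of the fallback behavior described above, the following sketch tries providers in order and moves on when it sees a retriable error. Provider names, status codes, and the simulated call are placeholders rather than a real gateway implementation.

```python
import random
import time

# Illustrative fallback order; names are model aliases, not real endpoints.
FALLBACK_CHAIN = ["gpt-4o", "claude-sonnet", "mistral-large"]
RETRIABLE_STATUSES = {429, 500, 502, 503}

def call_provider(model: str, prompt: str) -> tuple[int, str]:
    """Stand-in for a real provider call; randomly simulates rate limits
    and server errors so the fallback path can be exercised."""
    status = random.choice([200, 200, 200, 429, 500])
    text = f"[{model}] response to: {prompt}" if status == 200 else ""
    return status, text

def complete_with_fallback(prompt: str) -> str:
    last_status = None
    for model in FALLBACK_CHAIN:
        status, text = call_provider(model, prompt)
        if status == 200:
            return text                      # success on this provider
        if status not in RETRIABLE_STATUSES:
            break                            # non-retriable error: stop the chain
        last_status = status
        time.sleep(0.1)                      # brief pause before the next provider
    raise RuntimeError(f"All providers failed (last status: {last_status})")

try:
    print(complete_with_fallback("Classify this support ticket."))
except RuntimeError as err:
    print(err)
```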
Rate limiting and token budget control
Traditional API gateways enforce rate limits on request counts. AI gateways enforce limits on token consumption — the actual billing unit for LLMs. This distinction matters: a single request can cost 100× more than another depending on prompt and response length.
Enterprise AI gateways typically support multi-tier budget enforcement:
- Per-key limits — each application or service account has its own token ceiling
- Per-team limits — budget pools assigned to business units or product teams
- Per-project limits — spending caps per use case or deployment
- Global limits — organization-wide hard ceilings to prevent unchecked spend
When a limit is reached, the gateway returns a 429 with configurable retry-after headers rather than letting requests fail unpredictably at the provider level.
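A minimal sketch of how multi-tier budget evaluation might work inside a gateway follows; the tier names, limits, and header fields are illustrative assumptions rather than a specific product's behavior.

```python
from dataclasses import dataclass

@dataclass
class BudgetTier:
    name: str            # e.g. "key", "team", "project", "global"
    used_tokens: int
    limit_tokens: int
    reset_after_s: int   # seconds until this budget window resets

def check_budgets(tiers: list[BudgetTier], requested_tokens: int):
    """Return (allowed, headers). Headers mimic a 429 + Retry-After response."""
    for tier in tiers:
        if tier.used_tokens + requested_tokens > tier.limit_tokens:
            return False, {
                "status": 429,
                "Retry-After": str(tier.reset_after_s),
                "X-Exceeded-Tier": tier.name,
            }
    return True, {"status": 200}

# Example: the per-team pool is nearly exhausted, so the request is rejected
# at the gateway instead of failing unpredictably at the provider.
tiers = [
    BudgetTier("key", used_tokens=10_000, limit_tokens=100_000, reset_after_s=60),
    BudgetTier("team", used_tokens=998_000, limit_tokens=1_000_000, reset_after_s=3_600),
    BudgetTier("global", used_tokens=5_000_000, limit_tokens=50_000_000, reset_after_s=86_400),
]
allowed, headers = check_budgets(tiers, requested_tokens=4_000)
print(allowed, headers)  # False, 429 with Retry-After from the "team" tier
```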
Auth and API key management
Centralizing API key management is one of the highest-value capabilities an AI gateway provides. Instead of each team holding live provider credentials, the gateway holds all provider keys and issues scoped virtual keys to teams and applications. Benefits include:
- Zero credential exposure in application code or environment variables
- Instant key revocation without touching application deployments
- Scoped permissions — a virtual key can be restricted to specific models, budgets, or request types
- Rotation without downtime — provider keys rotate at the gateway without application changes
This directly addresses one of the most common enterprise AI security risks: API keys embedded in repositories or distributed across dozens of engineers.
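For illustration, a virtual key on the gateway side might look something like the following sketch; the field names and scopes are hypothetical, not a particular vendor's schema.

```python
from dataclasses import dataclass, field

@dataclass
class VirtualKey:
    key_id: str                                    # what the application holds
    team: str
    allowed_models: set[str] = field(default_factory=set)
    monthly_token_limit: int = 0
    revoked: bool = False

def authorize(vk: VirtualKey, model: str, tokens_used_this_month: int) -> bool:
    """Gateway-side check; the raw provider key is never exposed to the caller."""
    if vk.revoked:
        return False
    if vk.allowed_models and model not in vk.allowed_models:
        return False
    return tokens_used_this_month < vk.monthly_token_limit

key = VirtualKey(
    key_id="vk-analytics-001",
    team="analytics",
    allowed_models={"gpt-4o-mini", "claude-haiku"},
    monthly_token_limit=2_000_000,
)
print(authorize(key, "gpt-4o-mini", tokens_used_this_month=150_000))  # True
```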
Logging, observability, and cost tracking
Every request through an AI gateway is logged with structured metadata: which user sent it, which model handled it, how many tokens were consumed (input and output separately), latency at each step, cost in dollars, and which fallback (if any) was triggered. This produces a unified audit trail across all AI usage.
Production-grade gateways expose these metrics via Prometheus and OpenTelemetry, integrating with existing observability stacks (Datadog, Grafana, Splunk). Teams get per-model cost dashboards, per-team attribution, and latency percentile views without instrumenting each application separately.
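As a sketch of what that instrumentation might look like, the example below emits per-request token, cost, and latency metrics with the prometheus_client library; the metric names and labels are assumptions rather than a standard any gateway enforces.

```python
from prometheus_client import Counter, Histogram, start_http_server

# Illustrative metric names; real gateways define their own schemas.
TOKENS = Counter("llm_tokens_total", "Tokens consumed", ["model", "team", "direction"])
COST = Counter("llm_cost_usd_total", "Spend in USD", ["model", "team"])
LATENCY = Histogram("llm_request_seconds", "End-to-end request latency", ["model"])

def record_request(model: str, team: str, input_tokens: int,
                   output_tokens: int, cost_usd: float, seconds: float) -> None:
    # One call per completed request, attributed by model and team.
    TOKENS.labels(model=model, team=team, direction="input").inc(input_tokens)
    TOKENS.labels(model=model, team=team, direction="output").inc(output_tokens)
    COST.labels(model=model, team=team).inc(cost_usd)
    LATENCY.labels(model=model).observe(seconds)

start_http_server(9100)  # expose /metrics for Prometheus to scrape
record_request("gpt-4o", "analytics", 1_200, 450, 0.018, 1.7)
```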
Semantic caching
Standard API caching returns cached responses for byte-for-byte identical requests. Semantic caching applies vector similarity: if two prompts ask the same thing in different words, the gateway returns the cached response from the first. Teams using dual-layer semantic caching report 40%+ cache hit rates in production (per Bifrost benchmarks), directly reducing spend and latency without degrading output quality.
Under the hood, semantic caching stores prompt embeddings in a vector store (Weaviate, Qdrant, or Pinecone are common choices). Incoming prompts are embedded and compared against cached entries; above a similarity threshold, the gateway serves the cache. Below threshold, it calls the model and caches the new response.
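The lookup flow can be sketched as follows. A toy embedding function stands in for a real embedding model, and an in-memory list stands in for the vector store; the similarity threshold is illustrative and would be tuned per workload.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Stand-in for a real embedding model (e.g. an embeddings API).
    A toy character-frequency vector keeps the sketch self-contained."""
    vec = np.zeros(128)
    for ch in text.lower():
        vec[ord(ch) % 128] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

SIMILARITY_THRESHOLD = 0.92                 # illustrative; tuned in practice
cache: list[tuple[np.ndarray, str]] = []    # (prompt embedding, cached response)

def lookup(prompt: str) -> str | None:
    query = embed(prompt)
    for cached_vec, response in cache:
        if float(np.dot(query, cached_vec)) >= SIMILARITY_THRESHOLD:
            return response                 # semantic hit: skip the model call
    return None                             # miss: call the model, then store()

def store(prompt: str, response: str) -> None:
    cache.append((embed(prompt), response))

store("Which customers are likely to churn next quarter?", "…cached answer…")
print(lookup("Which customers will probably churn in the coming quarter?"))
```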
Content guardrails and prompt filtering
AI gateways apply rule-based and model-based checks to requests before they reach the LLM:
- Prompt injection detection — flags inputs that attempt to override system instructions
- PII detection in inputs — redacts or blocks sensitive data before it leaves your environment
- Topic and content filtering — blocks off-policy request categories (e.g., requests outside approved use cases)
- Output filtering — validates model responses before returning them to applications
These controls implement AI TRiSM (AI Trust, Risk, and Security Management) policy at the infrastructure layer, without requiring each application team to build its own safety logic.
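As a simplified illustration of input-side guardrails, the sketch below combines a few pattern-based PII redactions with a naive injection check. Production systems typically pair such rules with model-based classifiers; the patterns and markers here are illustrative only.

```python
import re

# Illustrative patterns only; real PII detection is broader and locale-aware.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "credit_card": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
}

INJECTION_MARKERS = ["ignore previous instructions", "disregard the system prompt"]

def apply_guardrails(prompt: str) -> tuple[bool, str]:
    """Return (allowed, sanitized_prompt): block obvious injection attempts
    and redact PII before the prompt leaves the environment."""
    lowered = prompt.lower()
    if any(marker in lowered for marker in INJECTION_MARKERS):
        return False, ""
    sanitized = prompt
    for label, pattern in PII_PATTERNS.items():
        sanitized = pattern.sub(f"[REDACTED_{label.upper()}]", sanitized)
    return True, sanitized

ok, clean = apply_guardrails("Email john.doe@example.com about the renewal.")
print(ok, clean)  # True, with the address replaced by [REDACTED_EMAIL]
```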
AI gateway vs API gateway: key differences
A common point of confusion is whether an existing API gateway (Kong, NGINX, AWS API Gateway) can serve the same purpose as an AI gateway. The short answer is: they handle different kinds of traffic and are designed for fundamentally different requirements.
| Dimension | Traditional API gateway | AI gateway |
|---|---|---|
| Traffic unit | HTTP requests | Tokens / completions / streams |
| Rate limiting | Requests per second or per minute | Tokens per minute per model |
| Cost tracking | Limited — custom metrics required; no native token accounting | Per-token spend by team, model, and user |
| Caching | Exact URL/body match | Semantic similarity (vector-based) |
| Security | Auth, TLS, WAF | Auth + prompt injection, PII, content safety |
| Routing logic | Path-based, header-based | Model-based, latency-based, cost-based |
| Streaming | Limited | Native SSE and WebSocket support |
| Observability | Request/response logs | Token usage, model traces, per-step cost |
Traditional API gateways were designed for REST and gRPC microservices. They have no concept of tokens, no awareness of streaming completion protocols, and no semantic understanding of request payloads. An AI gateway is purpose-built for the LLM layer.
Most enterprises end up running both: the traditional API gateway handles service-to-service and client-to-server traffic; the AI gateway handles all model API traffic. They are complementary infrastructure, not competing tools.
Popular AI gateways compared
The AI gateway market has matured quickly. Here is a practical overview of the major options enterprises evaluate.
LiteLLM
LiteLLM is the most widely adopted open-source LLM gateway. It provides an OpenAI-compatible proxy API across 100+ providers, making it easy to standardize model calls across OpenAI, Anthropic, Mistral, Cohere, AWS Bedrock, and more. Setup is fast and the provider coverage is comprehensive.
The tradeoffs appear at scale. Under sustained loads of 2,000+ requests per second, LiteLLM has exhibited memory usage exceeding 8 GB and cascading timeouts. For regulated or multi-team environments, it requires significant additional tooling for governance and access control.
Best for: teams starting multi-model routing, self-hosted environments, cost-sensitive builds.
Portkey
Portkey positions itself as a full AI control plane rather than just a router. Every request through Portkey generates a complete trace: which user made the call, which model was tried, why fallback was triggered, how long each step took, and the exact cost. This visibility is genuinely useful for debugging agentic workflows.
Portkey’s design is optimized for application-level teams and begins to show constraints when multiple enterprise teams need isolated governance and budget pools.
Best for: single-team LLM applications moving toward production, teams prioritizing per-request observability.
Kong AI Gateway
Kong extends its enterprise API gateway platform with AI-specific plugins: AI rate limiting on tokens (not just requests), semantic caching, model routing, and MCP server auto-generation. According to Kong’s published benchmarks, Kong AI Gateway outperformed Portkey by over 200% and LiteLLM by over 800% under load — though as with all vendor-produced benchmarks, these should be verified against your own workload profile.
Kong is the natural choice for enterprises already standardized on Kong for API management — it adds the AI layer to existing infrastructure rather than requiring a parallel deployment.
Best for: enterprise teams at scale, organizations already using Kong for API management.
AWS Bedrock
AWS Bedrock provides managed, serverless access to Anthropic Claude, Amazon Titan, Meta Llama, Mistral, and other foundation models. Infrastructure is fully abstracted; billing is on token usage. Deep integration with AWS IAM, CloudWatch, VPC endpoints, and PrivateLink makes it the lowest-friction option for AWS-native enterprises.
The limitation is platform lock-in: routing beyond Bedrock-hosted models requires additional tooling, and the control plane stays within the AWS ecosystem.
Best for: AWS-native enterprises wanting managed model access without infrastructure overhead.
Azure AI Gateway
Azure AI Gateway (part of Azure AI Foundry and Azure API Management) provides routing, load balancing, and governance for Azure OpenAI and other Azure AI services. Integration with Entra ID, Azure Monitor, and Azure Policy makes it a natural fit for Microsoft-standardized enterprises.
Like Bedrock, the primary constraint is ecosystem dependency — it works best for organizations already invested in the Microsoft Azure stack.
Best for: Azure-native enterprises, organizations with existing Azure OpenAI deployments.
Databricks AI Gateway (Unity AI Gateway)
Databricks embeds AI gateway functionality within Unity Catalog, providing model governance as part of the Lakehouse governance layer. This gives Databricks customers a single plane for data governance and model governance — tables, pipelines, and models all governed in one place.
The constraint is the same as the others: workloads and governance remain tied to the Databricks control plane. Cross-vendor flexibility outside Databricks-hosted models is limited.
Best for: Databricks customers wanting unified data + AI governance within the Lakehouse.
Bifrost (by Maxim AI)
Bifrost is a high-performance, Go-based open-source AI gateway built for production throughput. It delivers under 11 microseconds of gateway overhead at 5,000 requests per second — roughly 50× lower than LiteLLM under comparable load. Bifrost supports dual-layer semantic caching (hash + vector) and a four-tier token budget system.
Best for: high-throughput production teams where gateway latency and overhead are primary constraints.
How to choose an AI gateway for enterprise
Choosing an AI gateway comes down to five dimensions:
1. Provider coverage — Does it support all models you run today and are likely to run in the next 18 months? Multi-cloud enterprises need broad coverage; AWS-native teams may be fine with Bedrock.
2. Throughput and latency — At what request volume does the gateway start degrading? For teams under 500 RPS, most options are fine. At 2,000+ RPS, only Kong, Bifrost, and managed cloud options hold up under benchmarks.
3. Governance depth — Does the gateway support per-team virtual keys, multi-tier budget pools, and audit log export? This matters more as AI usage scales across multiple teams and use cases.
4. Ecosystem fit — If your organization already runs Kong, integrating Kong AI Gateway is lower friction than deploying a separate tool. Same logic applies for AWS Bedrock and Azure AI Gateway in their respective cloud ecosystems.
5. Open-source vs managed — Open-source gateways (LiteLLM, Bifrost) give you full control and no per-token gateway markup; managed options (Bedrock, Azure) trade control for lower operational burden.
One dimension that no gateway comparison addresses — and the one that matters most for production accuracy — is covered in the next section.
The context gap: what AI gateways don’t solve
Here is where most enterprise AI programs run into a wall that their AI gateway was never designed to fix.
An AI gateway governs how models are called. It does not govern what is sent to them.
When a data analyst asks an AI agent “which customers are at risk of churning,” the gateway routes the request to the right model, enforces the budget, logs the call, and returns the response. What the gateway cannot tell you:
- Was the customer data in that prompt accurate and fresh?
- Were the business definitions of “churn” and “at risk” the same ones finance uses?
- Did PII fields get masked before entering the prompt?
- Does lineage trace from the answer back to the source tables it was derived from?
- Did the agent have permission to access that specific customer segment?
These questions are about context governance — and they live in a layer beneath the gateway. The gateway is traffic management; context governance is about what’s in the payload.
Some platform-native tools (Databricks Unity Catalog, Azure AI Foundry) offer partial data governance overlap within their respective ecosystems. The limitation is scope: they govern data within their platform boundaries, not across the heterogeneous 50-to-200-system data estates that most enterprises actually operate. A governed context layer needs to span all sources — not just one platform.
Enterprises that skip context governance discover this gap in production: the model receives stale data, conflicting definitions, or unmasked PII, and outputs that look plausible but are factually wrong or policy-violating. The gateway did its job; the context layer did not exist.
The AI gateway + context layer stack
The full production AI stack pairs the gateway with a governed context layer:
AI gateway (pick any): Handles which model gets called, with what budget, under what rate limits. Controls the infrastructure pipe.
Context layer (Atlan): Governs what information enters that pipe — semantic definitions, data lineage, PII classifications, access policies, quality signals, and usage metadata. Enforces governance before context reaches the model, not after.
The relationship is complementary, not competitive. As Atlan’s positioning puts it: “Use whatever AI gateway you like; Atlan is the governed context layer behind it that tells the model what it’s allowed to see, what it should trust, and how that context traces back to your data and policies.”
Atlan’s context layer provides:
- Metadata lakehouse + context graph — unifies schema, lineage, glossary, policies, quality, and usage across 100+ systems into a single governed graph
- MCP server — exposes the entire context graph to any MCP-compatible agent as standard tool calls, regardless of which LLM or gateway sits in front
- Runtime governance — policy-triggered access control and PII masking enforced at context delivery time, not as an afterthought
- Compounding context — living context layer with lineage, versioning, and feedback loops, not one-off retrieval
The result: LLM gateways decide how to call models. Atlan decides what governed context those models see — with lineage, policies, and quality built in. You can swap gateways or models; your context layer stays the same.
Atlan customer deployments have reported up to 5× improvement in AI accuracy when grounding models in governed enterprise context, documented in Atlan AI Labs. The accuracy gains do not come from a better model or a better gateway — they come from better context.
For a deeper look at context infrastructure, see Context Infrastructure for AI Agents and Context Engineering and AI Governance.
How the stack layers work together
┌──────────────────────────────────────────────────┐
│                 AI Applications                  │
│       (analysts, agents, product features)       │
└────────────────────────┬─────────────────────────┘
                         │ model API calls
┌────────────────────────▼─────────────────────────┐
│                    AI Gateway                    │
│  routing · rate limiting · auth · cost · cache   │
│    (LiteLLM, Kong, Portkey, Bedrock, Azure…)     │
└────────────────────────┬─────────────────────────┘
                         │ governed context delivery
┌────────────────────────▼─────────────────────────┐
│              Context Layer (Atlan)               │
│    metadata graph · lineage · glossary · PII     │
│     policies · quality signals · MCP server      │
└────────────────────────┬─────────────────────────┘
                         │ governed data access
┌────────────────────────▼─────────────────────────┐
│                  Data Platforms                  │
│     Snowflake · Databricks · BigQuery · dbt      │
└──────────────────────────────────────────────────┘
AI Applications call models to answer questions, generate content, and take actions.
AI Gateway standardizes model API access — routing, budgeting, key management, logging. It ensures the right model gets called, at the right cost, with the right observability.
Context Layer (Atlan) governs what those models see. Before any context reaches the model, Atlan enforces: which assets this agent can access, which fields are PII-masked, what the official business definition of a metric is, and what the lineage trace looks like for every piece of data in the prompt.
Data Platforms are the source systems — warehouses, lakehouses, transformation pipelines — that produce the data the context layer governs.
Real stories from real customers: governing context at enterprise scale
"AI initiatives require more context than ever. Atlan's metadata lakehouse is configurable, intuitive, and able to scale to hundreds of millions of assets. As we're doing this, we're making life easier for data scientists and speeding up innovation."
— Andrew Reiskind, Chief Data Officer, Mastercard
"We're excited to build the future of AI governance with Atlan. All of the work that we did to get to a shared language at Workday can be leveraged by AI via Atlan's MCP server…as part of Atlan's AI Labs, we're co-building the semantic layer that AI needs with new constructs, like context products."
— Joe DosSantos, VP of Enterprise Data & Analytics, Workday
The AI gateway is necessary — but not sufficient
An AI gateway is essential infrastructure for any enterprise operating AI at scale. It solves real, costly problems: runaway spending, fragmented API key management, lack of visibility into who is calling what model, and zero fallback when a provider goes down. For teams building or scaling LLM applications, adding an AI gateway is not optional — it is the minimum viable control plane for production.
But the gateway is the traffic layer, not the intelligence layer. It controls the pipe; it does not govern the payload. The most consequential question in enterprise AI is not “which model did we call?” It is “was the context we sent that model accurate, governed, and traceable?” Answering that question requires the context layer.
Atlan and an AI gateway are complementary, not competing. The gateway (LiteLLM, Kong, Portkey, Bedrock, Azure, Databricks — pick the one that fits your stack) handles the infrastructure pipe. Atlan handles the governed context layer underneath: what data agents can access, what business definitions they operate with, what policies apply to each request, and what lineage exists for every piece of context delivered to the model.
Together, they form the architecture that takes AI from convincing demo to production-grade system. To explore what Atlan’s context layer looks like in practice, see What Is Atlan MCP? and Context Infrastructure for AI Agents.
Frequently asked questions about AI gateways
1. What is the difference between an AI gateway and a traditional API gateway?
A traditional API gateway manages HTTP traffic between microservices — routing, authentication, and request-level rate limiting. An AI gateway is purpose-built for LLM traffic: it enforces token-based rate limits that match how providers bill, tracks per-model and per-team spend, performs semantic caching on prompt similarity rather than exact URL matches, handles streaming via SSE and WebSocket, and applies content-level security like prompt injection detection. Most enterprises need both — the API gateway for service traffic, the AI gateway for model traffic.
2. What is an LLM gateway?
An LLM gateway is the same category of tool as an AI gateway — a middleware layer between applications and large language model APIs. The terms are interchangeable, with “AI gateway” becoming the broader industry label as these tools have expanded to cover not just language models but also embedding APIs, vision models, and agentic workflows involving MCP servers.
3. Do I need an AI gateway if I only use one LLM provider?
Yes, for most enterprise teams. Even with a single provider, an AI gateway provides: centralized API key management (no raw keys in application code), per-team and per-project budget enforcement, unified logging and cost attribution, fallback handling if that provider has downtime, and semantic caching to reduce spend. The cost and governance benefits apply regardless of provider count.
4. Can an AI gateway prevent hallucinations?
Partially. AI gateways can apply guardrails that filter known harmful or off-policy outputs. But hallucinations that result from incomplete or inaccurate context in the prompt are not addressable at the gateway layer — the gateway does not know whether the context being sent is accurate. Preventing hallucinations driven by bad context requires a governed context layer that delivers verified, lineage-traced, semantically accurate information to the model before the gateway call is made.
5. How does semantic caching work in an AI gateway?
Semantic caching stores vector embeddings of prompts in a vector database (Weaviate, Qdrant, or Pinecone). When a new prompt arrives, the gateway embeds it and compares it to cached entries. If similarity exceeds a configured threshold, the gateway returns the cached response without calling the model. This catches semantically equivalent prompts even when they are worded differently. Vendors offering dual-layer caching (exact hash + semantic vector match) report 40%+ cache hit rates in production deployments.
6. What is the difference between an AI gateway and a context layer?
An AI gateway controls how model API calls are made — routing, budget, auth, logging. A context layer governs what information those calls contain: business definitions, data lineage, PII classifications, access policies, and quality signals. The gateway is the transport layer; the context layer is the governance layer for the payload. Atlan is the governed context layer that sits behind any AI gateway, ensuring every model call receives accurate, policy-compliant, lineage-traced context.
7. Which AI gateway is best for enterprise?
The best choice depends on your stack: Kong AI Gateway leads on throughput and enterprise governance depth; Portkey leads on per-request observability for application teams; LiteLLM leads on open-source flexibility and broad provider support; AWS Bedrock and Azure AI Gateway are strongest for teams already standardized on those cloud platforms; Databricks Unity AI Gateway is best for Databricks customers wanting unified data and model governance. Bifrost leads on raw performance for high-throughput workloads.
8. Where does Atlan fit in the AI gateway stack?
Atlan is not an AI gateway — it is the governed context layer that operates below the gateway. An AI gateway routes requests and enforces operational policies (rate limits, budgets, auth). Atlan governs what context enters those requests: which data assets are accessible, which fields are PII-masked, what the official business definitions are, and what lineage exists for every piece of information. Atlan integrates with any AI gateway via its MCP server, providing a standard interface for governed context delivery regardless of which gateway or model framework you run.
Sources
- ngrok: What are AI gateways in 2026, and do you actually need one now?
- TrueFoundry: A Definitive Guide to AI Gateways in 2026: Competitive Landscape Comparison
- Kong Inc: What Is an AI Gateway?
- Kong Inc: API Gateway vs AI Gateway — Key Differences
- Helicone: Top 5 LLM Gateways in 2025 — The Complete Guide
- Maxim AI: Best LLM Gateways in 2025 — Features, Benchmarks, Builder’s Guide
- Menlo Ventures: The State of Generative AI in the Enterprise 2025
- Datadog: State of AI Engineering
- TrueFoundry: Rate Limiting in AI Gateway — The Ultimate Guide
- Atlan: Context Infrastructure for AI Agents