What OpenAI's Data Agent Reveals About Enterprise AI

by Emily Winks, Data governance expert at Atlan. Last updated on: January 30th, 2026 | 9 min read

Quick answer: What is OpenAI's data agent?

OpenAI's internal data agent is a conversational AI system that lets employees analyze data using natural language. Built on GPT-5.2, it serves 3,500+ users across 600 petabytes of data and 70,000 datasets. It works because OpenAI engineered six distinct layers of context that ground the agent in organizational reality.

What makes it work:

  • Six context layers: Table usage, human annotations, automated code analysis, institutional knowledge, memory, and live validation
  • Continuous evaluation: Golden SQL queries that catch regressions before users notice
  • Built to think and work like a teammate: Refines unclear questions instead of failing silently
  • Self-correction: Fixes mistakes mid-analysis without user intervention

Below: what OpenAI built, the six context layers, enterprise challenges, three context types, evaluation systems.


What OpenAI actually built: An agent that knows what humans know

OpenAI’s data agent writes SQL. So does every “talk to data” tool. The difference: it writes the right SQL.

3,500+ employees ask questions like “What was ChatGPT WAU on Oct 6, 2025 compared with DevDay 2023?” The agent returns accurate answers in minutes. When someone asks about “customer satisfaction in Europe for enterprise customers,” it knows “customer satisfaction” means AVG(csat_score) from customer_feedback—not customer_health_score from retention.
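A rough illustration of the two readings, using hypothetical table and column names rather than OpenAI's actual schema:

```python
# Hypothetical schema. Two plausible readings of "customer satisfaction in
# Europe for enterprise customers" produce very different numbers.

# What the question means in this organization:
grounded_sql = """
SELECT AVG(f.csat_score)
FROM customer_feedback f
JOIN customers c ON f.customer_id = c.customer_id
WHERE c.region = 'EMEA'
  AND c.segment = 'enterprise'
"""

# A plausible-looking misread an ungrounded agent might produce instead:
naive_sql = """
SELECT AVG(customer_health_score)
FROM retention
WHERE region = 'Europe'
"""
```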

The agent self-corrects. Zero rows from a bad join? It investigates, adjusts, tries again. Users don’t see the failure.
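A minimal sketch of that retry loop, with the model call and warehouse execution left as hypothetical callables rather than OpenAI's actual implementation:

```python
def answer_with_self_correction(question, generate_sql, run_sql, max_attempts=3):
    """Run generated SQL; if it errors or returns nothing, feed the failure
    back to the model and retry instead of surfacing the failure to the user."""
    feedback = None
    for _ in range(max_attempts):
        sql = generate_sql(question, feedback)  # hypothetical model call
        try:
            rows = run_sql(sql)                 # hypothetical warehouse call
        except Exception as exc:
            feedback = f"The query failed with: {exc}. Revise it."
            continue
        if not rows:
            feedback = "The query returned zero rows; re-check the join keys and filters."
            continue
        return rows
    raise RuntimeError("No usable result after several attempts; escalate to the user.")
```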

It carries context across turns. Interrupt mid-analysis to redirect? It incorporates feedback and continues. No restart, no context loss.

The 180-line SQL problem

OpenAI’s article shows a screenshot: SQL spanning 180+ lines. Caption: “It’s not easy to know if we’re joining the right tables and querying the right columns.”

Even OpenAI’s engineers struggle with complexity. The problem isn’t writing SQL—it’s knowing if it’s correct. Many-to-many joins silently inflate row counts. Filter pushdown errors exclude intended data. Unhandled nulls change aggregates. These don’t throw errors. They just deliver wrong numbers.
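A toy demonstration of the first failure mode, using invented tables in an in-memory SQLite database: a join that fans out duplicates rows and inflates the aggregate without raising any error.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INTEGER, amount REAL);
    CREATE TABLE shipments (shipment_id INTEGER, order_id INTEGER);
    INSERT INTO orders VALUES (1, 100.0), (2, 50.0);
    -- order 1 shipped in two parcels: one order row, two shipment rows
    INSERT INTO shipments VALUES (10, 1), (11, 1), (12, 2);
""")

# Looks reasonable, but the join duplicates order 1, overstating revenue.
inflated = conn.execute("""
    SELECT SUM(o.amount)
    FROM orders o JOIN shipments s ON o.order_id = s.order_id
""").fetchone()[0]

# Correct: aggregate at the order grain instead of after the fan-out join.
correct = conn.execute("SELECT SUM(amount) FROM orders").fetchone()[0]

print(inflated, correct)  # 250.0 vs. 150.0 -- no error, just a wrong number
```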

That’s why OpenAI built evaluation systems. Without continuous testing, quality drifts unnoticed until users report conflicting numbers.


OpenAI’s six layers of context (and why each matters)

OpenAI didn’t build a smarter chatbot. They built context infrastructure.

Layer 1: Table usage patterns

The agent learns from historical queries. Analysts typically join customer_feedback.customer_id with customers.customer_id? The agent infers that relationship even if schemas don’t document it.
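A minimal sketch of that inference, assuming a hypothetical log of historical queries and using a simple regex rather than a real SQL parser:

```python
import re
from collections import Counter

# Hypothetical query log; in practice this would come from the warehouse's
# query history.
query_log = [
    "SELECT ... FROM customer_feedback f JOIN customers c ON f.customer_id = c.customer_id",
    "SELECT ... FROM customer_feedback f JOIN customers c ON f.customer_id = c.customer_id",
    "SELECT ... FROM orders o JOIN customers c ON o.customer_id = c.customer_id",
]

join_pattern = re.compile(r"ON\s+(\w+)\.(\w+)\s*=\s*(\w+)\.(\w+)", re.IGNORECASE)
join_counts = Counter()
for sql in query_log:
    for left_alias, left_col, right_alias, right_col in join_pattern.findall(sql):
        join_counts[(left_col, right_col)] += 1

# Frequently co-occurring join keys become candidate relationships the agent
# can rely on even when the schema declares no foreign keys. A fuller version
# would resolve table aliases and weigh recency as well as frequency.
print(join_counts.most_common())
```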

Layer 2: Human annotations

Domain experts write descriptions capturing intent, semantics, and caveats that schemas miss. Someone documents that csat_score means “Average satisfaction rating from post-support surveys” and excludes automated follow-ups. The agent can’t guess that.

Layer 3: Codex enrichment

The breakthrough. OpenAI uses Codex to crawl their codebase and extract table definitions, grain, primary keys, and freshness signals from actual pipeline code.

From the article: “Meaning lives in code. Pipeline logic captures assumptions, freshness guarantees, and business intent that never surface in SQL or metadata.”

Your dbt models, Airflow DAGs, and Spark jobs contain critical context. They show which records a table excludes, how often it refreshes, and what upstream transformations shaped it. That context exists today, but it stays locked in code.
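OpenAI uses Codex for this crawl. As a loose sketch of the same idea, the snippet below sends one pipeline file to a model via the OpenAI chat completions API and asks for structured metadata; the model name, prompt, and file path are illustrative, not what OpenAI runs.

```python
from pathlib import Path
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PROMPT = (
    "Read this pipeline code and return JSON describing: table_name, grain, "
    "primary_key, refresh_schedule, and any filters that exclude rows."
)

def enrich_from_pipeline_code(model_file: Path) -> str:
    """Extract table metadata that lives only in pipeline code, not in the schema."""
    response = client.chat.completions.create(
        model="gpt-4o",  # illustrative model choice
        messages=[
            {"role": "system", "content": PROMPT},
            {"role": "user", "content": model_file.read_text()},
        ],
    )
    return response.choices[0].message.content

# e.g. enrich_from_pipeline_code(Path("models/marts/fct_daily_active_users.sql"))
```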

Layer 4: Institutional knowledge

The agent accesses Slack, Docs, Notion to capture launches, incidents, codenames, metric definitions—context outside data systems.

Usage dipped in December? The agent references a Slack thread about logging issues starting November 13. It connects data patterns to organizational events.

Layer 5: Memory

User corrections get saved. Future queries start from accurate baselines instead of repeating mistakes.

Specific example from the article: “The agent didn’t know how to filter for a particular analytics experiment (it relied on matching against a specific string defined in an experiment gate). Memory was crucially important here.”

These edge cases exist everywhere. Memory captures them as they surface.
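A minimal sketch of that feedback loop, with an invented storage location and topic scheme; a real system would scope memories per user and per dataset:

```python
import json
from pathlib import Path

MEMORY_FILE = Path("agent_memory.json")  # hypothetical storage location

def save_correction(topic: str, correction: str) -> None:
    """Persist a user correction so future queries start from it."""
    memory = json.loads(MEMORY_FILE.read_text()) if MEMORY_FILE.exists() else {}
    memory.setdefault(topic, []).append(correction)
    MEMORY_FILE.write_text(json.dumps(memory, indent=2))

def recall(topic: str) -> list[str]:
    """Fetch saved corrections to prepend to the agent's context for a topic."""
    if not MEMORY_FILE.exists():
        return []
    return json.loads(MEMORY_FILE.read_text()).get(topic, [])

# The experiment-gate case from the article, paraphrased as a stored note:
save_correction(
    "analytics_experiments",
    "Filter experiment rows by matching the exact string defined in the experiment gate.",
)
```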

Layer 6: Runtime context

When context doesn’t exist or is stale, the agent queries the warehouse live to inspect schemas.
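A minimal sketch of that fallback, written against a generic DB-API cursor; the placeholder style varies by driver, and some warehouses expose richer catalogs than information_schema:

```python
def describe_table(cursor, table_name: str) -> list[tuple[str, str]]:
    """Inspect the live warehouse when cataloged context is missing or stale."""
    cursor.execute(
        """
        SELECT column_name, data_type
        FROM information_schema.columns
        WHERE table_name = %s
        ORDER BY ordinal_position
        """,
        (table_name,),
    )
    return cursor.fetchall()

# Usage, assuming an open DB-API connection to the warehouse:
# with connection.cursor() as cur:
#     print(describe_table(cur, "customer_feedback"))
```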

The counterintuitive lessons

“Less is more.” OpenAI exposed their full tool set and hit problems. Overlapping functionality confuses agents. Multiple ways to accomplish tasks—helpful for humans, harmful for AI. Fix: consolidate and simplify.

“Guide the goal, not the path.” Prescriptive prompting degraded results. Rigid instructions pushed agents down wrong paths. Better: describe outcomes, let GPT-5 reason about execution.

Most organizations do the opposite—elaborate prompt libraries, detailed step-by-step instructions. OpenAI found this approach actually degrades performance.


The heterogeneity problem

A single customer renewal decision requires context from six systems:

  • PagerDuty: Incident history (operational)
  • Zendesk: Escalation threads (operational)
  • Slack: VP approvals (operational)
  • Confluence: SOPs, runbooks (operational)
  • Salesforce: Deal records (analytical)
  • Snowflake: Usage, health metrics (analytical)

Different teams own each system. Each defines “customer” differently. Context scatters across tools that don’t communicate.

OpenAI built six context layers within one environment. Enterprises need the same context infrastructure across six or more disconnected systems.

The 70,000 dataset paradox

OpenAI struggles with scale. From their article, an internal user: “We have a lot of tables that are fairly similar, and I spend tons of time trying to figure out how they’re different and which to use. Some include logged-out users, some don’t.”

70,000 datasets in one platform, and disambiguation problems persist. Centralization trades “where is it?” for “which one is it?”

If OpenAI hits this with everything unified, imagine when data spans Snowflake, Databricks, BigQuery, S3, plus SaaS tools—each with different naming conventions.


The three types of context enterprises need

There are three foundational types of context your enterprise needs to build.

Context Type 1: Data context

Maps to OpenAI’s Table Usage and Runtime Context. Schemas, lineage, relationships, query patterns, live validation. What tables exist? How do they connect? What’s the grain? Who owns it?

Modern data catalogs with automated lineage provide this. Critical: make it accessible through APIs, not just human-readable pages.

Context Type 2: Analytical context

Maps to Human Annotations and Codex Enrichment. Business glossaries, metric definitions, semantic relationships, transformation logic.

Context graphs connect semantic meaning (“what does ‘customer’ mean?”) with operational lineage (“how does customer_id flow?”).

Unlike Codex parsing pipeline code, context graphs make business logic explicitly queryable. An agent doesn’t parse dbt models to learn “active customer” means subscription_status = ‘active’ AND last_activity > 30 days. That relationship exists in the graph.
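A minimal sketch of what “explicitly queryable” can mean in practice, with an invented in-memory structure standing in for a real context graph:

```python
# Invented structure standing in for a governed context graph entry.
CONTEXT_GRAPH = {
    "active customer": {
        "definition": "Customer with a live subscription and activity in the last 30 days",
        "sql_filter": "subscription_status = 'active' AND last_activity > CURRENT_DATE - 30",
        "owner": "lifecycle-analytics",
        "source_tables": ["customers"],
    }
}

def resolve_term(term: str) -> str:
    """Return the canonical SQL filter for a business term, if one is governed."""
    entry = CONTEXT_GRAPH.get(term.lower())
    if entry is None:
        raise KeyError(f"No governed definition for {term!r}; ask rather than guess.")
    return entry["sql_filter"]

# resolve_term("active customer")
# -> "subscription_status = 'active' AND last_activity > CURRENT_DATE - 30"
```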

Context Type 3: Operational context

Maps to Institutional Knowledge and Memory. Decision traces, approval history, edge cases, “except when…” rules.

Hardest layer—undocumented by nature. OpenAI captures reactively through corrections. Enterprises need feedback loops plus proactive structured decision logging.

From guessing to understanding: What deep context delivers

Working with Workday on real test cases, deep context engineering delivered a 5x accuracy improvement.

Not 5% better. Five times better.

Without deep context:
The AI agent answered “How many enterprise customers renewed last quarter?” by counting every account with renewal_date in range. Wrong—missed that enterprise means contract_value > $100K AND employee_count > 500, and “renewed” excludes trials converting to paid.

With deep context:
Same question, agent applies correct filters because context graphs encode: enterprise = contract tier + employee threshold, renewed = specific status codes excluding trial conversions, last quarter = fiscal calendar (not calendar year).
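Side by side, with a hypothetical schema and illustrative fiscal-quarter dates; the thresholds come from the scenario above, not from Workday's actual data:

```python
# Both queries answer "How many enterprise customers renewed last quarter?"

# Without deep context: counts every account with a renewal date in range.
naive_sql = """
SELECT COUNT(*)
FROM accounts
WHERE renewal_date BETWEEN '2025-10-01' AND '2025-12-31'
"""

# With deep context: applies the encoded definitions of "enterprise",
# "renewed", and the fiscal (not calendar) quarter boundaries.
grounded_sql = """
SELECT COUNT(*)
FROM accounts
WHERE contract_value > 100000
  AND employee_count > 500
  AND renewal_status IN ('renewed', 'expanded')  -- excludes trial conversions
  AND renewal_date BETWEEN '2025-11-01' AND '2026-01-31'  -- fiscal quarter
"""
```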

The difference: being understood, not just heard. Deep context means agents stop guessing and start reasoning. They distinguish similar-looking tables, apply correct filters, understand organizational meaning beyond column names.


Why evaluation systems matter as much as agents

OpenAI treats their agent like production code. Most enterprises treat AI agents like experiments.

OpenAI built curated question-answer pairs. Each question targets important metrics. Each includes manually authored “golden” SQL producing expected results.

Every evaluation:

  1. Send question to agent
  2. Execute generated SQL
  3. Compare against expected result
  4. Feed both to the Evals API for scoring

From the article: “Evals are like unit tests that run continuously during development to identify regressions as canaries in production.”
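A minimal sketch of that four-step loop, with the agent and warehouse left as hypothetical callables and plain row comparison standing in for the Evals API scoring the article describes:

```python
# Hypothetical golden set; each entry pairs a real question with hand-written SQL.
GOLDEN_SET = [
    {
        "question": "How many distinct users were active last week?",
        "golden_sql": "SELECT COUNT(DISTINCT user_id) FROM weekly_active_users",
    },
]

def rows_match(actual, expected) -> bool:
    """Order-insensitive comparison; a real harness also handles types and rounding."""
    return sorted(map(tuple, actual)) == sorted(map(tuple, expected))

def evaluate(run_agent, run_sql) -> float:
    passed = 0
    for case in GOLDEN_SET:
        generated_sql = run_agent(case["question"])   # 1. send question to agent
        actual = run_sql(generated_sql)               # 2. execute generated SQL
        expected = run_sql(case["golden_sql"])        # 3. compute expected result
        passed += rows_match(actual, expected)        # 4. score the comparison
    return passed / len(GOLDEN_SET)
```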

Why agents fail without evaluation

Deploying without evaluation: code without tests. Quality drifts invisibly. Prompt changes break capabilities. Model updates introduce failures. Nobody notices until users complain about conflicting numbers.

The investment is real: curate Q&A pairs, write expected SQL, and build comparison logic that tolerates syntactic differences while still catching semantic errors.

It’s the difference between agents that improve versus degrade.

What to measure

Mapping accuracy: Right fields and relationships? “Customer satisfaction” → CSAT table or NPS table or retention score?

Coverage: Percentage of questions handled confidently? Does it know when to say “I lack context for accurate answers”?

Consistency: Same answer to same question? Or variation based on which tables context retrieval found first?

Confidence calibration: When confident, is it right? When hedging, genuinely unsure?

OpenAI’s threshold from the article: 80% accuracy, 70% consistency before deployment. Anything less means shipping agents that confidently deliver wrong answers.
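Consistency is the easiest of these to operationalize: ask the same question repeatedly and measure how often the agent lands on the same answer. A minimal sketch, with the agent-plus-execution step left as a hypothetical callable:

```python
from collections import Counter

def consistency(ask_and_execute, question: str, trials: int = 5) -> float:
    """Fraction of repeated runs that return the modal answer to one question."""
    answers = [ask_and_execute(question) for _ in range(trials)]
    modal_count = Counter(map(str, answers)).most_common(1)[0][1]
    return modal_count / trials

# A score near 1.0 means answers don't depend on which tables context retrieval
# surfaced first; the bar OpenAI cites is 0.7 before deployment.
```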


What this means for data teams

OpenAI’s success proves AI can transform data interaction. But the article reveals what matters more: what had to exist before AI could work.

Six context layers. Continuous evaluation. Self-improving memory. Infrastructure making organizational knowledge machine-readable.

The bottleneck isn’t the model. It’s the context layer underneath.

This changes the conversation. Not “which AI agent?” but “do we have context infrastructure to make any agent work reliably?”

Most don’t. They have documentation six months stale, business definitions varying by department, tribal knowledge locked in heads, data scattered across non-communicating systems, no systematic testing of AI correctness.

Building context infrastructure means solving these first. Making metadata active, not passive. Capturing business logic in graphs, not Confluence. Treating context as product, not byproduct.

The gap between proof-of-concept and production: layers of context.

That infrastructure requires investment—tools, yes, but more importantly how data teams work. Context engineering becomes core competency. Evaluation systems become mandatory. Memory and feedback loops become standard.

Alternative: join the 95% of AI pilots that fail. Not because models aren’t good enough. Not because teams aren’t smart enough. But because context infrastructure doesn’t exist to make agents reliable.

OpenAI’s article isn’t a blueprint for their exact system. It’s proof that context infrastructure matters more than model choice. Validation that data team work—documenting metadata, building glossaries, capturing lineage—isn’t just governance. It’s the foundation making AI possible.


FAQs about OpenAI’s data agent and enterprise AI analytics

1. How is OpenAI’s internal data agent different from ChatGPT?

OpenAI’s internal data agent is purpose-built for analyzing company data using natural language. Unlike ChatGPT providing general knowledge, the data agent is grounded in OpenAI’s specific tables, metrics, business logic, and institutional knowledge through six context layers. It serves 3,500+ employees across 600 petabytes of proprietary data.

2. What are the six layers of context in OpenAI’s data agent?

The six layers: (1) Table usage patterns from historical queries, (2) Human annotations capturing business meaning, (3) Codex enrichment extracting meaning from pipeline code, (4) Institutional knowledge from Slack/Docs/Notion, (5) Memory systems learning from corrections, (6) Runtime context via live warehouse queries. These ground the agent in organizational reality, not just schemas.

3. Why do 95% of AI data agents fail in production?

Most fail because of the context gap—space between what AI knows and what humans know but haven’t documented. AI lacks tribal knowledge, edge cases, business definitions, judgment calls. Without context infrastructure capturing and delivering this, agents produce confidently wrong answers. OpenAI engineered six context layers before deploying. Most enterprises skip foundational work.

4. What is context engineering for AI analysts?

Context engineering builds infrastructure connecting data definitions, metrics, and business rules to AI in machine-readable formats. Rather than documenting data, it makes organizational knowledge accessible through semantic layers, metadata graphs, and automated enrichment. Ensures AI agents understand “customer” or “revenue” identically to humans—with nuances, edge cases, exceptions.

5. What are the three types of context enterprises need for AI agents?

Enterprises need three context types: (1) Data context—schemas, lineage, relationships, query patterns, live validation; (2) Analytical context—business glossaries, metric definitions, semantic relationships, transformation logic; (3) Operational context—decision traces, approval history, edge cases, and exception rules. These map to OpenAI’s six layers but are architected for distributed enterprise environments.


Atlan is the next-generation platform for data and AI governance. It is a control plane that stitches together a business's disparate data infrastructure, cataloging and enriching data with business context and security.

