What OpenAI's Data Agent Reveals About Enterprise AI

by Emily Winks, Data governance expert at Atlan. Last updated on: January 30th, 2026 | 9 min read

Quick answer: What is OpenAI's data agent?

OpenAI's internal data agent is a conversational AI system that lets employees analyze data using natural language. Built on GPT-5.2, it serves 3,500+ users across 600 petabytes of data and 70,000 datasets. It works because OpenAI engineered six distinct layers of context that ground the agent in organizational reality.

What makes it work:

  • Six context layers: Table usage, human annotations, automated code analysis, institutional knowledge, memory, and live validation
  • Continuous evaluation: Golden SQL queries that catch regressions before users notice
  • Built to think and work like a teammate: Refines unclear questions instead of failing silently
  • Self-correction: Fixes mistakes mid-analysis without user intervention

Below: what OpenAI built, the six context layers, enterprise challenges, three context types, evaluation systems.


What OpenAI actually built: An agent that knows what humans know

OpenAI’s data agent writes SQL. So does every “talk to data” tool. The difference: it writes the right SQL.

3,500+ employees ask questions like “What was ChatGPT WAU on Oct 6, 2025 compared with DevDay 2023?” The agent returns accurate answers in minutes. When someone asks about “customer satisfaction in Europe for enterprise customers,” it knows “customer satisfaction” means AVG(csat_score) from customer_feedback—not customer_health_score from retention.
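A rough illustration of the two readings, using hypothetical table and column names rather than OpenAI's actual schema:

```python
# Hypothetical schema. Two plausible readings of "customer satisfaction in
# Europe for enterprise customers" produce very different numbers.

# What the question means in this organization:
grounded_sql = """
SELECT AVG(f.csat_score)
FROM customer_feedback f
JOIN customers c ON f.customer_id = c.customer_id
WHERE c.region = 'EMEA'
  AND c.segment = 'enterprise'
"""

# A plausible-looking misread an ungrounded agent might produce instead:
naive_sql = """
SELECT AVG(customer_health_score)
FROM retention
WHERE region = 'Europe'
"""
```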

The agent self-corrects. Zero rows from a bad join? It investigates, adjusts, tries again. Users don’t see the failure.
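A minimal sketch of that retry loop, with the model call and warehouse execution left as hypothetical callables rather than OpenAI's actual implementation:

```python
def answer_with_self_correction(question, generate_sql, run_sql, max_attempts=3):
    """Run generated SQL; if it errors or returns nothing, feed the failure
    back to the model and retry instead of surfacing the failure to the user."""
    feedback = None
    for _ in range(max_attempts):
        sql = generate_sql(question, feedback)  # hypothetical model call
        try:
            rows = run_sql(sql)                 # hypothetical warehouse call
        except Exception as exc:
            feedback = f"The query failed with: {exc}. Revise it."
            continue
        if not rows:
            feedback = "The query returned zero rows; re-check the join keys and filters."
            continue
        return rows
    raise RuntimeError("No usable result after several attempts; escalate to the user.")
```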

It carries context across turns. Interrupt mid-analysis to redirect? It incorporates feedback and continues. No restart, no context loss.

The 180-line SQL problem

OpenAI’s article shows a screenshot: SQL spanning 180+ lines. Caption: “It’s not easy to know if we’re joining the right tables and querying the right columns.”

Even OpenAI’s engineers struggle with complexity. The problem isn’t writing SQL—it’s knowing if it’s correct. Many-to-many joins silently inflate row counts. Filter pushdown errors exclude intended data. Unhandled nulls change aggregates. These don’t throw errors. They just deliver wrong numbers.
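A toy demonstration of the first failure mode, using invented tables in an in-memory SQLite database: a join that fans out duplicates rows and inflates the aggregate without raising any error.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INTEGER, amount REAL);
    CREATE TABLE shipments (shipment_id INTEGER, order_id INTEGER);
    INSERT INTO orders VALUES (1, 100.0), (2, 50.0);
    -- order 1 shipped in two parcels: one order row, two shipment rows
    INSERT INTO shipments VALUES (10, 1), (11, 1), (12, 2);
""")

# Looks reasonable, but the join duplicates order 1, overstating revenue.
inflated = conn.execute("""
    SELECT SUM(o.amount)
    FROM orders o JOIN shipments s ON o.order_id = s.order_id
""").fetchone()[0]

# Correct: aggregate at the order grain instead of after the fan-out join.
correct = conn.execute("SELECT SUM(amount) FROM orders").fetchone()[0]

print(inflated, correct)  # 250.0 vs. 150.0 -- no error, just a wrong number
```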

That’s why OpenAI built evaluation systems. Without continuous testing, quality drifts unnoticed until users report conflicting numbers.


OpenAI’s six layers of context (and why each matters)

OpenAI didn’t build a smarter chatbot. They built context infrastructure.

Layer 1: Table usage patterns

The agent learns from historical queries. Analysts typically join customer_feedback.customer_id with customers.customer_id? The agent infers that relationship even if schemas don’t document it.
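A minimal sketch of that inference, assuming a hypothetical log of historical queries and using a simple regex rather than a real SQL parser:

```python
import re
from collections import Counter

# Hypothetical query log; in practice this would come from the warehouse's
# query history.
query_log = [
    "SELECT ... FROM customer_feedback f JOIN customers c ON f.customer_id = c.customer_id",
    "SELECT ... FROM customer_feedback f JOIN customers c ON f.customer_id = c.customer_id",
    "SELECT ... FROM orders o JOIN customers c ON o.customer_id = c.customer_id",
]

join_pattern = re.compile(r"ON\s+(\w+)\.(\w+)\s*=\s*(\w+)\.(\w+)", re.IGNORECASE)
join_counts = Counter()
for sql in query_log:
    for left_alias, left_col, right_alias, right_col in join_pattern.findall(sql):
        join_counts[(left_col, right_col)] += 1

# Frequently co-occurring join keys become candidate relationships the agent
# can rely on even when the schema declares no foreign keys. A fuller version
# would resolve table aliases and weigh recency as well as frequency.
print(join_counts.most_common())
```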

Layer 2: Human annotations

Domain experts write descriptions capturing intent, semantics, and caveats that schemas miss. Someone documents that csat_score means “Average satisfaction rating from post-support surveys” and excludes automated follow-ups. The agent can’t guess that.

Layer 3: Codex enrichment

The breakthrough. OpenAI uses Codex to crawl their codebase and extract table definitions, grain, primary keys, and freshness signals from actual pipeline code.

From the article: “Meaning lives in code. Pipeline logic captures assumptions, freshness guarantees, and business intent that never surface in SQL or metadata.”

Your dbt models, Airflow DAGs, and Spark jobs contain critical context. They show which records a table excludes, how often it refreshes, and what upstream transformations shaped it. That context exists today, but it stays locked in code.
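OpenAI uses Codex for this crawl. As a loose sketch of the same idea, the snippet below sends one pipeline file to a model via the OpenAI chat completions API and asks for structured metadata; the model name, prompt, and file path are illustrative, not what OpenAI runs.

```python
from pathlib import Path
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PROMPT = (
    "Read this pipeline code and return JSON describing: table_name, grain, "
    "primary_key, refresh_schedule, and any filters that exclude rows."
)

def enrich_from_pipeline_code(model_file: Path) -> str:
    """Extract table metadata that lives only in pipeline code, not in the schema."""
    response = client.chat.completions.create(
        model="gpt-4o",  # illustrative model choice
        messages=[
            {"role": "system", "content": PROMPT},
            {"role": "user", "content": model_file.read_text()},
        ],
    )
    return response.choices[0].message.content

# e.g. enrich_from_pipeline_code(Path("models/marts/fct_daily_active_users.sql"))
```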

Layer 4: Institutional knowledge

The agent accesses Slack, Docs, Notion to capture launches, incidents, codenames, metric definitions—context outside data systems.

Usage dipped in December? The agent references a Slack thread about logging issues starting November 13. It connects data patterns to organizational events.

Layer 5: Memory

User corrections get saved. Future queries start from accurate baselines instead of repeating mistakes.

Specific example from the article: “The agent didn’t know how to filter for a particular analytics experiment (it relied on matching against a specific string defined in an experiment gate). Memory was crucially important here.”

These edge cases exist everywhere. Memory captures them as they surface.
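A minimal sketch of that feedback loop, with an invented storage location and topic scheme; a real system would scope memories per user and per dataset:

```python
import json
from pathlib import Path

MEMORY_FILE = Path("agent_memory.json")  # hypothetical storage location

def save_correction(topic: str, correction: str) -> None:
    """Persist a user correction so future queries start from it."""
    memory = json.loads(MEMORY_FILE.read_text()) if MEMORY_FILE.exists() else {}
    memory.setdefault(topic, []).append(correction)
    MEMORY_FILE.write_text(json.dumps(memory, indent=2))

def recall(topic: str) -> list[str]:
    """Fetch saved corrections to prepend to the agent's context for a topic."""
    if not MEMORY_FILE.exists():
        return []
    return json.loads(MEMORY_FILE.read_text()).get(topic, [])

# The experiment-gate case from the article, paraphrased as a stored note:
save_correction(
    "analytics_experiments",
    "Filter experiment rows by matching the exact string defined in the experiment gate.",
)
```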

Layer 6: Runtime context

When context doesn’t exist or is stale, the agent queries the warehouse live to inspect schemas.
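A minimal sketch of that fallback, written against a generic DB-API cursor; the placeholder style varies by driver, and some warehouses expose richer catalogs than information_schema:

```python
def describe_table(cursor, table_name: str) -> list[tuple[str, str]]:
    """Inspect the live warehouse when cataloged context is missing or stale."""
    cursor.execute(
        """
        SELECT column_name, data_type
        FROM information_schema.columns
        WHERE table_name = %s
        ORDER BY ordinal_position
        """,
        (table_name,),
    )
    return cursor.fetchall()

# Usage, assuming an open DB-API connection to the warehouse:
# with connection.cursor() as cur:
#     print(describe_table(cur, "customer_feedback"))
```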

The counterintuitive lessons

“Less is more.” OpenAI exposed their full tool set and hit problems. Overlapping functionality confuses agents. Multiple ways to accomplish tasks—helpful for humans, harmful for AI. Fix: consolidate and simplify.

“Guide the goal, not the path.” Prescriptive prompting degraded results. Rigid instructions pushed agents down wrong paths. Better: describe outcomes, let GPT-5 reason about execution.

Most organizations do the opposite—elaborate prompt libraries, detailed step-by-step instructions. OpenAI found this approach actually degrades performance.


The heterogeneity problem

A single customer renewal decision requires context from six systems:

  • PagerDuty: Incident history (operational)
  • Zendesk: Escalation threads (operational)
  • Slack: VP approvals (operational)
  • Confluence: SOPs, runbooks (operational)
  • Salesforce: Deal records (analytical)
  • Snowflake: Usage, health metrics (analytical)

Different teams own each system. Each defines “customer” differently. Context scatters across tools that don’t communicate.

OpenAI built six context layers within one environment. Enterprises need the same context infrastructure across six or more disconnected systems.

The 70,000 dataset paradox

OpenAI struggles with scale. From their article, an internal user: “We have a lot of tables that are fairly similar, and I spend tons of time trying to figure out how they’re different and which to use. Some include logged-out users, some don’t.”

70,000 datasets in one platform, and disambiguation problems persist. Centralization trades “where is it?” for “which one is it?”

If OpenAI hits this with everything unified, imagine when data spans Snowflake, Databricks, BigQuery, S3, plus SaaS tools—each with different naming conventions.


The three types of context enterprises need

There are three foundational types of context your enterprise needs to build.

Context Type 1: Data context

Maps to OpenAI’s Table Usage and Runtime Context. Schemas, lineage, relationships, query patterns, live validation. What tables exist? How do they connect? What’s the grain? Who owns it?

Modern data catalogs with automated lineage provide this. Critical: make it accessible through APIs, not just human-readable pages.

Context Type 2: Analytical context

Maps to Human Annotations and Codex Enrichment. Business glossaries, metric definitions, semantic relationships, transformation logic.

Context graphs connect semantic meaning (“what does ‘customer’ mean?”) with operational lineage (“how does customer_id flow?”).

Unlike Codex parsing pipeline code, context graphs make business logic explicitly queryable. An agent doesn’t parse dbt models to learn “active customer” means subscription_status = ‘active’ AND last_activity > 30 days. That relationship exists in the graph.
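A minimal sketch of what “explicitly queryable” can mean in practice, with an invented in-memory structure standing in for a real context graph:

```python
# Invented structure standing in for a governed context graph entry.
CONTEXT_GRAPH = {
    "active customer": {
        "definition": "Customer with a live subscription and activity in the last 30 days",
        "sql_filter": "subscription_status = 'active' AND last_activity > CURRENT_DATE - 30",
        "owner": "lifecycle-analytics",
        "source_tables": ["customers"],
    }
}

def resolve_term(term: str) -> str:
    """Return the canonical SQL filter for a business term, if one is governed."""
    entry = CONTEXT_GRAPH.get(term.lower())
    if entry is None:
        raise KeyError(f"No governed definition for {term!r}; ask rather than guess.")
    return entry["sql_filter"]

# resolve_term("active customer")
# -> "subscription_status = 'active' AND last_activity > CURRENT_DATE - 30"
```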

Context Type 3: Operational context

Maps to Institutional Knowledge and Memory. Decision traces, approval history, edge cases, “except when…” rules.

Hardest layer—undocumented by nature. OpenAI captures reactively through corrections. Enterprises need feedback loops plus proactive structured decision logging.

From guessing to understanding: What deep context delivers

Working with Workday on real test cases, deep context engineering delivered a 5x accuracy improvement.

Not 5% better. Five times better.

Without deep context:
The AI agent answered “How many enterprise customers renewed last quarter?” by counting every account with renewal_date in range. Wrong—missed that enterprise means contract_value > $100K AND employee_count > 500, and “renewed” excludes trials converting to paid.

With deep context:
Same question, agent applies correct filters because context graphs encode: enterprise = contract tier + employee threshold, renewed = specific status codes excluding trial conversions, last quarter = fiscal calendar (not calendar year).
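Side by side, with a hypothetical schema and illustrative fiscal-quarter dates; the thresholds come from the scenario above, not from Workday's actual data:

```python
# Both queries answer "How many enterprise customers renewed last quarter?"

# Without deep context: counts every account with a renewal date in range.
naive_sql = """
SELECT COUNT(*)
FROM accounts
WHERE renewal_date BETWEEN '2025-10-01' AND '2025-12-31'
"""

# With deep context: applies the encoded definitions of "enterprise",
# "renewed", and the fiscal (not calendar) quarter boundaries.
grounded_sql = """
SELECT COUNT(*)
FROM accounts
WHERE contract_value > 100000
  AND employee_count > 500
  AND renewal_status IN ('renewed', 'expanded')  -- excludes trial conversions
  AND renewal_date BETWEEN '2025-11-01' AND '2026-01-31'  -- fiscal quarter
"""
```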

The difference: being understood, not just heard. Deep context means agents stop guessing and start reasoning. They distinguish similar-looking tables, apply correct filters, understand organizational meaning beyond column names.


Why evaluation systems matter as much as agents

OpenAI treats their agent like production code. Most enterprises treat AI agents like experiments.

OpenAI built curated question-answer pairs. Each question targets important metrics. Each includes manually authored “golden” SQL producing expected results.

Every evaluation:

  1. Send question to agent
  2. Execute generated SQL
  3. Compare against expected result
  4. Feed both to the Evals API for scoring

From the article: “Evals are like unit tests that run continuously during development to identify regressions as canaries in production.”
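A minimal sketch of that four-step loop, with the agent and warehouse left as hypothetical callables and plain row comparison standing in for the Evals API scoring the article describes:

```python
# Hypothetical golden set; each entry pairs a real question with hand-written SQL.
GOLDEN_SET = [
    {
        "question": "How many distinct users were active last week?",
        "golden_sql": "SELECT COUNT(DISTINCT user_id) FROM weekly_active_users",
    },
]

def rows_match(actual, expected) -> bool:
    """Order-insensitive comparison; a real harness also handles types and rounding."""
    return sorted(map(tuple, actual)) == sorted(map(tuple, expected))

def evaluate(run_agent, run_sql) -> float:
    passed = 0
    for case in GOLDEN_SET:
        generated_sql = run_agent(case["question"])   # 1. send question to agent
        actual = run_sql(generated_sql)               # 2. execute generated SQL
        expected = run_sql(case["golden_sql"])        # 3. compute expected result
        passed += rows_match(actual, expected)        # 4. score the comparison
    return passed / len(GOLDEN_SET)
```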

Why agents fail without evaluation

Deploying without evaluation: code without tests. Quality drifts invisibly. Prompt changes break capabilities. Model updates introduce failures. Nobody notices until users complain about conflicting numbers.

The investment is real: curate Q&A pairs, write expected SQL, and build comparison logic that tolerates syntactic differences while still catching semantic errors.

It’s the difference between agents that improve versus degrade.

What to measure

Mapping accuracy: Right fields and relationships? “Customer satisfaction” → CSAT table or NPS table or retention score?

Coverage: Percentage of questions handled confidently? Does it know when to say “I lack context for accurate answers”?

Consistency: Same answer to same question? Or variation based on which tables context retrieval found first?

Confidence calibration: When confident, is it right? When hedging, genuinely unsure?

OpenAI’s threshold from the article: 80% accuracy, 70% consistency before deployment. Anything less means shipping agents that confidently deliver wrong answers.
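Consistency is the easiest of these to operationalize: ask the same question repeatedly and measure how often the agent lands on the same answer. A minimal sketch, with the agent-plus-execution step left as a hypothetical callable:

```python
from collections import Counter

def consistency(ask_and_execute, question: str, trials: int = 5) -> float:
    """Fraction of repeated runs that return the modal answer to one question."""
    answers = [ask_and_execute(question) for _ in range(trials)]
    modal_count = Counter(map(str, answers)).most_common(1)[0][1]
    return modal_count / trials

# A score near 1.0 means answers don't depend on which tables context retrieval
# surfaced first; the bar OpenAI cites is 0.7 before deployment.
```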


What this means for data teams

OpenAI’s success proves AI can transform data interaction. But the article reveals what matters more: what had to exist before AI could work.

Six context layers. Continuous evaluation. Self-improving memory. Infrastructure making organizational knowledge machine-readable.

The bottleneck isn’t the model. It’s the context layer underneath.

This changes the conversation. Not “which AI agent?” but “do we have context infrastructure to make any agent work reliably?”

Most don’t. They have documentation six months stale, business definitions varying by department, tribal knowledge locked in heads, data scattered across non-communicating systems, no systematic testing of AI correctness.

Building context infrastructure means solving these first. Making metadata active, not passive. Capturing business logic in graphs, not Confluence. Treating context as product, not byproduct.

The gap between proof-of-concept and production: layers of context.

That infrastructure requires investment—tools, yes, but more importantly how data teams work. Context engineering becomes core competency. Evaluation systems become mandatory. Memory and feedback loops become standard.

Alternative: join the 95% of AI pilots that fail. Not because models aren’t good enough. Not because teams aren’t smart enough. But because context infrastructure doesn’t exist to make agents reliable.

OpenAI’s article isn’t a blueprint for their exact system. It’s proof that context infrastructure matters more than model choice. Validation that data team work—documenting metadata, building glossaries, capturing lineage—isn’t just governance. It’s the foundation making AI possible.


FAQs about OpenAI’s data agent and enterprise AI analytics

1. How is OpenAI’s internal data agent different from ChatGPT?

OpenAI’s internal data agent is purpose-built for analyzing company data using natural language. Unlike ChatGPT providing general knowledge, the data agent is grounded in OpenAI’s specific tables, metrics, business logic, and institutional knowledge through six context layers. It serves 3,500+ employees across 600 petabytes of proprietary data.

2. What are the six layers of context in OpenAI’s data agent?

The six layers: (1) Table usage patterns from historical queries, (2) Human annotations capturing business meaning, (3) Codex enrichment extracting meaning from pipeline code, (4) Institutional knowledge from Slack/Docs/Notion, (5) Memory systems learning from corrections, (6) Runtime context via live warehouse queries. These ground the agent in organizational reality, not just schemas.

3. Why do 95% of AI data agents fail in production?

Most fail because of the context gap—space between what AI knows and what humans know but haven’t documented. AI lacks tribal knowledge, edge cases, business definitions, judgment calls. Without context infrastructure capturing and delivering this, agents produce confidently wrong answers. OpenAI engineered six context layers before deploying. Most enterprises skip foundational work.

4. What is context engineering for AI analysts?

Context engineering builds infrastructure connecting data definitions, metrics, and business rules to AI in machine-readable formats. Rather than documenting data, it makes organizational knowledge accessible through semantic layers, metadata graphs, and automated enrichment. Ensures AI agents understand “customer” or “revenue” identically to humans—with nuances, edge cases, exceptions.

5. What are the three types of context enterprises need for AI agents?

Enterprises need three context types: (1) Data context—schemas, lineage, relationships, query patterns, live validation; (2) Analytical context—business glossaries, metric definitions, semantic relationships, transformation logic; (3) Operational context—decision traces, approval history, edge cases, and exception rules. These map to OpenAI’s six layers but are architected for distributed enterprise environments.


Atlan is the next-generation platform for data and AI governance. It is a control plane that stitches together a business's disparate data infrastructure, cataloging and enriching data with business context and security.

