Eighty to ninety percent of enterprise data is unstructured. Despite that, only 26% of data leaders say their organizations can confidently use it for business value.
That gap isn’t an oversight. It reflects how hard it is to govern content that doesn’t fit the patterns data teams have spent years mastering.
The problem emerged when enterprises started deploying AI agents. Before agents, unstructured data was a second-tier concern — it existed and had value, but it didn’t flow through the pipelines powering dashboards, models, or business decisions.
Agents changed that entirely. They pull context from Confluence docs, policy PDFs, and SLA definitions that were never designed to be source-of-truth for automated decisions. When that content is outdated, unowned, or simply wrong, agents don’t fail loudly. They produce plausible-sounding answers built on stale context. And downstream, business decisions are made on bad information, with no audit trail pointing to why.
Take a financial services firm that deploys an AI agent to assist underwriters. The agent surfaces policy guidelines, interprets risk thresholds, and cross-references compliance requirements — all from documents stored in Confluence and SharePoint. Six months in, a compliance audit flags a pattern of recommendations that contradict the firm’s current risk framework. The investigation traces the problem back to a policy document that was superseded eight months ago. The updated version existed, but the agent was reading the old one. No pipeline broke, no alert fired — the underwriters simply didn’t know to question it.
This is what unstructured data failure looks like in an AI-powered enterprise: a believable answer built on stale context, confidently delivered, and acted upon downstream. It’s not a storage problem or a cataloging problem. It’s an AI context problem.
Why the structured playbook breaks at the file layer
The instinct when confronted with the unstructured governance gap is to apply the same structured data approach: catalog, classify, assign owners, track lineage.
But a typical enterprise has tens of thousands of database tables and tens of millions of files. Applying structured governance patterns to this volume is not a harder version of the same problem. It’s a categorically different one.
Unstructured data is growing at a rate of 40-60% per year, according to Gartner analyst Melody Chien. Over the past 12 months, Chien reports that Gartner has seen a 150% increase in inquiries about unstructured data management. The reason? GenAI deployments are failing, and leaders are pointing to a lack of AI-ready data.
Gartner has mapped five vendor categories addressing parts of this problem:
- Traditional data management vendors
- Specialized emerging vendors
- Callable AI functions from AI vendors
- Document/content management
- Data storage services vendors
However, none of these connects unstructured data governance directly to AI agent lineage – yet. The pieces exist, but the integration doesn’t.
Unstructured data is a cross-functional problem
The vendor landscape is clearly fragmented. Each category serves a different buyer with a different mandate. Security teams reach for data storage services vendors to track sensitive file exposure. Data governance teams use traditional data management platforms to catalog and classify. AI engineers rely on callable AI functions from cloud providers to chunk, embed, and vectorize. Content and records teams have their own category entirely, focused on document lifecycle and retention.
The result is four or five tool stacks with overlapping scope and no shared visibility. A file can be classified, cataloged, and embedded, and still feed incorrect context to an AI agent because no single system tracks its lineage end to end.
Gartner identifies four distinct capability layers required to close this gap: unstructured data integration, data quality, governance, and metadata management. Each is a prerequisite for the next. Teams that skip directly to governance without solving integration and quality first end up governing assets they can’t reliably ingest or measure.
Instead of trying to solve all four layers at once, the organizations making progress are starting where the AI risk is highest and working backward.
Why unstructured data is suddenly the most critical asset in your data stack
AI agents don’t just query tables and columns. They read documents. They parse policy PDFs to understand business rules. They pull from SLA definitions stored in Confluence. They surface context from abbreviation glossaries, data dictionaries, and tribal knowledge buried in files that predate the current data team.
When that context is outdated or missing, agents hallucinate with confidence, apply stale rules, and contradict documented policy because they’re reading an old version. The failure mode isn’t a broken pipeline you can debug — it’s a plausible-sounding answer that’s subtly wrong.
At the 2026 Gartner Data & Analytics Summit, Melody Chien and fellow analyst Nina Showell underscored the growing prominence of unstructured data, covering its role in making data AI-ready and how to govern it.
And the real-world implications they shared are significant: Showell projected that through 2028, data leaders who attempt to build their own unstructured metadata solutions will incur costs more than 300% higher than if they used existing documentation, records solutions, skills, and practices.
The message is clear for data leaders: those who can’t wrangle and feed unstructured data to “data-hungry” AI models will quickly fall behind on executing their AI strategies.
Not all unstructured data is created equal – here are the files that matter
A bit of good news: most unstructured data will never interact with your AI systems in any meaningful way. That’s helpful for prioritization. The files upstream of your agents are the ones that matter most for AI reliability.
This is the shift from blanket governance to lineage-driven governance. Instead of asking “how do we govern all our files?”, the right question is “which files are feeding which agents, and do we know what’s in them?”
When you look at it this way, the problem changes shape. You stop trying to govern ten million objects and start tracing which objects sit upstream of your organization’s decision-making intelligence. That set is much smaller, and while it’s still complex, it’s tractable.
Two specific patterns are worth naming:
1. Files feeding AI agents directly. These are the documents, PDFs, and notes pulled into agent context windows or RAG pipelines — an SLA document, a data contract, a business rules definition. They are, functionally, part of your data pipeline even though they don’t live in your data warehouse. They need version control, ownership, and freshness tracking, just like any upstream data asset.
2. Object store volumes and folder lineage. Some organizations don’t think at the file level at all — they think at the bucket or folder level. “Everything in this S3 prefix feeds our agent” is a real, common governance need. The coarser unit of analysis is not a workaround. For many use cases, it’s the right level of abstraction.
Both patterns require the same fundamental shift: governance needs to follow AI agent lineage, not blanket file coverage.
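Both patterns can live in the same lightweight lineage registry. The sketch below is a minimal illustration, not any vendor’s API: the asset paths, owners, and review windows are hypothetical, and a real registry would sit behind a catalog or governance tool.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class UpstreamAsset:
    """One upstream unstructured asset an agent depends on.

    `path` can be a single document or a whole object-store prefix --
    both patterns from the article map onto the same record shape.
    All field names and values here are illustrative.
    """
    agent: str             # agent that consumes this asset
    path: str              # file path, Confluence page, or S3 prefix
    owner: str             # accountable person or team
    last_reviewed: date    # when a human last confirmed it is current
    review_days: int = 90  # freshness standard for this asset

def stale_assets(registry: list[UpstreamAsset], today: date) -> list[UpstreamAsset]:
    """Return assets whose review window has lapsed -- live risks to agents."""
    return [a for a in registry if (today - a.last_reviewed).days > a.review_days]

# Hypothetical registry entries for an underwriting agent.
registry = [
    UpstreamAsset("underwriting-agent", "s3://policies/risk-framework/",
                  "risk-team", date(2025, 1, 10)),
    UpstreamAsset("underwriting-agent", "confluence/SLA-definitions",
                  "data-gov", date(2025, 6, 1)),
]

overdue = stale_assets(registry, today=date(2025, 7, 1))
```

Running this flags the risk-framework prefix (last reviewed 172 days ago, against a 90-day standard) while the recently reviewed SLA page passes — the kind of signal that would have surfaced the superseded policy document in the underwriting example above.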
Why the cloud providers aren’t going to solve this for you yet
It’s tempting to wait for the infrastructure layer to catch up. There are signals it’s coming — Google Cloud Storage is building a way to attach richer metadata to files dropped into GCS, with a private preview already running for images and expansion to other file types planned. All the major object stores will eventually move in this direction as buckets become the substrate for AI pipelines.
But storage-layer metadata is a long way from what data teams need today: tracing which unstructured assets are upstream of which AI decisions, knowing whether those assets are current and owned, and understanding whether a changed file has cascading implications for agent behavior.
Teams solving this problem now are doing it with programmatic, field-led approaches — mapping critical files to AI pipelines directly, building lightweight ownership models around specific folders and document repositories. It’s not elegant, but it’s the current state of the art. Anyone selling you a fully automated, enterprise-ready solution to unstructured data governance in a generative AI context is ahead of what the industry can actually deliver.
What good looks like now
Waiting for a complete platform solution is a mistake. The gap between structured data governance maturity and unstructured data governance maturity is already costing organizations reliability in their AI outputs. But there are ways to start closing that gap.
Gartner’s architecture framework views the data catalog and semantic knowledge graph as the connective tissue between unstructured source data and GenAI applications. The catalog feeds the metadata into the knowledge graph, which in turn enriches prompts. An enterprise context layer, which encodes business definitions, relationships, operational rules, lineage, and policies, closes context gaps so that AI answers reliably at enterprise scale.
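As a rough illustration of what that context layer does, the toy sketch below enriches a prompt with governed definitions looked up from a catalog before the prompt reaches a model. The catalog structure, field names, and `enrich_prompt` helper are assumptions for illustration, not any specific product’s API.

```python
# Toy enterprise context layer: look up catalog metadata for terms the
# question mentions and prepend governed definitions. Everything here --
# the catalog entries, fields, and metric name -- is hypothetical.
CATALOG = {
    "churn_rate": {
        "definition": "Share of customers lost in a rolling 30-day window.",
        "owner": "growth-analytics",
        "source": "s3://metrics/churn/",
        "last_reviewed": "2025-05-01",
    },
}

def enrich_prompt(question: str, catalog: dict) -> str:
    """Prepend governed business context for any catalog terms the question uses."""
    context = [
        f"- {term}: {meta['definition']} "
        f"(owner: {meta['owner']}, reviewed: {meta['last_reviewed']})"
        for term, meta in catalog.items()
        if term in question
    ]
    header = "Business context:\n" + "\n".join(context) + "\n\n" if context else ""
    return header + question

prompt = enrich_prompt("What drove churn_rate up last quarter?", CATALOG)
```

The point of the sketch: the model never has to guess what “churn_rate” means, who owns it, or how fresh the definition is — the context layer injects that from governed metadata instead of leaving it to retrieval luck.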
The best current solution is to invest in complementary platforms. For instance, a data management provider paired with a data storage services provider enables AI pipeline coverage from orchestration through governance.
That’s what Atlan and BigID deliver with their joint solution, which combines structured and unstructured data discovery, classification, and lineage in a single control plane. This allows data and governance teams to see the same trusted, governed, and AI-ready data, so they can get AI initiatives into production faster.
Ultimately, the goal isn’t to govern everything. It’s to give security and governance teams the same visibility over the files feeding your AI agents as they already have over your data warehouse.
Here’s the sequence working in practice:
Start with your AI agents, not your files. Map your deployed or in-progress AI agents. For each one, trace the context they consume — documents, policies, definitions. This is your critical unstructured asset inventory: not a full catalog, but a directed graph of what’s actually upstream of intelligence.
Assign ownership at the document or folder level. The files that feed your agents need owners, freshness standards, and review cycles. The rest can wait.
Track changes, not just content. A document that was accurate six months ago and has since been updated — or deleted — is a live risk to your AI outputs. Governance here means knowing when upstream files change and whether that change matters to the agents depending on them.
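One way to make the change-tracking step concrete is a hash-based snapshot: record a content hash for each agent-upstream file, then diff snapshots to find which agents are affected. The sketch below assumes a content hash is a good-enough proxy for meaningful change; the file names and agent mapping are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path

def snapshot(paths):
    """Record a SHA-256 content hash per file; a missing file maps to None."""
    digests = {}
    for p in map(Path, paths):
        digests[str(p)] = (
            hashlib.sha256(p.read_bytes()).hexdigest() if p.exists() else None
        )
    return digests

def changed_agents(old, new, upstream_map):
    """Agents whose upstream files changed or disappeared since `old` was taken."""
    affected = set()
    for path, digest in new.items():
        if old.get(path) != digest:
            affected.update(upstream_map.get(path, []))
    return affected

# Demo with a throwaway file standing in for a policy document.
with tempfile.TemporaryDirectory() as d:
    doc = Path(d) / "risk-policy.md"
    doc.write_text("v1: exposure threshold is 0.8")
    before = snapshot([doc])
    doc.write_text("v2: exposure threshold is 0.6")  # policy superseded
    after = snapshot([doc])
    affected = changed_agents(before, after, {str(doc): ["underwriting-agent"]})
```

A diff like this doesn’t tell you *whether* the change matters — that still needs an owner’s judgment — but it guarantees no agent silently keeps reading a superseded document, which is exactly the failure in the underwriting scenario.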
In her presentation, Nina Showell suggested aligning AI capabilities to use cases, starting with quick wins, then focusing on GenAI, and finally taking on agentic AI.

AI capability sequencing: classic AI/ML quick wins, domain-specific GenAI functions, and agentic AI. Source: Nina Showell, Unstructured Data Management Is the Missing Ingredient to Prepare GenAI-Ready Data, 2026
For a financial services and insurance firm, this might look like addressing customer segmentation and fraud detection in phase one, then focusing on building a sales chatbot and document summarizer in phase two, and finally automating customer service tasks in phase three.
The bottom line is that the teams making the most progress aren’t waiting. They’re starting where the risk is highest: the files their agents are actually reading.
The window is shorter than it looks
Most enterprise data teams feel like they have time to figure this out. AI agents are still in pilot. The governance problems are theoretical, not yet producing visible failures in production.
But that window is closing. As agents move from pilot to production, as they make more consequential decisions, and as more of the enterprise context layer runs through them, the unstructured data they depend on becomes critical infrastructure. Without governance, that critical infrastructure lacks integrity checks.
From 2025 through 2029, the share of AI spending on data readiness will increase 7x, according to Nina Showell. That investment will compound for the organizations that start now. For those waiting, the gap will be harder to close with each passing quarter.
The organizations that establish AI agent lineage now will have a more reliable, more auditable, and more trustworthy AI stack — not because they cataloged everything, but because they governed the right things.