Quick facts: Unstructured data for AI
| Aspect | Detail |
|---|---|
| Share of enterprise data | 80 to 90% (Gartner, IDC, 2025) |
| Annual growth rate | 40 to 60% (Gartner analyst Melody Chien, 2026) |
| Tech spend allocation | 40% goes to unstructured vs. 60% to structured (IDC, 2023) |
| AI pilot failure rate from data readiness | Up to 60% (Gartner, 2026) |
| AI data-readiness spend, 2025 to 2029 | 7x increase projected (Gartner analyst Nina Showell, 2026) |
| Most common production failure mode | Stale or superseded document consumed by agent without a freshness flag |
Unstructured data for AI is any content an AI agent retrieves at inference time: PDFs, policy wikis, Slack threads, contracts, meeting notes. It represents 80 to 90% of enterprise data but receives only 40% of tech investment. Making it usable is not a parsing problem - teams building document parsing pipelines can extract clean text from most file types. Making it usable is a governance problem at the file level.
Why does the data with the most AI value get the least investment?
The most counterintuitive number in your enterprise AI budget is the spend allocation. IDC research found that 40% of tech spend goes to unstructured data while 60% goes to structured, even though unstructured content represents roughly 90% of what enterprises actually hold. The smaller pile gets the larger budget.
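
As a rough back-of-the-envelope check on that imbalance, the sketch below compares spend per unit of data using only the figures quoted above (40% and 60% of spend, 80 to 90% unstructured share). The function name `spend_ratio` is illustrative and not taken from any cited source.

```python
def spend_ratio(unstructured_share: float,
                unstructured_spend: float = 40.0,
                structured_spend: float = 60.0) -> float:
    """How many times more spend structured data receives per unit of data held.

    All arguments are percentages. The defaults are the IDC spend split quoted above.
    """
    structured_share = 100.0 - unstructured_share
    per_unit_structured = structured_spend / structured_share
    per_unit_unstructured = unstructured_spend / unstructured_share
    return per_unit_structured / per_unit_unstructured


# Evaluate both ends of the 80 to 90% range quoted in this article.
for share in (80, 90):
    print(f"{share}% unstructured: structured gets {spend_ratio(share):.1f}x the spend per unit")
```

Under these assumptions, structured data receives roughly 6 to 13.5 times the spend per unit of data held.
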
That math worked when unstructured data was a second-tier concern. PDFs and wiki pages did not flow through the pipelines powering dashboards or business decisions. AI agents changed that. Agents pull context directly from policy PDFs, SLA definitions, Confluence pages, and Slack threads at inference time. That content is no longer reference material. It is production input.
The volume framing has to reset. Indexing more documents into a vector store does not produce a better agent. It produces a wider surface area for the same governance failures. Teams building reliable AI data pipelines are learning that unstructured file coverage without governance is not a step forward - it is a liability that compounds with every additional document indexed.
Why does the structured-data playbook fail at the file layer?
When faced with the unstructured data governance gap, the first instinct is to apply the structured-data playbook: catalog everything, classify it, assign owners, track lineage. The problem is not that this instinct is wrong in principle. It is that unstructured data does not hold still long enough for blanket classification to work.
As Howard Friedman observed in InfoWorld, the rules governing how to treat different types of content change over time based on factors like content type, sensitivity, jurisdiction, and the specific use case. A contract may require one set of rules when signed, a second set when reviewed by legal during a dispute, and a third set when an agent is using natural language processing to analyze its clauses. Each rule change updates what permissions agents have to access the file.
Volume compounds the problem. A data warehouse has thousands of tables. An enterprise document estate has tens of millions of files across SharePoint, Confluence, email archives, and cloud storage. Blanket classification at that scale is not a harder version of the structured-data problem. It is a categorically different one - and it is why data quality approaches for AI built for structured data need a separate layer when applied to documents and wikis.
What does ‘usable for AI’ actually mean?
Usable for AI is not the same as parseable, chunked, embedded, or indexed. You can run hundreds of files through an advanced RAG pipeline and still leave them unusable to your agent. The distinction comes down to metadata.
Every file in your estate is either content or infrastructure, depending on whether it carries the five attributes an agent needs to trust it:
- Owner. Someone is responsible for ensuring the information in the file is accurate and current.
- Freshness SLA. How often should the file be reviewed? How long before it needs updating? What happens to previous versions?
- Sensitivity classification. Who can access the file - and which agents - based on governed access policy?
- Downstream consumers. Which agents and systems consume this file in their retrieval workflow?
- Canonical concepts. Pointers to definitions of terms used within the document, drawn from the governed business glossary.
A file with all five attributes becomes AI-usable context: a source the agent can interpret, rely on, and trace.
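
One way to picture the five attributes is as a small record attached to each file. The sketch below is a minimal illustration, not a schema from any particular catalog: the class name `FileContext`, its field names, and the sample values are all assumptions. It requires Python 3.9 or later.

```python
from dataclasses import dataclass, field
from datetime import timedelta
from typing import Optional


@dataclass
class FileContext:
    """Hypothetical governance record for one file (illustrative field names)."""
    path: str
    owner: Optional[str] = None                   # accountable person or team
    review_interval: Optional[timedelta] = None   # freshness SLA
    sensitivity: Optional[str] = None             # e.g. "public", "internal", "restricted"
    downstream_consumers: list[str] = field(default_factory=list)  # agent or system IDs
    glossary_terms: list[str] = field(default_factory=list)        # canonical concept IDs

    def missing_attributes(self) -> list[str]:
        """List which of the five attributes are unset or empty."""
        checks = {
            "owner": self.owner,
            "freshness SLA": self.review_interval,
            "sensitivity classification": self.sensitivity,
            "downstream consumers": self.downstream_consumers,
            "canonical concepts": self.glossary_terms,
        }
        return [name for name, value in checks.items() if not value]

    def is_ai_usable(self) -> bool:
        """True only when all five attributes are present."""
        return not self.missing_attributes()


policy = FileContext(
    path="policies/refund-policy.pdf",
    owner="finance-ops",
    review_interval=timedelta(days=90),
    sensitivity="internal",
    downstream_consumers=["support-agent"],
)
print(policy.is_ai_usable())        # False
print(policy.missing_attributes())  # ['canonical concepts']
```

Reporting which attributes are missing, rather than a bare pass or fail, gives the file owner a concrete remediation list.
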
Why does naive RAG break in production?
Naive RAG solves a specific and real problem: giving the agent a relevant document at the right moment. It does not address the governance layer that sits above it.
A retriever built on embeddings ranks by semantic similarity, not authority. When ten versions of a policy exist across a document store, the retriever returns whichever version is most similar to the query, not whichever is canonical. This creates three production failure patterns: stale retrieval (the agent cites a document superseded months ago), conflicting versions (multiple policy drafts return simultaneously), and permission violation (the agent surfaces restricted content to a user without authorization to view it).
IBM’s AI at the Core 2025 research found only 26% of organizations have more than moderate coverage of AI risks in governance frameworks. Most teams do not know which documents their agents are actually accessing. The agent produces a response. The response sounds authoritative. The document it drew from expired in 2021. This is the most common form of context poisoning in enterprise RAG deployments - and a primary driver of LLM knowledge base staleness that accumulates silently across large document estates.
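
The gap between similarity and authority can be sketched as a governance filter applied to retriever output before anything reaches the agent. This is a minimal illustration under stated assumptions: the `Hit` structure, the `is_canonical` flag, the role names, and the document IDs are hypothetical, and a real deployment would pull these values from its own metadata store. It requires Python 3.9 or later.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """Hypothetical retriever result enriched with file-level governance metadata."""
    doc_id: str
    similarity: float              # score from the vector search, higher is closer
    is_canonical: bool             # flagged as the authoritative version
    allowed_roles: frozenset[str]  # roles permitted to view the file


def govern(hits: list[Hit], user_role: str) -> list[Hit]:
    """Drop non-canonical or unauthorized hits, then rank the rest by similarity."""
    kept = [h for h in hits if h.is_canonical and user_role in h.allowed_roles]
    return sorted(kept, key=lambda h: h.similarity, reverse=True)


hits = [
    Hit("refund-policy-v1-draft", 0.93, False, frozenset({"support", "finance"})),
    Hit("refund-policy-v3", 0.90, True, frozenset({"support", "finance"})),
    Hit("executive-comp-policy", 0.88, True, frozenset({"hr"})),
]
print([h.doc_id for h in govern(hits, user_role="support")])  # ['refund-policy-v3']
```

In this example the draft has the highest similarity score, so a similarity-only retriever would rank it first. The filter removes it because it is not canonical, and removes the compensation policy because the support role is not authorized to see it.
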
What are the five governance failures that make unstructured data unusable?
When agents fail on unstructured retrieval in production, the cause maps to one of five governance failures. None is fixed at the model layer or the retrieval layer.
| Failure mode | What happens in production | What governance resolves it |
|---|---|---|
| Ungoverned ingestion | Every page indexed regardless of authority or accuracy | Owner and certification status at file level |
| Stale retrieval | Agent cites a document superseded months or years ago | Freshness SLA, version control, supersession tracking |
| Permission violation | Agent surfaces restricted content to unauthorized user | Access policy attached to file, enforced at retrieval |
| Conflicting versions | Multiple drafts of the same policy return from different stores | One canonical version flagged across all stores |
| No lineage to agent | Cannot trace which document caused a wrong answer | File-to-agent lineage graph with decision traces |
Which governance failure causes the most production damage at scale?
Stale retrieval, by a significant margin. It generates a plausible-sounding answer that no system flags as wrong. Embedding models do not verify document age. Vector stores do not enforce version rules. LLM knowledge base staleness accumulates silently across large estates. A freshness flag at the file level - a simple SLA that marks a document as requiring review after 90 days and surfaces a staleness warning at retrieval - delivers the highest return of any unstructured RAG governance investment. Teams building decision traces for AI agents consistently identify stale document retrieval as the top root cause when tracing hallucinated answers back to their source.
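
A freshness flag of the kind described above can be sketched in a few lines. The 90-day interval comes from the example in the text. The function name, the warning format, and the sample document are assumptions made for illustration.

```python
from datetime import date, timedelta
from typing import Optional

REVIEW_INTERVAL = timedelta(days=90)  # the 90-day review SLA used as the example above


def staleness_warning(doc_id: str, last_reviewed: date, today: date,
                      interval: timedelta = REVIEW_INTERVAL) -> Optional[str]:
    """Return a warning if the document is past its review date, else None."""
    overdue = today - (last_reviewed + interval)
    if overdue > timedelta(0):
        return f"{doc_id}: review overdue by {overdue.days} days"
    return None


# Hypothetical document last reviewed on 1 January, retrieved on 1 June.
print(staleness_warning("travel-policy", date(2025, 1, 1), today=date(2025, 6, 1)))
# travel-policy: review overdue by 61 days
```

Passing `today` explicitly rather than reading the system clock keeps the check deterministic, which makes it straightforward to test and to replay when tracing a past answer.
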
How do you make this much content usable for AI tractably?
The practical starting point is not blanket coverage. It is to identify which documents actually feed deployed or in-progress agents, connect those documents to the definitions and systems around them, and prioritize that context first. The set of files an agent actually consumes is the unstructured context that matters most.
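
As a minimal sketch of that prioritization, assume each retrieval is logged as an `(agent_id, file_path)` pair. The log format, the agent names, and the file paths below are hypothetical. Grouping the log by file shows which documents feed which agents, and ranking by the number of distinct consumers suggests where to govern first. It requires Python 3.9 or later.

```python
from collections import defaultdict


def files_by_agent_reach(retrieval_log: list[tuple[str, str]]) -> list[tuple[str, list[str]]]:
    """Group (agent_id, file_path) events by file, most widely consumed files first."""
    consumers: dict[str, set[str]] = defaultdict(set)
    for agent_id, file_path in retrieval_log:
        consumers[file_path].add(agent_id)
    # Sort by number of distinct agents (descending), then by path for a stable order.
    ranked = sorted(consumers.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    return [(path, sorted(agents)) for path, agents in ranked]


log = [
    ("support-agent", "policies/refunds.pdf"),
    ("finance-agent", "policies/refunds.pdf"),
    ("finance-agent", "policies/revenue-recognition.pdf"),
    ("support-agent", "policies/refunds.pdf"),
]
for path, agents in files_by_agent_reach(log):
    print(path, agents)
# policies/refunds.pdf ['finance-agent', 'support-agent']
# policies/revenue-recognition.pdf ['finance-agent']
```

Files that never appear in the log are absent from the result, which matches the idea that they can wait until they enter an agent's context.
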
Why can’t structured and unstructured data live in separate systems for AI?
A finance agent answering a question about quarterly revenue may need one source for the table value and another for the policy that defines how revenue is recorded. When those sources are disconnected, the agent has to bridge that gap at runtime. A unified context layer is what lets the agent move from question to canonical table to canonical policy in one reasoning path, instead of stitching those answers together from separate systems.
That also reduces the consistency problem across agents, because multiple agents can resolve through the same shared context instead of each building its own partial view.
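
The quarterly revenue example can be sketched as a lookup over a tiny shared graph. Everything here is hypothetical: the node IDs, the `measured_by` and `defined_by` edge names, and the `resolve` helper are illustrative and do not describe any product's API. It requires Python 3.9 or later.

```python
from typing import Optional

# Hypothetical shared context graph: node -> {relationship: target node}.
CONTEXT_GRAPH: dict[str, dict[str, str]] = {
    "term:quarterly_revenue": {
        "measured_by": "table:finance.revenue_quarterly",
        "defined_by": "doc:revenue-recognition-policy.pdf",
    },
}


def resolve(term: str,
            graph: dict[str, dict[str, str]] = CONTEXT_GRAPH) -> dict[str, Optional[str]]:
    """Return the canonical table and canonical policy linked to a glossary term."""
    edges = graph.get(term, {})
    return {"table": edges.get("measured_by"), "policy": edges.get("defined_by")}


print(resolve("term:quarterly_revenue"))
# {'table': 'table:finance.revenue_quarterly', 'policy': 'doc:revenue-recognition-policy.pdf'}
```

Because every agent resolves through the same graph, two agents asking about the same term reach the same table and the same policy.
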
How Atlan approaches unstructured data for AI
The problem with unstructured data is not that enterprises lack documents. It is that agents retrieve those documents without enough context to know which file is current, which version matters, how a policy connects to the metric it governs, and whether that content should shape the answer at all.
Atlan addresses that by acting as the context layer across structured and unstructured sources. A warehouse table for revenue and the policy PDF that explains how revenue is recorded should not live in separate reasoning systems from the agent’s point of view. Atlan connects them into the same context graph so the agent can move from a number to the policy, definition, or related document that gives that number meaning.
That is also why lineage matters so much here. The practical starting point is not to catalog every file in the enterprise. It is to identify which files actually feed deployed agents, connect those files to the definitions and systems around them, and make that context available at runtime. Once you frame the problem that way, unstructured data becomes usable for AI not when every file is governed, but when the files that shape agent decisions carry the context the agent needs to interpret them correctly.
Why agent lineage, not blanket coverage, solves the unstructured data problem
Gartner analyst Nina Showell projects that AI spending on data readiness will increase 7x from 2025 to 2029, and that data leaders who build their own unstructured metadata solutions will incur costs more than 300% higher than those using existing platforms - a projection documented in depth in Atlan’s enterprise memory research.
The compounding logic is direct: every additional document indexed without governance increases agent surface area for error. Every governed file reduces it. The right question is not whether to govern unstructured data. It is which files are upstream of which agents, and whether anyone owns them.
Lineage-driven governance is the only tractable answer at enterprise volume. Teams that start now compound the advantage across the next four years of AI deployment. Teams that wait spend the same period explaining production failures their agents could have avoided. The pattern holds across context infrastructure for AI agents: the investment in governance at the file layer today is the investment that determines whether agent answers are trustworthy tomorrow.
FAQs about unstructured data for AI
1. What makes unstructured data different to govern compared to tables and warehouses?
Volume, the lack of a native schema, and dynamic governance rules. A warehouse table has a defined structure and a finite owner set. A document repository can hold tens of millions of files across SharePoint, Confluence, and email, each with different sensitivity, freshness, and authority requirements that change as content evolves. The structured-data playbook does not extend cleanly to that volume or that rate of change.
2. Why does adding more documents to a vector store reduce agent accuracy?
Vector retrieval encodes semantic similarity, not authority. When ten versions of a policy exist across a document store, the retriever returns whichever is most similar to the query, not whichever is canonical. Larger vector stores compound this problem. The fix is governance metadata at the file level - specifically a canonical flag and a freshness SLA - not more storage capacity or better embedding models.
3. What is AI agent lineage for unstructured data?
AI agent lineage traces which specific documents and files feed which deployed agents. Instead of trying to govern every file an enterprise holds, lineage-driven governance identifies the much smaller set of files actually consumed by agents and prioritizes ownership, freshness, and policy enforcement on that subset first. This makes the governance problem tractable at enterprise volume.
4. How is governing unstructured data different from data security posture management?
Data security posture management (DSPM) tools track sensitive-data exposure across cloud storage and flag files that violate security policy. Governance for AI tracks authority, freshness, and lineage to agents. A DSPM tool tells you a customer record sits in an unsecured bucket. Governance for AI tells you a 2021 policy is feeding a compliance agent that should be reading the 2024 version. Both layers matter, and they solve different problems.
5. Do I need to classify every document in my enterprise before building an AI agent?
No. Map your deployed or in-progress AI agents first. Trace the documents each one actually consumes through its retrieval layer. Govern that set. The remaining files can wait until they enter an agent context. Trying to classify everything before deploying anything stalls projects indefinitely and delivers governance coverage where it provides the least value.
6. Where does the EU AI Act apply to unstructured documents feeding AI agents?
Article 12 requires source traceability for high-risk AI systems, and the requirement does not stop at structured data. Agents reading policy PDFs, regulatory documents, or compliance guidance trigger the same logging and lineage obligations as agents reading database outputs. Without file-to-agent lineage, demonstrating Article 12 compliance for unstructured retrieval becomes structurally difficult and expensive to retrofit after deployment.
Sources
- Gartner. (2026). Lack of AI-ready data puts AI projects at risk. https://www.gartner.com/en/newsroom/press-releases/2025-02-26-lack-of-ai-ready-data-puts-ai-projects-at-risk
- IDC / Box. (2023). 90% of your data is unstructured. https://blog.box.com/90-your-data-unstructured-and-its-full-untapped-value
- IBM Institute for Business Value. (2025). CEO’s guide to AI decision-making. https://www.ibm.com/thought-leadership/institute-business-value/en-us/report/ceo-decision-making
- InfoWorld / Howard Friedman. (2026). Addressing the challenges of unstructured data governance for AI. https://www.infoworld.com/article/4160979/addressing-the-challenges-of-unstructured-data-governance-for-ai.html
- TD Sarma, Atlan. (2026). Unstructured data isn’t a storage problem - it’s an AI lineage problem. https://atlan.com/know/unstructured-data-ai-lineage/
- PR Newswire. (March 2026). BigID & Atlan introduce the first unified structured & unstructured data catalog for AI governance. https://www.prnewswire.com/news-releases/bigid--atlan-introduce-the-first-unified-structured--unstructured-data-catalog-for-ai-governance-302708291.html
