Unstructured Data for AI: How to Make It Usable

Emily Winks
Data Governance Expert
Updated: 05/27/2026 | Published: 05/27/2026

Key takeaways

  • Unstructured data makes up 80 to 90% of enterprise data and is the primary source AI agents retrieve from, yet it gets only 40% of tech investment.
  • Most agent failures are governance failures, not retrieval failures. Stale files and missing owners produce confident wrong answers.
  • Govern by AI agent lineage, not blanket file coverage. Trace from deployed agents to the specific documents they consume.
  • A document is AI-usable only when it carries owner, freshness SLA, sensitivity, lineage, and canonical definition links.

What is unstructured data for AI?

Unstructured data for AI is any content an AI agent retrieves at inference time, such as PDFs, wikis, Slack threads, and policy files. Gartner (2026) reports that up to 60% of AI pilots fail due to inadequate data readiness. The gap is not retrieval; it is governance. A file is AI-usable only when it carries five attributes: an identified owner, a freshness SLA, a sensitivity classification, lineage to downstream agents, and links to canonical business definitions.

Five attributes for file usability:

  • Owner: Someone accountable for ensuring the file is accurate and current at all times
  • Freshness SLA: A defined review cadence, version control, and supersession tracking applied at the file level
  • Sensitivity classification: Governed access policy specifying which users and agents may retrieve the file
  • Agent lineage: A trace from the file to every downstream agent consuming it in retrieval workflows
  • Canonical concept links: Pointers to certified business definitions for terms used within the document

Quick facts: Unstructured data for AI

| Aspect | Detail |
| --- | --- |
| Share of enterprise data | 80 to 90% (Gartner, IDC, 2025) |
| Annual growth rate | 40 to 60% (Gartner analyst Melody Chien, 2026) |
| Tech spend allocation | 40% goes to unstructured vs. 60% to structured (IDC, 2023) |
| AI pilot failure rate from data readiness | Up to 60% (Gartner, 2026) |
| AI data-readiness spend, 2025 to 2029 | 7x increase projected (Gartner analyst Nina Showell, 2026) |
| Most common production failure mode | Stale or superseded document consumed by agent without a freshness flag |

Unstructured data for AI is any content an AI agent retrieves at inference time: PDFs, policy wikis, Slack threads, contracts, meeting notes. It represents 80 to 90% of enterprise data but receives only 40% of tech investment. Making it usable is not a parsing problem: teams building document parsing pipelines can already extract clean text from most file types. It is a governance problem at the file level.


Why does the data with the most AI value get the least investment?

The most counterintuitive number in your enterprise AI budget is the spend allocation. IDC research found that 40% of tech spend goes to unstructured data while 60% goes to structured, even though unstructured content represents roughly 90% of what enterprises actually hold. The smaller pile gets the larger budget.

That math worked when unstructured data was a second-tier concern. PDFs and wiki pages did not flow through the pipelines powering dashboards or business decisions. AI agents changed that. Agents pull context directly from policy PDFs, SLA definitions, Confluence pages, and Slack threads at inference time. That content is no longer reference material. It is production input.

The volume framing has to reset. Indexing more documents into a vector store does not produce a better agent. It produces a wider surface area for the same governance failures. Teams building reliable AI data pipelines are learning that unstructured file coverage without governance is not a step forward - it is a liability that compounds with every additional document indexed.


Why does the structured-data playbook fail at the file layer?

When faced with the unstructured data governance gap, the first instinct is to apply the structured-data playbook: catalog everything, classify it, assign owners, track lineage. The problem is not that this instinct is wrong in principle. It is that unstructured data does not hold still long enough for blanket classification to work.

As Howard Friedman observed in InfoWorld, the rules governing how to treat different types of content change over time based on factors like content type, sensitivity, jurisdiction, and the specific use case. A contract may require one set of rules when signed, a second set when reviewed by legal during a dispute, and a third set when an agent is using natural language processing to analyze its clauses. Each rule change alters which agents are permitted to access the file.

Volume compounds the problem. A data warehouse has thousands of tables. An enterprise document estate has tens of millions of files across SharePoint, Confluence, email archives, and cloud storage. Blanket classification at that scale is not a harder version of the structured-data problem. It is a categorically different one - and it is why data quality approaches for AI built for structured data need a separate layer when applied to documents and wikis.


What does ‘usable for AI’ actually mean?

Usable for AI is not the same as parseable, chunked, embedded, or indexed. You can run hundreds of files through an advanced RAG pipeline and they can still be unusable to your agent. The distinction comes down to metadata.

Every file in your estate is either content or infrastructure, depending on whether it carries the five attributes an agent needs to trust it:

  • Owner. Someone is responsible for ensuring the information in the file is accurate and current.

  • Freshness SLA. How often should the file be reviewed? How long before it needs updating? What happens to previous versions?

  • Sensitivity classification. Who can access the file - and which agents - based on governed access policy?

  • Downstream consumers. Which agents and systems consume this file in their retrieval workflow?

  • Canonical concepts. Pointers to definitions of terms used within the document, drawn from the governed business glossary.

A file with all five attributes becomes AI-usable context: a source the agent can interpret, rely on, and trace.
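The five attributes can be expressed as a small metadata record with a completeness check. The sketch below is illustrative only: `DocumentMetadata`, its field names, and the sample path are assumptions made for this example, not a schema from any particular catalog.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional


@dataclass
class DocumentMetadata:
    """Hypothetical governance record for one file; field names are illustrative."""
    path: str
    owner: Optional[str] = None                  # accountable person or team
    review_interval_days: Optional[int] = None   # freshness SLA: review cadence
    last_reviewed: Optional[date] = None         # freshness SLA: last review date
    sensitivity: Optional[str] = None            # e.g. "public", "internal", "restricted"
    downstream_agents: list[str] = field(default_factory=list)  # agent lineage
    glossary_terms: list[str] = field(default_factory=list)     # canonical concept links

    def missing_attributes(self) -> list[str]:
        """Return the names of the five attributes this file still lacks."""
        checks = {
            "owner": bool(self.owner),
            "freshness_sla": (self.review_interval_days is not None
                              and self.last_reviewed is not None),
            "sensitivity": bool(self.sensitivity),
            "agent_lineage": bool(self.downstream_agents),
            "canonical_concepts": bool(self.glossary_terms),
        }
        return [name for name, ok in checks.items() if not ok]

    def is_ai_usable(self) -> bool:
        """A file counts as AI-usable only when none of the five attributes is missing."""
        return not self.missing_attributes()


policy = DocumentMetadata(path="policies/refunds.pdf", owner="finance-ops",
                          sensitivity="internal")
print(policy.missing_attributes())
# ['freshness_sla', 'agent_lineage', 'canonical_concepts']
```

A check like this could run at ingestion time, so a file with gaps is held back from the index rather than discovered later through a wrong answer.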


Why does naive RAG break in production?

Naive RAG solves a specific and real problem: giving the agent the right document at the right moment. It does not address the governance layer that sits above it.

A retriever built on embeddings ranks by semantic similarity, not authority. When ten versions of a policy exist across a document store, the retriever returns whichever version is most similar to the query, not whichever is canonical. This creates three production failure patterns: stale retrieval (the agent cites a document superseded months ago), conflicting versions (multiple policy drafts return simultaneously), and permission violation (the agent surfaces restricted content to a user without authorization to view it).
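One way to keep similarity from standing in for authority is to filter retriever output against governance metadata before the agent sees it. The sketch below is a minimal illustration under stated assumptions: the `Hit` fields (`policy_id`, `is_canonical`, `allowed_roles`) and the `govern_hits` function are hypothetical names, and it assumes a metadata store that already flags one canonical version per policy.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """One retriever result; every field here is hypothetical metadata."""
    doc_id: str
    policy_id: str                 # groups versions of the same policy
    similarity: float              # score returned by the vector search
    is_canonical: bool             # set by the document owner, not by the retriever
    allowed_roles: frozenset[str]  # access policy attached to the file


def govern_hits(hits: list[Hit], user_role: str) -> list[Hit]:
    """Drop hits the user may not see, keep canonical versions, then rank by similarity."""
    permitted = [h for h in hits if user_role in h.allowed_roles]
    canonical = [h for h in permitted if h.is_canonical]
    return sorted(canonical, key=lambda h: h.similarity, reverse=True)


hits = [
    Hit("refunds-v3-draft", "refunds", 0.93, False, frozenset({"support", "finance"})),
    Hit("refunds-v7", "refunds", 0.88, True, frozenset({"support", "finance"})),
    Hit("hr-severance", "severance", 0.90, True, frozenset({"hr"})),
]
print([h.doc_id for h in govern_hits(hits, "support")])
# ['refunds-v7']
```

The draft scores highest on similarity and is still excluded, and the restricted HR document never reaches a support user. Similarity only orders what survives the governance filters.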

IBM’s AI at the Core 2025 research found only 26% of organizations have more than moderate coverage of AI risks in governance frameworks. Most teams do not know which documents their agents are actually accessing. The agent produces a response. The response sounds authoritative. The document it drew from expired in 2021. This is the most common form of context poisoning in enterprise RAG deployments - and a primary driver of LLM knowledge base staleness that accumulates silently across large document estates.


What are the five governance failures that make unstructured data unusable?

When agents fail on unstructured retrieval in production, the cause maps to one of five governance failures. None is fixed at the model layer or the retrieval layer.

| Failure mode | What happens in production | What governance resolves it |
| --- | --- | --- |
| Ungoverned ingestion | Every page indexed regardless of authority or accuracy | Owner and certification status at file level |
| Stale retrieval | Agent cites a document superseded months or years ago | Freshness SLA, version control, supersession tracking |
| Permission violation | Agent surfaces restricted content to unauthorized user | Access policy attached to file, enforced at retrieval |
| Conflicting versions | Multiple drafts of the same policy return from different stores | One canonical version flagged across all stores |
| No lineage to agent | Cannot trace which document caused a wrong answer | File-to-agent lineage graph with decision traces |

Which governance failure causes the most production damage at scale?

Stale retrieval, by a significant margin. It generates a plausible-sounding answer that no system flags as wrong. Embedding models do not verify document age. Vector stores do not enforce version rules. LLM knowledge base staleness accumulates silently across large estates. A freshness flag at the file level, meaning a simple SLA that marks a document as requiring review after 90 days and surfaces a staleness warning at retrieval, delivers the highest return of any unstructured RAG governance investment. Teams building decision traces for AI agents consistently identify stale document retrieval as the top root cause when tracing hallucinated answers back to their source.
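A freshness flag of this kind needs very little logic. The sketch below assumes each document carries a `last_reviewed` date in its metadata; the function name `staleness_warning` and the warning text are hypothetical, and the 90-day interval is the example figure from the paragraph above, not a recommended default.

```python
from datetime import date, timedelta
from typing import Optional

REVIEW_INTERVAL = timedelta(days=90)  # example SLA; set per document type in practice


def staleness_warning(last_reviewed: date, today: date,
                      interval: timedelta = REVIEW_INTERVAL) -> Optional[str]:
    """Return a warning string if the document is past its review date, else None."""
    overdue = today - (last_reviewed + interval)
    if overdue > timedelta(0):
        return f"STALE: review overdue by {overdue.days} days"
    return None


# A policy last reviewed on 1 January 2026, retrieved on 27 May 2026.
print(staleness_warning(date(2026, 1, 1), date(2026, 5, 27)))
# STALE: review overdue by 56 days
```

At retrieval time the warning could be attached to the chunk passed to the agent, or used to exclude the document entirely. Which of those is right depends on how much risk the use case tolerates.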


How do you make this much content usable for AI tractably?

The practical starting point is not blanket coverage. It is to identify which documents actually feed deployed or in-progress agents, connect those documents to the definitions and systems around them, and prioritize that context first. The set of files an agent actually consumes is the unstructured context that matters most.
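In practice, the list of files that feed agents can often be derived from retrieval logs. The sketch below assumes a log of `(agent, document)` pairs and an owner lookup keyed by document. Both structures, and the names `build_lineage` and `governance_backlog`, are hypothetical stand-ins for whatever your retrieval layer and catalog actually expose.

```python
from collections import defaultdict


def build_lineage(retrieval_log: list[tuple[str, str]]) -> dict[str, set[str]]:
    """Map each document to the set of agents that retrieved it."""
    lineage: dict[str, set[str]] = defaultdict(set)
    for agent, doc in retrieval_log:
        lineage[doc].add(agent)
    return dict(lineage)


def governance_backlog(lineage: dict[str, set[str]],
                       owners: dict[str, str]) -> list[str]:
    """Documents that feed at least one agent but have no owner, most-consumed first."""
    unowned = [doc for doc in lineage if not owners.get(doc)]
    # Sort by number of consuming agents (descending), then by name for stable output.
    return sorted(unowned, key=lambda d: (-len(lineage[d]), d))


log = [
    ("support-bot", "refunds.pdf"),
    ("finance-bot", "refunds.pdf"),
    ("finance-bot", "rev-rec.pdf"),
    ("support-bot", "faq.md"),
]
owners = {"rev-rec.pdf": "finance-ops"}

lineage = build_lineage(log)
print(governance_backlog(lineage, owners))
# ['refunds.pdf', 'faq.md']
```

Files that never appear in the log do not enter the backlog at all, which is the point: the governance work is scoped to what agents actually consume.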


Why can’t structured and unstructured data live in separate systems for AI?

A finance agent answering a question about quarterly revenue may need one source for the table value and another for the policy that defines how revenue is recorded. When those sources are disconnected, the agent has to bridge that gap at runtime. A unified context layer is what lets the agent move from question to canonical table to canonical policy in one reasoning path, instead of stitching those answers together from separate systems.

That also reduces the consistency problem across agents, because multiple agents can resolve through the same shared context instead of each building its own partial view.
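As a rough illustration of what "one reasoning path" can mean, the sketch below models a shared context layer as a lookup from a business concept to both its canonical table and its canonical policy. Everything in it is an assumption made for the example: the `ConceptContext` type, the `resolve` function, and the table and file names are invented, and a real context layer would be a governed graph rather than an in-memory dictionary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConceptContext:
    """Hypothetical node in a shared context layer: one concept and its governed sources."""
    concept: str
    canonical_table: str    # structured source: where the number lives
    canonical_policy: str   # unstructured source: how the number is defined
    owner: str


# Illustrative entries only; the table and file names are made up.
CONTEXT_LAYER = {
    "quarterly_revenue": ConceptContext(
        concept="quarterly_revenue",
        canonical_table="finance.revenue_by_quarter",
        canonical_policy="policies/revenue-recognition.pdf",
        owner="finance-ops",
    ),
}


def resolve(concept: str) -> ConceptContext:
    """Return the table and the policy for a concept in one lookup, or fail loudly."""
    try:
        return CONTEXT_LAYER[concept]
    except KeyError:
        raise LookupError(f"no governed context for concept {concept!r}") from None


ctx = resolve("quarterly_revenue")
print(ctx.canonical_table, ctx.canonical_policy)
```

Because every agent resolves through the same entry, two agents asked about quarterly revenue land on the same table and the same policy instead of each assembling its own pairing.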


How Atlan approaches unstructured data for AI

The problem with unstructured data is not that enterprises lack documents. It is that agents retrieve those documents without enough context to know which file is current, which version matters, how a policy connects to the metric it governs, and whether that content should shape the answer at all.

Atlan addresses that by acting as the context layer across structured and unstructured sources. A warehouse table for revenue and the policy PDF that explains how revenue is recorded should not live in separate reasoning systems from the agent’s point of view. Atlan connects them into the same context graph so the agent can move from a number to the policy, definition, or related document that gives that number meaning.

That is also why lineage matters so much here. The practical starting point is not to catalog every file in the enterprise. It is to identify which files actually feed deployed agents, connect those files to the definitions and systems around them, and make that context available at runtime. Once you frame the problem that way, unstructured data becomes usable for AI not when every file is governed, but when the files that shape agent decisions carry the context the agent needs to interpret them correctly.


Why agent lineage, not blanket coverage, solves the unstructured data problem

Gartner analyst Nina Showell projects that AI spending on data readiness will increase 7x from 2025 to 2029, and that data leaders who build their own unstructured metadata solutions will incur costs more than 300% higher than those using existing platforms - a projection documented in depth in Atlan’s enterprise memory research.

The compounding logic is direct: every additional document indexed without governance increases agent surface area for error. Every governed file reduces it. The right question is not whether to govern unstructured data. It is which files are upstream of which agents, and whether anyone owns them.

Lineage-driven governance is the only tractable answer at enterprise volume. Teams that start now compound the advantage across the next four years of AI deployment. Teams that wait spend the same period explaining production failures their agents could have avoided. The pattern holds across context infrastructure for AI agents: the investment in governance at the file layer today is the investment that determines whether agent answers are trustworthy tomorrow.



FAQs about unstructured data for AI

1. What makes unstructured data different to govern compared to tables and warehouses?

Three things: volume, the lack of a native schema, and governance rules that change over time. A warehouse table has a defined structure and a finite owner set. A document repository can hold tens of millions of files across SharePoint, Confluence, and email, each with different sensitivity, freshness, and authority requirements that change as content evolves. The structured-data playbook does not extend cleanly to that volume or that rate of change.

2. Why does adding more documents to a vector store reduce agent accuracy?

Vector retrieval encodes semantic similarity, not authority. When ten versions of a policy exist across a document store, the retriever returns whichever is most similar to the query, not whichever is canonical. Larger vector stores compound this problem because every additional near-duplicate is another candidate to outrank the canonical version. The fix is governance metadata at the file level - specifically a canonical flag and a freshness SLA - not more storage capacity or better embedding models.

3. What is AI agent lineage for unstructured data?

AI agent lineage traces which specific documents and files feed which deployed agents. Instead of trying to govern every file an enterprise holds, lineage-driven governance identifies the much smaller set of files actually consumed by agents and prioritizes ownership, freshness, and policy enforcement on that subset first. This makes the governance problem tractable at enterprise volume.

4. How is governing unstructured data different from data security posture management?

DSPM tools track sensitive-data exposure across cloud storage and flag files that violate security policy. Governance for AI tracks authority, freshness, and lineage to agents. A DSPM tool tells you a customer record sits in an unsecured bucket. Governance for AI tells you a 2021 policy is feeding a compliance agent that should be reading the 2024 version. Both layers matter and they solve different problems.

5. Do I need to classify every document in my enterprise before building an AI agent?

No. Map your deployed or in-progress AI agents first. Trace the documents each one actually consumes through its retrieval layer. Govern that set. The remaining files can wait until they enter an agent context. Trying to classify everything before deploying anything stalls projects indefinitely and delivers governance coverage where it provides the least value.

6. Where does the EU AI Act apply to unstructured documents feeding AI agents?

Article 12 of the EU AI Act requires high-risk AI systems to automatically record events (logs) so their operation can be traced, and the requirement does not stop at structured data. Agents reading policy PDFs, regulatory documents, or compliance guidance trigger the same logging and lineage obligations as agents reading database outputs. Without file-to-agent lineage, demonstrating Article 12 compliance for unstructured retrieval becomes structurally difficult and expensive to retrofit after deployment.


Sources

  1. Gartner. (2026). Lack of AI-ready data puts AI projects at risk. https://www.gartner.com/en/newsroom/press-releases/2025-02-26-lack-of-ai-ready-data-puts-ai-projects-at-risk

  2. IDC / Box. (2023). 90% of your data is unstructured. https://blog.box.com/90-your-data-unstructured-and-its-full-untapped-value

  3. IBM Institute for Business Value. (2025). CEO’s guide to AI decision-making. https://www.ibm.com/thought-leadership/institute-business-value/en-us/report/ceo-decision-making

  4. InfoWorld / Howard Friedman. (2026). Addressing the challenges of unstructured data governance for AI. https://www.infoworld.com/article/4160979/addressing-the-challenges-of-unstructured-data-governance-for-ai.html

  5. TD Sarma, Atlan. (2026). Unstructured data isn’t a storage problem - it’s an AI lineage problem. https://atlan.com/know/unstructured-data-ai-lineage/

  6. PR Newswire. (March 2026). BigID & Atlan introduce the first unified structured & unstructured data catalog for AI governance. https://www.prnewswire.com/news-releases/bigid--atlan-introduce-the-first-unified-structured--unstructured-data-catalog-for-ai-governance-302708291.html
