---
title: "Data Lineage RCA With MCP: Trace a Broken Dashboard to Root Cause"
url: "https://atlan.com/know/mcp/data-lineage-rca-with-mcp/"
description: "Learn how an AI agent runs data lineage RCA over MCP: a 7-step workflow that traverses governed lineage, inspects trust signals, and pinpoints root cause."
author: "Emily Winks"
author_role: "Data Governance Expert"
published: "2026-06-22"
updated: "2026-06-22"
---


Data lineage RCA with MCP is an agent-run workflow in which an [AI agent](https://atlan.com/know/ai-agent/what-is-an-ai-agent/) uses the [Model Context Protocol](https://atlan.com/know/what-is-model-context-protocol/) to traverse governed lineage upstream from a failing asset, inspect the trust signals on each node, and pinpoint the root cause. Research cited by New Relic attributes roughly 80% of incident resolution time to locating the cause, not fixing it. MCP moves the query; the governed [Context Layer for AI](https://atlan.com/know/context-layer-enterprise-ai/) underneath (lineage plus quality, ownership, certification, and [policy context](https://atlan.com/know/context-layer-vs-semantic-layer/)) produces the diagnosis. Lineage alone is a map; lineage plus trust signals is a diagnosis.

### Quick facts: data lineage RCA with MCP

| Field | Value |
|---|---|
| What it is | Agent-run root-cause analysis of a data incident, executed over MCP against governed lineage |
| Primary MCP tool | `traverse_lineage` (upstream + downstream) |
| Supporting MCP tools | `search_assets`, `get_assets_by_dsl`, `update_assets` |
| Steps | 7 (detect, traverse, inspect, identify, blast radius, route, report) |
| Prerequisite | Column-level lineage plus trust signals (quality, ownership, certification, policy context) in the graph |
| Why it works | Trust signals travel with the lineage query; topology alone cannot diagnose |
| Productized as | Atlan Root Cause Analysis Agent |
| Typical bottleneck it removes | Locating the cause (~80% of MTTR) |

---

## Why does root cause analysis fail without governed lineage?

Most data RCA time is spent locating the cause, not fixing it, and that is exactly where a bare protocol cannot help. The bottleneck is diagnosis: knowing which of dozens of upstream changes actually broke the number on a dashboard.

The volume is real. According to [Monte Carlo's State of Data Quality survey](https://www.montecarlodata.com/blog-data-quality-survey), the average organization experiences around 61 data incidents per month, each taking roughly 13 hours to detect and resolve. Monte Carlo's later research reported a 166% increase in time-to-resolution year over year. At that scale, manual upstream tracing does not keep up.

The time goes to one place. Research cited by [New Relic](https://newrelic.com/blog/observability/how-to-improve-mttr) finds that about 80% of mean time to resolution is spent identifying which change or component caused the incident, not repairing it. RCA is a search problem before it is a repair problem.
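As a rough back-of-the-envelope sketch, the cited figures can be combined to size the search problem. Applying the New Relic-cited 80% share to Monte Carlo's incident numbers is an assumption made for illustration, since the two figures come from different studies.

```python
# Back-of-the-envelope sizing using the figures cited above.
# Combining figures from two different studies is an assumption.
incidents_per_month = 61   # Monte Carlo survey figure
hours_per_incident = 13    # time to detect and resolve, per incident
locate_share = 0.80        # New Relic-cited share spent locating the cause

total_hours = incidents_per_month * hours_per_incident
locate_hours = total_hours * locate_share

print(f"Total incident hours per month: {total_hours}")
print(f"Hours spent locating the cause: {locate_hours:.0f}")
```

Under these assumptions, most of the monthly incident budget is search time, which is the portion the workflow in this article targets.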

The foundation matters more than the model. According to [Gartner (2025)](https://www.gartner.com/en/newsroom/press-releases/2025-06-25-gartner-predicts-over-40-percent-of-agentic-ai-projects-will-be-canceled-by-end-of-2027), over 40% of [agentic AI projects](https://atlan.com/know/what-is-agentic-ai/) will be canceled by the end of 2027. Gartner attributes the cancellations to escalating costs, unclear business value, and inadequate risk controls. The pattern that sits underneath those failures is a missing data foundation: no lineage, unclear ownership, and inconsistent definitions. Protocols like MCP, A2A, and ANP standardize the query an agent sends; they do not make the underlying context certified, current, or correct. The durable asset is the [governed context layer](https://atlan.com/know/context-layer-enterprise-ai/) underneath, and RCA is where that distinction becomes visible. Lineage alone is a map, not a diagnosis.

---

## Prerequisites: what must already be in the graph

Before an agent can run RCA over MCP, the lineage graph and its trust signals must already exist. The protocol can reach the graph, but it cannot create the context the graph holds.

Three layers of prerequisite have to be in place.

**Technical prerequisites:**
- [Column-level lineage](https://atlan.com/know/training-data-lineage-for-llms/) across the full stack: source, ingestion, dbt or transform, warehouse, and BI.
- An MCP server that exposes `traverse_lineage` and `search_assets` so an agent can walk the graph programmatically.
- Orchestrator signals from Airflow, Prefect, or Dagster wired in as triggers so failures and anomalies become actionable events.

**Trust-signal prerequisites (the diagnosis layer):**
- Quality scores and quality-check results on each asset.
- Ownership and stewardship so a root cause has an accountable target.
- [Certification status](https://atlan.com/know/zero-trust-data-governance/), [business glossary](https://atlan.com/know/what-is-a-knowledge-graph/) terms, and policy context or classification attached to assets.

**Organizational prerequisite:** every asset has an accountable owner so the routing step has somewhere to send the ticket.

A quick test tells you the graph is ready. If the agent can call `traverse_lineage` and get back provenance, quality score, policy context, and ownership in one call, the graph is RCA-ready. Without those signals, the agent can see the topology but cannot tell a healthy node from a broken one.
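The readiness test can be sketched as a small check over the nodes a `traverse_lineage` call returns. The response shape and the field names (`provenance`, `quality_score`, `policy_context`, `owner`) are hypothetical here; a real MCP server may name or nest them differently.

```python
# Hypothetical field names; adjust to whatever your MCP server returns.
REQUIRED_SIGNALS = ("provenance", "quality_score", "policy_context", "owner")

def missing_signals(node: dict) -> list[str]:
    """Return the required trust signals that are absent or empty on a node."""
    return [key for key in REQUIRED_SIGNALS if node.get(key) in (None, "", [])]

def is_rca_ready(nodes: list[dict]) -> bool:
    """The graph is RCA-ready only if every returned node carries all signals."""
    return bool(nodes) and all(not missing_signals(n) for n in nodes)

# Illustrative sample standing in for a traverse_lineage response.
sample_response = [
    {"name": "fct_revenue", "provenance": "dbt", "quality_score": 0.98,
     "policy_context": "internal", "owner": "analytics-eng"},
    {"name": "stg_orders", "provenance": "dbt", "quality_score": 0.91,
     "policy_context": "internal", "owner": None},
]

for node in sample_response:
    print(node["name"], "missing:", missing_signals(node))
print("RCA-ready:", is_rca_ready(sample_response))
```

A single node without an owner is enough to fail the check, because the routing step later in the workflow would have nowhere to send the ticket.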

---

## The 7-step agent-over-MCP RCA workflow

This agent-run workflow takes a data incident from symptom to finished RCA report in seven steps over MCP. Each step names the MCP tool it uses, so the sequence is reproducible against any MCP server that exposes governed lineage.

### Step 1: Detect the symptom

A trigger fires: a BI alert that revenue looks wrong, a freshness or volume anomaly from the orchestrator, or a failed quality check. The agent receives the failing asset ID, the failure type (quality, freshness, schema, or volume), and a timestamp. No lineage call is needed yet; if only a label arrived with the alert, `search_assets` can resolve the asset by name.
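A minimal sketch of the trigger the agent receives, assuming the alert arrives as a flat dictionary. The payload keys, the class names, and the example date are hypothetical; only the three fields and the four failure types come from the step above.

```python
from dataclasses import dataclass
from datetime import datetime
from enum import Enum

class FailureType(Enum):
    QUALITY = "quality"
    FRESHNESS = "freshness"
    SCHEMA = "schema"
    VOLUME = "volume"

@dataclass(frozen=True)
class IncidentTrigger:
    asset_id: str
    failure_type: FailureType
    detected_at: datetime

def parse_trigger(payload: dict) -> IncidentTrigger:
    """Validate an alert payload; raises ValueError on an unknown failure type."""
    return IncidentTrigger(
        asset_id=payload["asset_id"],
        failure_type=FailureType(payload["failure_type"]),
        detected_at=datetime.fromisoformat(payload["detected_at"]),
    )

# Hypothetical alert payload (the date is invented for illustration).
trigger = parse_trigger({
    "asset_id": "weekly_sales",
    "failure_type": "quality",
    "detected_at": "2026-06-15T07:14:00",
})
print(trigger)
```

Rejecting unknown failure types at the boundary keeps later steps from reasoning over an alert the workflow was never designed to handle.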

### Step 2: Identify the impacted asset and traverse upstream

The agent calls `traverse_lineage` upstream on the failing dashboard or dataset to walk the chain: BI metric, then aggregation table, then [transformation model](https://atlan.com/know/mcp/mcp-server-for-dbt/), then source table, then ingestion job. [Column-level lineage](https://atlan.com/column-level-lineage/) narrows the walk to only the columns feeding the broken value, so the agent does not chase fields that have nothing to do with the symptom.
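The upstream walk can be sketched as a breadth-first traversal over an in-memory adjacency map. This stands in for what a `traverse_lineage` call does server-side and is not the tool's actual implementation; the `UPSTREAM` map is hypothetical and borrows the asset names from the worked example later in this article.

```python
from collections import deque

# Hypothetical edge map: each asset lists its direct upstream parents.
UPSTREAM = {
    "weekly_sales": ["fct_revenue"],
    "fct_revenue": ["stg_orders"],
    "stg_orders": ["ORDERS_RAW"],
    "ORDERS_RAW": ["orders_ingest"],
    "orders_ingest": [],
}

def traverse_upstream(start: str, graph: dict[str, list[str]]) -> list[str]:
    """Breadth-first walk from the failing asset toward its sources."""
    visited = {start}
    order = []
    queue = deque([start])
    while queue:
        node = queue.popleft()
        order.append(node)
        for parent in graph.get(node, []):
            if parent not in visited:  # guard against revisiting shared parents
                visited.add(parent)
                queue.append(parent)
    return order

chain = traverse_upstream("weekly_sales", UPSTREAM)
print(" -> ".join(chain))
```

Breadth-first order returns nodes nearest the symptom first, which matches how the agent inspects hops in Step 3.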

### Step 3: Inspect each node's trust signals

At every upstream node the agent reads the context that travels with the lineage query: freshness and last-run time, quality-check results, recent schema changes, certification status, ownership, and any incidents or announcements. This is where map becomes diagnosis. A node that is connected but green is ruled out; a node that is connected and shows a schema change or a failed quality check the same morning becomes a root-cause candidate. Without these signals, the agent has edges and nothing to reason about.
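The "connected but green" versus "connected and changed" distinction can be sketched as a simple classifier. The signal names (`quality_check`, `recent_schema_change`) and the sample values are hypothetical, and a real agent would weigh more signals than these two.

```python
def classify_node(node: dict) -> str:
    """Label a node 'candidate' if a trust signal is off, else 'ruled_out'."""
    failed_check = node.get("quality_check") == "failed"
    schema_changed = bool(node.get("recent_schema_change"))
    return "candidate" if failed_check or schema_changed else "ruled_out"

# Illustrative nodes with hypothetical signal values.
nodes = [
    {"name": "fct_revenue", "quality_check": "passed", "recent_schema_change": False},
    {"name": "stg_orders", "quality_check": "passed", "recent_schema_change": True},
    {"name": "ORDERS_RAW", "quality_check": "passed", "recent_schema_change": True},
]

candidates = [n["name"] for n in nodes if classify_node(n) == "candidate"]
print("Root-cause candidates:", candidates)
```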

### Step 4: Identify the root cause

The root cause is the most-upstream node whose change or failure timeline aligns with the downstream symptom. The agent uses `get_assets_by_dsl` to fetch change history and timestamps, then correlates the change timestamp against when the symptom first appeared. This confirms causation rather than coincidence, which is what separates a real RCA from a guess.
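The selection rule can be sketched as follows: keep only candidates whose change precedes the symptom, then take the one furthest upstream. The `depth` field (hops upstream from the failing asset), the node names, and all timestamps are hypothetical stand-ins for what `get_assets_by_dsl` might return.

```python
from datetime import datetime
from typing import Optional

def pick_root_cause(candidates: list[dict], symptom_at: datetime) -> Optional[dict]:
    """Most-upstream candidate whose change precedes the symptom, else None."""
    eligible = [c for c in candidates if c["changed_at"] <= symptom_at]
    return max(eligible, key=lambda c: c["depth"], default=None)

symptom_at = datetime(2026, 6, 15, 7, 14)  # hypothetical symptom time

# Hypothetical candidates; depth counts hops upstream from the failing asset.
candidates = [
    {"name": "transform_model", "depth": 2, "changed_at": datetime(2026, 6, 14, 22, 10)},
    {"name": "source_table", "depth": 3, "changed_at": datetime(2026, 6, 14, 21, 40)},
    # Changed after the symptom appeared, so it cannot have caused it.
    {"name": "ingestion_job", "depth": 4, "changed_at": datetime(2026, 6, 15, 9, 0)},
]

root = pick_root_cause(candidates, symptom_at)
print("Root cause:", root["name"] if root else "none found")
```

Filtering on time before ranking by depth is what excludes the deepest node here: it is furthest upstream, but its change came too late to explain the symptom.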

### Step 5: Assess downstream blast radius

The agent runs `traverse_lineage` downstream from the root-cause node to enumerate every other dashboard, dataset, and AI agent consuming the bad data. This is the impact-analysis half of lineage, and it tells stakeholders what else to quarantine or stop trusting until the fix lands.
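The blast-radius enumeration can be sketched as a reachability search in the opposite direction. The `DOWNSTREAM` map is hypothetical; in particular, where the extra dashboard and the agent attach to the chain is an assumption made for illustration.

```python
# Hypothetical edge map: each asset lists its direct downstream consumers.
DOWNSTREAM = {
    "ORDERS_RAW": ["stg_orders"],
    "stg_orders": ["fct_revenue", "churn_prediction_agent"],
    "fct_revenue": ["weekly_sales", "exec_revenue"],
}

def blast_radius(root: str, downstream: dict[str, list[str]]) -> list[str]:
    """All assets reachable downstream of root, excluding root itself."""
    seen = set()
    stack = [root]
    while stack:
        for child in downstream.get(stack.pop(), []):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    seen.discard(root)  # defensive: only matters if the graph has a cycle
    return sorted(seen)

impacted = blast_radius("ORDERS_RAW", DOWNSTREAM)
print("Impacted assets:", impacted)
```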

### Step 6: Route to the owner

Using the ownership and stewardship signal on the root-cause asset, the agent routes a structured ticket through Slack or Jira to the accountable owner rather than the on-call generalist. It can also call `update_assets` to annotate the affected asset with the open incident.
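A minimal sketch of the structured ticket, assuming the ownership signal arrives as an `owner` field on the root-cause asset. The payload keys and the `data-on-call` fallback are hypothetical; this builds a plain dictionary and does not call the Slack or Jira APIs.

```python
import json

def build_ticket(root_cause: dict, symptom: str, impacted: list[str]) -> dict:
    """Build a ticket addressed to the asset's owner, with a labeled fallback."""
    owner = root_cause.get("owner") or "data-on-call"  # fallback is an assumption
    return {
        "assignee": owner,
        "title": f"Root cause candidate: {root_cause['name']}",
        "symptom": symptom,
        "impacted_assets": impacted,
    }

ticket = build_ticket(
    {"name": "ORDERS_RAW", "owner": "ingestion-team"},
    symptom="weekly_sales value anomaly",
    impacted=["exec_revenue", "churn_prediction_agent"],
)
print(json.dumps(ticket, indent=2))
```

The fallback exists only so an ownerless asset still produces a ticket; the prerequisites earlier in this article aim to make that branch unreachable.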

### Step 7: Generate the RCA report

The agent assembles the report: the symptom, the root cause, the lineage chain from cause to symptom, the timestamp correlation as evidence, the downstream blast radius, the owner, and a confidence note tied to lineage completeness. It uses `update_assets` to attach the RCA annotation to the asset record so the next investigator inherits the context.
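The report assembly can be sketched as a plain-text renderer over the fields listed above. The function name, the argument names, and the rule that ties confidence to a count of known lineage gaps are all hypothetical; the article does not specify how the confidence note is computed.

```python
def render_report(symptom: str, root_cause: str, chain: list[str],
                  evidence: str, impacted: list[str], owner: str,
                  lineage_gaps: int) -> str:
    """Assemble a plain-text RCA report from the workflow's findings."""
    # Assumed rule: any known gap in the lineage lowers confidence.
    confidence = "high" if lineage_gaps == 0 else "reduced"
    lines = [
        f"Symptom: {symptom}",
        f"Root cause: {root_cause}",
        f"Lineage chain: {' -> '.join(chain)}",
        f"Evidence: {evidence}",
        f"Blast radius: {', '.join(impacted) or 'none found'}",
        f"Owner: {owner}",
        f"Confidence: {confidence} ({lineage_gaps} known lineage gaps)",
    ]
    return "\n".join(lines)

report = render_report(
    symptom="weekly_sales value anomaly",
    root_cause="column rename in ORDERS_RAW",
    chain=["ORDERS_RAW", "stg_orders", "fct_revenue", "weekly_sales"],
    evidence="source change precedes first symptom",
    impacted=["exec_revenue", "churn_prediction_agent"],
    owner="ingestion-team",
    lineage_gaps=0,
)
print(report)
```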

### Which MCP tool runs which RCA step

| Step | Action | MCP tool |
|---|---|---|
| 1 | Detect symptom, resolve asset | (trigger) · `search_assets` |
| 2 | Traverse upstream lineage | `traverse_lineage` (upstream) |
| 3 | Inspect trust signals per node | `traverse_lineage` · `get_assets_by_dsl` |
| 4 | Confirm root cause via timestamps | `get_assets_by_dsl` |
| 5 | Assess downstream blast radius | `traverse_lineage` (downstream) |
| 6 | Route to owner | ownership signal · `update_assets` |
| 7 | Generate RCA report | `update_assets` |

---

## Worked example: tracing a wrong-revenue dashboard to its root cause

Here is the full workflow on one realistic incident: a Power BI "Weekly Sales" dashboard showing Q2 revenue down 12% on Monday morning, escalated by finance before the QBR. The [Power BI data lineage view](https://learn.microsoft.com/en-us/power-bi/collaborate-share/service-data-lineage), as Microsoft Learn documents, shows the dashboard's immediate sources but not the transformation that broke, which is exactly the gap the agent closes.

The detection step fires at 07:14 on Monday: a freshness and value-range quality check on the `weekly_sales` semantic model reports a value anomaly. The agent receives the asset, the failure type, and the timestamp, then calls `traverse_lineage` upstream and gets back this chain.

```
weekly_sales (Power BI)  ->  fct_revenue (dbt mart)  ->  stg_orders (dbt staging)
      ->  ORDERS_RAW (Snowflake source)  ->  orders_ingest (Fivetran job)
```

Now the agent inspects trust signals at each hop, which is Step 3 in action. `fct_revenue` is green and certified, so it is ruled out. `stg_orders` shows a schema change committed Sunday night. `ORDERS_RAW` shows that the column `order_total` was renamed to `order_amount` in the source on Sunday, while the `stg_orders` transform still references `order_total`, which now returns silent NULLs that aggregate into an understated total.

The root cause is the column rename in `ORDERS_RAW`. It is the most-upstream change, and its Sunday-night timestamp precedes the Monday symptom, so the causation is confirmed rather than assumed.

The blast radius matters as much as the cause. Traversing downstream, the agent finds that the same `order_amount` column also feeds the "Exec Revenue" dashboard and a churn-prediction agent, so both are flagged for stakeholders.

Routing closes the loop. The ownership signal on `ORDERS_RAW` points to the Ingestion team, so the ticket goes there, not to the BI on-call who would have spent hours tracing a problem they did not own. The RCA report proposes the fix: update `stg_orders` to reference `order_amount`, or add a rename alias. The diagnosis held because quality, ownership, and schema history traveled with the lineage query.
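The silent-NULL mechanism and the proposed alias fix can be illustrated with a toy example. This assumes the staging transform reads the field leniently, so a missing column yields a null instead of an error, which the article does not specify; the rows and amounts are invented.

```python
# Toy records: the first was loaded before the rename, the rest after it.
rows = [
    {"order_id": 1, "order_total": 100.0},
    {"order_id": 2, "order_amount": 50.0},
    {"order_id": 3, "order_amount": 25.0},
]

def stale_total(records: list[dict]) -> float:
    """Mimic the stale transform: still reads order_total, nulls are skipped."""
    values = [r.get("order_total") for r in records]  # None after the rename
    return sum(v for v in values if v is not None)    # like SQL SUM over NULLs

def fixed_total(records: list[dict]) -> float:
    """Mimic the alias fix: prefer order_amount, fall back to order_total."""
    return sum(r.get("order_amount", r.get("order_total", 0.0)) for r in records)

print("Stale transform total:", stale_total(rows))
print("Fixed transform total:", fixed_total(rows))
```

Nothing fails in the stale version; the total simply shrinks, which is why the symptom surfaced as a value anomaly on the dashboard and not as a pipeline error.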

---

## What breaks RCA over MCP?

RCA over MCP fails in five predictable ways, and every one of them is a gap in the governed context, not in the protocol. The agent is only ever as good as the graph it queries.

1. **Incomplete lineage.** The agent hits a gap in the graph, cannot see past it, and returns a false root cause that costs more time than manual investigation would have.
2. **Table-level lineage only.** The agent knows table A feeds table B but not which column transformation failed, so it cannot resolve a field-level break like a renamed column.
3. **No trust signals on the graph.** The agent sees the topology but cannot tell a healthy node from a broken one. It has a map with no diagnosis.
4. **Stale lineage.** The graph reflects a pipeline that no longer exists, so the agent traces a dead path and reaches a node that is no longer in production.
5. **Treating the protocol as the solution.** MCP standardizes the query an agent sends; it does not make the underlying context certified, current, or correct.

The pattern across all five is the same. The protocol moves the query, and the [governed context layer](https://atlan.com/know/what-is-context-layer/) underneath produces the answer. Fix the context and the workflow works; fix only the protocol and it still fails.
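The first four failure modes can be checked before an RCA run with a small preflight, sketched here under stated assumptions. The node fields (`upstream`, `is_source`, `lineage_level`, `lineage_synced_at`) and the one-day staleness threshold are hypothetical.

```python
from datetime import datetime, timedelta

def preflight(nodes: list[dict], now: datetime,
              max_age: timedelta = timedelta(days=1)) -> list[str]:
    """Return human-readable problems that would undermine an RCA run."""
    problems = []
    for n in nodes:
        if not n.get("upstream") and not n.get("is_source"):
            problems.append(f"{n['name']}: lineage gap (no upstream, not a source)")
        if n.get("lineage_level") != "column":
            problems.append(f"{n['name']}: table-level lineage only")
        if n.get("quality_score") is None or not n.get("owner"):
            problems.append(f"{n['name']}: missing trust signals")
        if now - n["lineage_synced_at"] > max_age:
            problems.append(f"{n['name']}: stale lineage")
    return problems

now = datetime(2026, 6, 15, 7, 14)  # hypothetical "current" time
nodes = [
    {"name": "stg_orders", "upstream": ["ORDERS_RAW"], "lineage_level": "column",
     "quality_score": 0.9, "owner": "analytics-eng",
     "lineage_synced_at": now - timedelta(hours=2)},
    {"name": "legacy_feed", "upstream": [], "is_source": False, "lineage_level": "table",
     "quality_score": None, "owner": "",
     "lineage_synced_at": now - timedelta(days=10)},
]

for problem in preflight(nodes, now):
    print(problem)
```

The fifth failure mode has no check because it is a matter of expectations, not of graph state.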

---

## Column-level vs table-level lineage: which does RCA need?

RCA needs column-level lineage, because table-level lineage tells you that tables are connected but not which field actually broke. The `order_total` to `order_amount` rename in the worked example is unsolvable at the table level, since both tables are still connected and the edge looks healthy.

According to [Monte Carlo's analysis of table-level versus field-level lineage](https://www.montecarlodata.com/blog-table-level-vs-field-level-data-lineage-whats-the-difference/), column-level lineage pinpoints the exact field in the exact table feeding one report value, while table-level lineage only confirms that tables depend on each other. The difference is the resolution of the diagnosis.

| View | What it tells you |
|---|---|
| Table-level | "A feeds B" |
| Column-level | "`ORDERS_RAW.order_amount` feeds `fct_revenue.revenue`" |
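The difference in resolution can be sketched with a handful of column-level edges. The staging column name `stg_orders.order_total` and the helper functions are hypothetical; the point is that collapsing to table level hides the rename, while the column-level view exposes a dangling reference.

```python
# Column-level lineage as (source_column, target_column) pairs.
column_edges = {
    ("ORDERS_RAW.order_total", "stg_orders.order_total"),
    ("stg_orders.order_total", "fct_revenue.revenue"),
}

def table_edges(col_edges: set[tuple[str, str]]) -> set[tuple[str, str]]:
    """Collapse column edges to table-level 'A feeds B' pairs."""
    return {(s.split(".")[0], t.split(".")[0]) for s, t in col_edges}

def dangling_sources(col_edges: set[tuple[str, str]],
                     existing_columns: set[str]) -> list[str]:
    """Source columns referenced by lineage that no longer exist."""
    return sorted({s for s, _ in col_edges if s not in existing_columns})

# Columns that exist after the source rename (order_total is gone).
columns_after_rename = {
    "ORDERS_RAW.order_amount",
    "stg_orders.order_total",
    "fct_revenue.revenue",
}

print("Table-level view:", sorted(table_edges(column_edges)))
print("Dangling column references:", dangling_sources(column_edges, columns_after_rename))
```

The table-level pairs are identical before and after the rename, so only the column-level check has anything to report.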

It also helps to keep two operations distinct. RCA traverses upstream to find the cause; [impact analysis](https://atlan.com/know/ai-agent-observability/) traverses downstream to find what is affected. Both run on the same lineage in opposite directions, which is Step 2 versus Step 5 of the workflow above.

---

## Does MCP make data lineage trustworthy on its own?

No. MCP standardizes how an agent reaches lineage, but it does nothing to make that lineage certified, current, or correct. The protocol is the access method; the trust comes from the graph.

A skeptic could argue that a sufficiently capable model with raw SQL access could brute-force RCA by reading all the code and logs. In practice that approach does not scale to 61 incidents a month, it is non-deterministic, and it produces causal claims no one can audit. The governed graph is what makes the answer reproducible and the routing correct.

Observability vendors already do lineage-aware RCA without MCP, which proves the pattern works. The difference is that those are closed products, not an open protocol any agent can call, and the context layer underneath is still what does the diagnostic work. MCP is the pipe; context quality is what flows through it. That is why teams investing here build on the [enterprise context layer](https://atlan.com/know/context-layer-enterprise-ai/) rather than betting on any single protocol.

---

## How Atlan's Root Cause Analysis Agent runs this over MCP

Atlan ships this workflow as the Root Cause Analysis Agent, powered by the [Atlan MCP server](https://atlan.com/know/what-is-atlan-mcp/), so teams do not have to assemble it by hand. It is the productized version of the seven steps above, running against Atlan as the [Context Layer for AI](https://atlan.com/know/context-layer-enterprise-ai/).

Done manually, the process is slow. An engineer opens the BI tool, hunts for the compiled SQL, checks source freshness, walks lineage by hand across the warehouse and transform layers, and often pings the wrong on-call. Most of that time is spent locating the problem, not fixing it.

The agent collapses that loop. It identifies the impacted asset, traverses lineage, detects failures across the chain (job failures, incidents, quality issues, and announcements), and generates a consolidated RCA report. It runs on the same governed [Enterprise Data Graph](https://atlan.com/data-lineage/) that the MCP server exposes through `traverse_lineage`, so an agent checking a column gets back provenance, quality score, policy context, and ownership in one call.

The usage is real, even where outcome metrics are not yet published. DataCamp ran 3,158 MCP calls in 14 days using Atlan. CME Group's lineage spans on-prem Oracle, BigQuery, and Looker in a single graph. DigiKey treats Atlan as an MCP server delivering context to AI models. The connecting thread is that the agent is only as good as the [governed context](https://atlan.com/know/mcp-connected-data-catalog/) the MCP server hands it.

---

## Real stories from real customers: Governed context powering AI agents

> "We're excited to build the future of AI governance with Atlan. All of the work that we did to get to a shared language at Workday can be leveraged by AI via Atlan's MCP server…as part of Atlan's AI Labs, we're co-building the semantic layer that AI needs with new constructs, like context products."
>
> Joe DosSantos, VP of Enterprise Data & Analytics, Workday




> "Atlan is much more than a catalog of catalogs. It's more of a context operating system…Atlan enabled us to easily activate metadata for everything from discovery in the marketplace to AI governance to data quality to an MCP server delivering context to AI models."
>
> Sridher Arumugham, Chief Data & Analytics Officer, DigiKey


---

## Why governed context, not the protocol, is what closes an RCA

Data lineage RCA with MCP works because the protocol and the context play different roles. MCP standardizes how an agent reaches the graph and what it can ask; the governed Context Layer for AI underneath (lineage plus quality, ownership, certification, and policy context) is what turns a chain of edges into a diagnosis. Swap the protocol and the workflow still runs. Strip the trust signals and it collapses into a map with no answer. That is why the durable investment is the [governed context an agent retrieves](https://atlan.com/know/mcp-connected-data-catalog/), not the pipe it travels through. The worked example shows it: the rename in `ORDERS_RAW` was findable only because ownership, schema history, and quality scores traveled with the `traverse_lineage` query. Lineage alone is a map; lineage plus trust signals is a diagnosis.


---

## FAQs

### 1. How does an AI agent use lineage to find the cause of a broken dashboard?

The agent traverses lineage upstream from the failing dashboard through aggregation tables, transformation models, and source tables. At each node it reads trust signals like freshness, quality checks, schema changes, and ownership. The most-upstream node whose change timeline aligns with the symptom is the root cause.

### 2. What is the difference between root cause analysis and impact analysis?

Root cause analysis traverses lineage upstream to find what caused an incident. Impact analysis traverses the same lineage downstream to find every dataset, dashboard, and agent affected by it. They run in opposite directions on one graph, so a complete investigation does both.

### 3. Why is column-level lineage important for root cause analysis?

Table-level lineage only tells you that two tables are connected, not which field broke. Column-level lineage pinpoints the exact field feeding a report value, so it can resolve breaks like a renamed source column that table-level views miss entirely.

### 4. What breaks automated root cause analysis?

Incomplete lineage, table-level-only lineage, missing trust signals, and stale lineage all break it. Each leaves the agent unable to reach or recognize the node that actually failed. The fifth failure is treating the protocol as the solution when the context underneath is what produces the diagnosis.

### 5. Does MCP make data lineage trustworthy on its own?

No. MCP standardizes how an agent reaches and queries lineage, but it does not make that lineage certified, current, or correct. Trustworthiness comes from the governed context layer underneath: quality scores, ownership, certification, and policy context attached to each asset.

### 6. How long does data incident resolution take on average?

The average organization sees around 61 data incidents a month, each taking roughly 13 hours to detect and resolve, per Monte Carlo's State of Data Quality survey. Roughly 80% of that time goes to locating the cause rather than fixing it, which is the part automated RCA over governed lineage compresses.

---

## Sources

1. [State of Data Quality Survey, Monte Carlo](https://www.montecarlodata.com/blog-data-quality-survey)
2. [How to Improve MTTR, New Relic](https://newrelic.com/blog/observability/how-to-improve-mttr)
3. [Gartner Predicts Over 40% of Agentic AI Projects Will Be Canceled by End of 2027, Gartner](https://www.gartner.com/en/newsroom/press-releases/2025-06-25-gartner-predicts-over-40-percent-of-agentic-ai-projects-will-be-canceled-by-end-of-2027)
4. [Table-Level vs Field-Level Data Lineage, Monte Carlo](https://www.montecarlodata.com/blog-table-level-vs-field-level-data-lineage-whats-the-difference/)
5. [The Data Engineer's Guide to Root Cause Analysis, Monte Carlo](https://www.montecarlodata.com/blog-the-data-engineers-guide-to-root-cause-analysis/)
6. [Power BI Data Lineage View, Microsoft Learn](https://learn.microsoft.com/en-us/power-bi/collaborate-share/service-data-lineage)
7. [Launching an Agentic SRE for Root Cause Analysis, Mezmo](https://www.mezmo.com/blog/launching-an-agentic-sre-for-root-cause-analysis)