Three techniques dominate the conversation about LLM reasoning: chain-of-thought, ReAct, and reflection. Each governs how a model thinks through a problem. None governs what the model knows about your business. Understanding the difference between those two layers — reasoning technique versus context quality — is the most important distinction in production AI engineering.
This page breaks down how each technique works, where each fits, what the research actually says, and why all three share the same upstream dependency.
| Key fact | Detail |
|---|---|
| Chain-of-thought origin | Wei et al., NeurIPS 2022 |
| ReAct origin | Yao et al., ICLR 2023 |
| Wharton CoT finding | 20-80% more time and tokens; marginal benefit on reasoning models |
| ReAct ALFWorld result | 34% absolute improvement over baselines |
| Shared limitation | All three depend on the quality of context in the window |
Reasoning technique is the second-order decision. Context quality is the first. Most teams get this backwards — they optimize the reasoning loop before they have a governed, accurate context layer. The result is a more sophisticated path to the same wrong answer.
Chain-of-thought asks the model to show its work. ReAct asks the model to interleave reasoning with tool calls and observations. Reflection asks the model to critique its own output. Each technique adds structure to the thinking process. None of them adds knowledge the model does not already have.
How does chain-of-thought work, and where does it break?
Chain-of-thought (CoT) was formalized in Wei et al.'s 2022 NeurIPS paper. The core idea is simple: prompting the model to generate explicit intermediate reasoning steps before producing a final answer improves performance on complex tasks. Instead of jumping from question to answer, the model works through the problem step by step, and each step becomes part of the context for the next.
Two variants exist. Few-shot CoT provides example reasoning traces in the prompt — the model sees how a similar problem was solved and follows the pattern. Zero-shot CoT requires no examples; it simply appends a phrase like “Let’s think step by step” to the prompt, which is enough to elicit a reasoning trace on most capable models.
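In code, the two variants differ only in what surrounds the question. A minimal sketch, assuming a hypothetical `llm` callable that stands in for whatever completion client your stack uses:

```python
# Few-shot vs zero-shot CoT. `llm` is a hypothetical stand-in for any
# text-completion client (prompt in, completion out).

FEW_SHOT_EXAMPLE = (
    "Q: A store had 23 apples and sold 9. How many remain?\n"
    "A: Start with 23 apples. Selling 9 leaves 23 - 9 = 14. The answer is 14.\n"
)

def few_shot_cot(question: str, llm) -> str:
    # The worked example shows the model the reasoning pattern to imitate.
    return llm(f"{FEW_SHOT_EXAMPLE}\nQ: {question}\nA:")

def zero_shot_cot(question: str, llm) -> str:
    # A single trigger phrase is enough to elicit a reasoning trace.
    return llm(f"Q: {question}\nA: Let's think step by step.")
```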
The Wharton Generative AI Lab’s 2025 research changed how teams think about CoT. Their prompting science report found that chain-of-thought adds 20-80% in time and tokens with marginal benefit when used with modern reasoning models. The models that benefited most from CoT in 2022 have since been superseded by models that reason effectively by default. Adding explicit CoT prompting to a reasoning model often adds cost without adding accuracy.
The practical implication: CoT is most valuable on tasks where the derivation path itself is the deliverable — SQL generation, multi-step calculations, structured analysis — and on older or smaller models where the reasoning scaffold meaningfully improves output. It is less valuable as a universal technique applied to every prompt regardless of model or task type.
Where does chain-of-thought work best?
Chain-of-thought performs well on tasks where the path to the answer involves multiple dependent steps and the intermediate steps are verifiable. SQL query construction from natural language is a clean example: the model needs to identify the relevant tables, determine join conditions, apply filters, and construct the query in the correct order. Each step depends on the previous one, and a visible reasoning trace makes errors easier to detect and correct.
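As an illustration, a text-to-SQL prompt can make that dependency order explicit. The template and schema below are hypothetical:

```python
# Hypothetical CoT prompt template for text-to-SQL. The numbered steps
# mirror the dependency order described above: tables, joins, filters, query.
SQL_COT_TEMPLATE = """Schema:
{schema}

Question: {question}

Reason step by step before writing SQL:
1. Identify the relevant tables.
2. Determine the join conditions.
3. Apply the filters implied by the question.
4. Write the final query.

End with a single SQL statement on the last line."""

prompt = SQL_COT_TEMPLATE.format(
    schema="orders(id, customer_id, total, created_at); customers(id, region)",
    question="Total order value by region for 2024",
)
```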
Multi-step arithmetic, structured data extraction, and formal analysis tasks follow the same pattern. CoT also helps on tasks where the model needs to distinguish between similar options — classification problems with ambiguous boundaries benefit from a reasoning trace that makes the decision criteria explicit.
Where does chain-of-thought break?
Chain-of-thought breaks when the premise is wrong. If the model’s definition of “active customer” is incorrect, a CoT trace will produce a logically coherent, internally consistent, well-structured wrong answer. The reasoning steps will follow correctly from the wrong premise. Self-consistency checking — running multiple CoT traces and taking the majority answer — does not fix this, because all traces share the same faulty premise.
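Self-consistency itself is easy to sketch, and the sketch makes the limitation visible: every sampled trace reads the same prompt, so a faulty premise is shared by all of them. `sample_cot` is a hypothetical function that samples one CoT trace and returns its final answer:

```python
from collections import Counter

def self_consistency(question: str, sample_cot, n: int = 5) -> str:
    # Sample n independent CoT traces and keep the majority final answer.
    # All traces reason from the same prompt, so a wrong premise in that
    # prompt is shared by every trace: the vote converges on it.
    answers = [sample_cot(question) for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]
```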
CoT also breaks when the task requires information outside the context window. The model cannot reason its way to data it does not have. And on modern reasoning models, CoT often adds latency and token cost without a commensurate accuracy gain, as the Wharton research showed.
How does ReAct interleave reasoning and acting?
ReAct was introduced in Yao et al.'s ICLR 2023 paper. The name is a portmanteau: Reasoning + Acting. The technique interleaves reasoning traces with action calls — typically tool calls — in a loop: Reason about the current state, Act by calling a tool, Observe the result, then Reason again based on the updated context.
The original paper showed ReAct outperforming baseline methods by 34% absolute on ALFWorld, a household task benchmark, using only one or two in-context examples. The improvement came from the model’s ability to retrieve new information mid-reasoning rather than relying solely on what was in the initial context window.
In production agent systems, the ReAct loop is the dominant pattern for any task that requires data retrieval. The agent reasons about what it needs, calls a tool to retrieve it, observes the result, incorporates the new information, and reasons about the next step. This loop continues until the task is complete or a termination condition is reached.
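A minimal sketch of that loop, with hypothetical `llm` and `tools` interfaces and deliberately simplified action parsing (production frameworks replace the string format with structured tool calling):

```python
# Minimal ReAct loop sketch. `llm` and `tools` are hypothetical stand-ins:
# `llm` maps a prompt to a completion; `tools` maps names to callables.

def react(question: str, llm, tools: dict, max_steps: int = 8) -> str:
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        # Reason: the model emits a thought plus either an action or an answer.
        step = llm(transcript + "Thought:")
        transcript += f"Thought:{step}\n"
        if "Final Answer:" in step:
            return step.split("Final Answer:", 1)[1].strip()
        if "Action:" in step:
            # Act: expects the simplified format "Action: tool_name[input]".
            call = step.split("Action:", 1)[1].strip()
            name, arg = call.split("[", 1)
            observation = tools[name.strip()](arg.rstrip("]"))
            # Observe: the result becomes context for the next Reason step.
            transcript += f"Observation: {observation}\n"
    return "Step budget exhausted before a final answer."
```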
Why does ReAct matter for enterprise agents?
Enterprise data is not static and does not fit in a context window. A data agent answering a question about quarterly revenue needs to retrieve actual revenue figures from a data warehouse, apply the correct definition of revenue from the semantic layer, check for any pending adjustments from the accounting team, and then reason about the result. No amount of chain-of-thought reasoning produces the right answer without those retrieval steps.
ReAct makes those retrieval steps a first-class part of the reasoning process. The agent does not have to front-load all possible information into the prompt — it retrieves what it needs, when it needs it, and incorporates each observation into its next reasoning step.
How do ReAct loops handle token budgets?
Token budget is a real constraint on ReAct agents. Each Reason-Act-Observe cycle adds tokens: the reasoning trace, the tool call specification, and the observation all consume context window space. Long ReAct loops on complex tasks can exhaust the budget before the task is complete.
Budget-aware implementations track consumption at each step and adjust behavior as the budget shrinks. Common strategies include shortening reasoning traces in later steps, batching tool calls where possible, and synthesizing from partial information when the budget approaches the limit. Some implementations use a separate budget controller that monitors token consumption and signals the agent to wrap up before the window fills.
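One shape such a budget controller might take, using whitespace splitting as a crude stand-in for a real tokenizer:

```python
class TokenBudget:
    """Tracks approximate token spend across a ReAct loop and signals
    when the agent should wrap up. Hypothetical sketch: real
    implementations count tokens with the model's own tokenizer."""

    def __init__(self, limit: int, wrap_up_fraction: float = 0.8):
        self.limit = limit
        self.spent = 0
        self.wrap_up_at = int(limit * wrap_up_fraction)

    def charge(self, text: str) -> None:
        # Crude approximation: one token per whitespace-separated word.
        self.spent += len(text.split())

    @property
    def should_wrap_up(self) -> bool:
        # Signal to synthesize from partial information before the window fills.
        return self.spent >= self.wrap_up_at

    @property
    def exhausted(self) -> bool:
        return self.spent >= self.limit
```

In the ReAct loop sketched earlier, each thought and observation would be passed to `charge`, and a true `should_wrap_up` would swap the next prompt for an instruction to synthesize a final answer from what has been retrieved so far.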
Where does ReAct work best?
ReAct performs best on tasks that require information retrieval during reasoning — not just at the start. Research tasks where the agent needs to follow a chain of citations, data analysis tasks where intermediate results determine the next query, and workflow automation tasks where each step depends on the outcome of the previous one are all strong fits.
ReAct also performs well on tasks with a clear success criterion that can be verified mid-loop. If the agent can check its intermediate results against a known expected format or value range, the observe step catches errors before they compound.
Where does ReAct break?
ReAct breaks when tool calls return unreliable or ambiguous results. The observe step incorporates whatever the tool returns — if the tool returns stale data, the agent reasons from stale data. If the tool returns an error, the agent must decide how to proceed without the expected information.
ReAct also breaks when the context layer feeding the tools is ungoverned. A ReAct agent calling a semantic layer with conflicting definitions of a business metric will retrieve one of the conflicting definitions and reason from it confidently. The loop structure does not detect the conflict — it just propagates it.
How does reflection critique an LLM’s own output?
Reflection prompts the model to critique its own output after generation. The simplest form is self-reflection: the model generates an answer, then receives a prompt asking it to evaluate the answer for errors, inconsistencies, or gaps, and to produce a revised answer if needed. The Reflexion paper formalized this pattern and showed that verbal self-feedback can meaningfully improve agent performance across reasoning and coding tasks.
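A minimal self-reflection pass, reusing the hypothetical `llm` callable from the earlier sketches:

```python
def reflect(question: str, draft: str, llm) -> str:
    # Ask the generating model to critique its own draft, then revise
    # only if the critique surfaced something.
    critique = llm(
        f"Question: {question}\nDraft answer: {draft}\n"
        "List any errors, inconsistencies, or gaps in the draft. "
        "If there are none, reply NONE."
    )
    if critique.strip() == "NONE":
        return draft
    return llm(
        f"Question: {question}\nDraft answer: {draft}\n"
        f"Critique: {critique}\n"
        "Write a revised answer that addresses the critique."
    )
```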
External reflection uses a separate evaluator — a different model, a rule-based checker, or a human reviewer — to assess the generated output against ground truth or a rubric. External reflection can catch factual errors that are outside the generating model’s knowledge boundary, which self-reflection cannot.
In production systems, reflection is most commonly used for high-stakes outputs: compliance documents, financial reports, and customer-facing communications where the cost of an error is high. The reflection step adds latency and token cost, so it is not applied universally — it is reserved for outputs where the verification is worth the overhead.
What is the boundary of self-critique?
Self-reflection can catch logical inconsistencies, formatting errors, and reasoning steps that do not follow from their premises. It cannot catch factual errors that the model does not know are wrong. If the model believes that a particular regulation went into effect in 2023 when it actually went into effect in 2024, self-reflection will not surface that error — the model will evaluate its answer as correct.
This is the hard boundary of self-critique: the model can only evaluate its output against its own knowledge. Errors at the knowledge boundary — wrong definitions, outdated facts, incorrect business rules — pass through reflection unchanged. External reflection with ground-truth verification is the only technique that catches these errors systematically.
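Mechanically, external reflection against ground truth can be as simple as the sketch below; the hard part is maintaining the governed source of record it compares against. The claim keys here are hypothetical:

```python
def external_check(claims: dict, ground_truth: dict) -> list:
    # Compare claims extracted from the output against a governed source
    # of record. This catches exactly what self-reflection cannot: facts
    # the generating model believes but has wrong.
    return [
        (key, claimed, ground_truth[key])
        for key, claimed in claims.items()
        if key in ground_truth and ground_truth[key] != claimed
    ]

mismatches = external_check(
    claims={"regulation_effective_year": 2023},
    ground_truth={"regulation_effective_year": 2024},
)
# -> [("regulation_effective_year", 2023, 2024)]
```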
Why is context the upstream variable all three techniques share?
Chain-of-thought, ReAct, and reflection are all model-internal strategies. They improve how the agent thinks. They cannot improve what the agent knows. Every technique depends on the quality of the context in the window — the definitions, rules, examples, and facts that the agent reasons from.
| Dimension | Chain-of-thought | ReAct | Reflection |
|---|---|---|---|
| Reasoning pattern | Linear, sequential | Iterative loop | Self-critique pass |
| External tool calls | No | Yes | Optional (external reflection) |
| Context window only | Yes | No | Yes (self) / No (external) |
| Catches premise errors | No | No | No (self) / Partial (external) |
| Token cost | Medium | High | Medium to high |
| Best for | Multi-step derivation, structured analysis | Retrieval-dependent tasks, live data | Coherence checking, high-stakes outputs |
| Enterprise use case | SQL reasoning, report generation | Data agent workflows, research | Compliance outputs, financial reporting |
The table makes the shared dependency visible. Every entry in the “Catches premise errors” row is No or Partial. No technique reliably catches errors at the premise level. A wrong definition of revenue, an outdated policy rule, a business metric that changed meaning after a product launch — these errors flow through all three techniques and produce wrong answers that are internally consistent, confidently stated, and structurally correct.
Context quality is the upstream variable. Reasoning technique is the downstream amplifier. Optimizing the amplifier before fixing the upstream signal produces a more efficient path to the wrong answer.
How do reasoning, context, and evaluation work together?
Production reliability requires three components working together, not one optimized in isolation.
The first component is reasoning matched to task. Chain-of-thought for derivation-heavy tasks where the reasoning path is verifiable. ReAct for retrieval-dependent tasks where the agent needs to query live data. Reflection for high-stakes outputs where a verification pass is worth the overhead. Many production agents combine all three: a ReAct loop with chain-of-thought reasoning traces and a reflection pass on final outputs before delivery.
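Composed, the three techniques are stages of one pipeline. A sketch that chains the hypothetical pieces from the earlier blocks:

```python
def answer(question: str, llm, tools: dict) -> str:
    # ReAct loop for retrieval (its "Thought:" steps already carry
    # CoT-style traces), then a reflection pass before delivery.
    draft = react(question, llm, tools)
    return reflect(question, draft, llm)
```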
The second component is a governed context layer. Definitions certified by domain experts. Business rules with clear ownership and version history. Examples that reflect current data, not data from the last time someone updated the documentation. The reasoning technique cannot compensate for a context layer that is stale, conflicted, or ungoverned. Atlan AI Labs’ research on production deployments consistently finds that the gap between demo accuracy and production accuracy traces to context quality, not reasoning technique.
The third component is systematic evaluation. Not just “does the output look right” — but grounded evaluation against known correct answers, with failure analysis that distinguishes between reasoning errors (the technique failed) and knowledge errors (the context was wrong). Without this distinction, teams optimize reasoning when the problem is context, and vice versa.
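A sketch of that failure split, assuming each evaluation case records the premises the agent actually used alongside the certified premises it should have used (all field names are hypothetical):

```python
from dataclasses import dataclass

@dataclass
class EvalCase:
    # Hypothetical fields for one grounded evaluation case.
    predicted: str
    expected: str
    premises_used: dict       # definitions/rules the agent reasoned from
    certified_premises: dict  # what the governed context layer says

def classify_failure(case: EvalCase) -> str:
    if case.predicted == case.expected:
        return "pass"
    if case.premises_used != case.certified_premises:
        # The agent reasoned from wrong context: fix the context layer.
        return "knowledge_error"
    # The context was right but the output was wrong: fix the technique.
    return "reasoning_error"
```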
How Atlan approaches reasoning-aware context infrastructure
The challenge
Enterprise AI agents fail in production for a predictable reason: they reason correctly from incorrect context. A ReAct agent with a sophisticated reasoning loop still retrieves the wrong definition from the semantic layer if the semantic layer is ungoverned. A chain-of-thought agent produces a logically impeccable SQL query that returns wrong results if the business glossary has conflicting definitions. The reasoning technique is not the failure point — the context infrastructure is.
Atlan’s research across production deployments found that the majority of agent failures traced to three context problems: stale definitions that had not been updated after a business change, conflicting definitions owned by different teams, and missing context that the agent silently substituted with a plausible guess.
The approach
Atlan addresses these problems at the context layer, not the reasoning layer. The Context Engineering Studio provides the infrastructure for governed context: certified definitions with version history, business glossary entries with clear ownership, and lineage that connects every data asset to the business process it supports.
For reasoning-aware deployments, Atlan surfaces context quality signals alongside the context itself. When an agent retrieves a definition, it also retrieves the certification status, the last review date, and the owner. This gives the reasoning layer — whether chain-of-thought, ReAct, or reflection — the information it needs to calibrate confidence appropriately.
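Illustratively, a context entry that carries its own quality signals might look like the sketch below. The field names are hypothetical rather than Atlan's actual schema; the point is that certification status and review age travel with the definition, so the reasoning layer can hedge instead of silently applying a stale rule:

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class ContextEntry:
    # Hypothetical shape for a definition plus its quality signals.
    term: str
    definition: str
    certified: bool
    last_reviewed: date
    owner: str

def confidence_note(entry: ContextEntry, max_age_days: int = 180) -> str:
    # Turn quality signals into an explicit caveat for the agent's output.
    age = (date.today() - entry.last_reviewed).days
    if not entry.certified:
        return f"'{entry.term}' is uncertified; treat this answer as provisional."
    if age > max_age_days:
        return f"'{entry.term}' was last certified {age} days ago and may be stale."
    return ""
```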
The Atlan MCP server makes this context available to any agent framework — LangChain, CrewAI, custom implementations — without requiring the agent team to rebuild the context layer. The same governed definitions, lineage, and ownership metadata that power Atlan’s own agents are available as a context feed for any agent that needs to reason about enterprise data.
The outcome
When agents reason from governed context, the failure modes change. Instead of wrong answers produced confidently, teams see uncertainty surfaced explicitly — agents that say “this definition was last certified six months ago and may be stale” rather than silently applying an outdated rule. Instead of conflicting answers from different agents, teams see consistent outputs because all agents read from the same certified source.
Atlan AI Labs’ pilots with enterprise customers have consistently shown that improving context quality produces larger accuracy gains than improving reasoning technique on the same tasks. The reasoning technique matters — matching technique to task type is real work — but the ceiling on reasoning accuracy is set by context quality, not by the sophistication of the reasoning loop.
How enterprises ground reasoning in governed context
Workday
“We built a revenue analysis agent and it couldn’t answer one question. We started to realize we were missing this translation layer. All of the work that we did to get to a shared language amongst people at Workday can be leveraged by AI via Atlan’s MCP server.” — Joe DosSantos, VP Enterprise Data & Analytics, Workday
Workday built a revenue analysis agent that could not answer a basic business question, despite competent reasoning architecture. The gap was not technique selection. It was the absence of a translation layer between human language and the data’s structure. Building that layer and serving it via MCP let the agent’s reasoning operate on premises the business actually trusts.
DigiKey
“Atlan is much more than a catalog of catalogs. It’s more of a context operating system… Atlan enabled us to easily activate metadata for everything from discovery in the marketplace to AI governance to data quality to an MCP server delivering context to AI models.” — Sridher Arumugham, Chief Data & Analytics Officer, DigiKey
DigiKey treats the context layer as operating infrastructure that sits above any specific reasoning approach. Whether an agent uses CoT, ReAct, or reflection (or all three in different workflows), they all draw from the same governed context, which means the reasoning pattern is interchangeable but the premises are not.
Why reasoning technique is the second-order decision
The debate over chain-of-thought versus ReAct versus reflection is real but second-order. The first-order decision is whether the context the agent reasons from is accurate, current, and governed. Once that foundation exists, the choice of reasoning technique becomes a genuine optimization problem — matching the technique to the task type, managing token budgets, and combining techniques where the task demands it.
Without that foundation, the choice of reasoning technique is an optimization of the wrong variable. A ReAct agent with a sophisticated reasoning loop and a stale context layer will produce wrong answers faster and more expensively than a simpler agent with the same problem. Chain-of-thought will produce a more detailed wrong answer. Reflection will confirm that the wrong answer is internally consistent.
The teams making consistent progress on production AI reliability are the ones that sequence the work correctly: governed context first, reasoning technique second, systematic evaluation throughout.
FAQs about LLM reasoning
1. What is chain-of-thought reasoning in LLMs?
Chain-of-thought prompts the model to generate explicit intermediate reasoning steps before producing a final answer. Rather than jumping from question to answer, the model works through the problem step by step, with each step becoming part of the context for the next. Wharton’s 2025 research found CoT adds 20-80% in time and tokens with marginal benefit on modern reasoning models that already reason effectively by default.
2. What is ReAct in AI agents?
ReAct (Reasoning + Acting) interleaves reasoning traces with tool calls in a Reason-Act-Observe loop. The agent reasons about what it needs, calls a tool to retrieve it, observes the result, incorporates the new information, and reasons about the next step. The original research showed ReAct outperforming baselines by 34% absolute on ALFWorld with only one or two in-context examples. ReAct is the dominant pattern for enterprise agents that need to query live data during reasoning.
3. Chain-of-thought vs ReAct: which should I use?
Use chain-of-thought when the agent can solve the problem from its existing context window and the task involves multi-step derivation where the reasoning path is verifiable — SQL generation, structured analysis, formal reasoning. Use ReAct when the agent needs to retrieve information during reasoning — data queries, research tasks, workflow automation where each step depends on the outcome of the previous one. Many production agents combine both: a ReAct loop with chain-of-thought reasoning traces at each step.
4. What is reflection in LLM agents?
Reflection prompts the model to critique its own output after generation. Self-reflection asks the generating model to evaluate its answer for errors and inconsistencies. External reflection uses a separate evaluator — a different model, a rule-based checker, or a human reviewer — to assess the output against ground truth. Self-reflection catches logical inconsistencies. External reflection catches factual errors outside the generating model’s knowledge boundary. Reflection adds latency and token cost, so it is typically reserved for high-stakes outputs where the verification overhead is justified.
5. How does token budget affect LLM reasoning?
Token budget constrains how many reasoning steps and tool calls an agent can execute. Each Reason-Act-Observe cycle in a ReAct loop consumes context window space. Budget-aware implementations track consumption at each step and adjust: shortening reasoning traces as the budget shrinks, batching tool calls where possible, and synthesizing from partial information when the budget approaches the limit. Ignoring token budget in production ReAct agents is a common cause of incomplete task execution and truncated outputs.
6. Why does better reasoning not fix bad context?
Reasoning techniques are model-internal strategies that improve how the agent thinks. They cannot create knowledge the agent does not have. An agent with a wrong definition of revenue will produce a logically structured, coherent, well-reasoned wrong answer regardless of whether it uses chain-of-thought, ReAct, or reflection. Self-reflection will confirm the wrong answer is internally consistent. ReAct will retrieve and incorporate additional data that supports the wrong premise. Chain-of-thought will show its work arriving at the wrong conclusion. Context quality is the upstream variable. Reasoning technique is the downstream amplifier.