Chain-of-Thought Reasoning

In Short

Chain-of-thought (CoT) prompting asks a language model to produce intermediate reasoning steps before its final answer, not just the answer itself. On sufficiently large models, this can substantially improve accuracy on multi-step math, logic, and commonsense tasks. It comes in two main flavors: few-shot (provide worked examples) and zero-shot (append "Let's think step by step").

01. What It Is

Chain-of-thought prompting is a technique that elicits intermediate reasoning steps from a language model as part of its response. Instead of prompting "Q: What is 30% of 180? A:", you prompt the model to show its work, producing a visible scratchpad that leads to the final answer.

The technique was formalized in Wei et al. (2022), "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models" (arXiv:2201.11903), authored by Jason Wei, Xuezhi Wang, Dale Schuurmans, and colleagues at Google. The paper demonstrated that providing a handful of worked examples containing explicit reasoning steps caused large language models to solve arithmetic, commonsense, and symbolic reasoning problems far more accurately than standard prompting.

A companion technique, zero-shot CoT, was introduced by Kojima et al. (2022) in "Large Language Models are Zero-Shot Reasoners" (arXiv:2205.11916). Their key finding: simply appending "Let's think step by step" to a question, with no examples, was enough to make models produce reasoning chains and answer more accurately.

02. Why It Matters

Before CoT, large language models often failed at tasks requiring more than one inference step. A model trained to predict the next token could answer single-hop factual questions well, but struggled with problems like "If Tom has 3 apples and buys twice as many as he has, how many does he have?" Forced to jump from question to answer in one step, the model had to compress multi-step logic into the few tokens of the answer itself, which it could not reliably do.

CoT matters because it offloads reasoning into the generation process itself. The model can "think on paper." Each generated step constrains what comes next, making errors visible and correctable. For tasks that mirror human problem-solving (math, logic, planning), this unlocks a qualitative capability improvement without any additional training.

03. How It Works

Few-shot CoT:
You include 4-8 demonstration examples in your prompt. Each example shows a question followed by a step-by-step solution, then the final answer. The model, seeing this pattern, applies the same structure to the new question.

Example prompt structure:

Q: Roger has 5 tennis balls. He buys 2 more cans of 3 balls each. How many does he have?
A: Roger starts with 5 balls. He buys 2 cans of 3 balls each, so 2 * 3 = 6 new balls. 5 + 6 = 11 balls. The answer is 11.

Q: [new question]
A:
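
The prompt structure above can be assembled programmatically. The following is a minimal sketch, not a canonical implementation: the function name `build_few_shot_cot_prompt` and the `Exemplar` type are illustrative, and the "Q:" / "A:" layout simply mirrors the example shown above.

```python
from typing import List, Tuple

# Each exemplar is (question, worked_solution). The solution text should
# contain the reasoning steps and end with the final answer.
Exemplar = Tuple[str, str]


def build_few_shot_cot_prompt(exemplars: List[Exemplar], new_question: str) -> str:
    """Concatenate worked examples, then leave the final 'A:' open for the model."""
    blocks = [f"Q: {q}\nA: {a}" for q, a in exemplars]
    blocks.append(f"Q: {new_question}\nA:")
    return "\n\n".join(blocks)


exemplars = [
    (
        "Roger has 5 tennis balls. He buys 2 more cans of 3 balls each. "
        "How many does he have?",
        "Roger starts with 5 balls. He buys 2 cans of 3 balls each, "
        "so 2 * 3 = 6 new balls. 5 + 6 = 11 balls. The answer is 11.",
    ),
]

prompt = build_few_shot_cot_prompt(exemplars, "What is 30% of 180?")
print(prompt)
```

Ending every exemplar with the same phrase ("The answer is ...") makes the final answer easy to parse out of the model's completion later.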

Zero-shot CoT:
No examples needed. You simply append a trigger phrase such as "Let's think step by step" or "Think through this carefully" to the user prompt. This alone is sufficient to elicit reasoning steps from capable models.
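
A minimal sketch of zero-shot CoT follows. It uses two model calls, one to elicit the reasoning and one to extract a short final answer, which is the general shape Kojima et al. describe. The `generate` callable is a hypothetical stand-in for whatever completion API you use, the answer cue wording is an assumption, and `fake_generate` returns canned text only so the example runs without a model.

```python
from typing import Callable

TRIGGER = "Let's think step by step."
ANSWER_CUE = "Therefore, the answer is"  # assumed wording for the extraction step


def zero_shot_cot(question: str, generate: Callable[[str], str]) -> str:
    """Two calls: one to elicit reasoning, one to extract the answer.

    `generate` is a hypothetical stand-in for any text-completion call.
    """
    reasoning_prompt = f"Q: {question}\nA: {TRIGGER}"
    reasoning = generate(reasoning_prompt)
    # Feed the reasoning back in and cue the model for a short final answer.
    answer_prompt = f"{reasoning_prompt} {reasoning}\n{ANSWER_CUE}"
    return generate(answer_prompt).strip()


def fake_generate(prompt: str) -> str:
    """Canned responses so the sketch runs without a model."""
    if prompt.endswith(ANSWER_CUE):
        return " 54."
    return "30% is 0.30. 0.30 * 180 = 54."


result = zero_shot_cot("What is 30% of 180?", fake_generate)
print(result)
```

The second call is optional when a free-form answer is acceptable; it mainly helps when the answer must be compared against a reference.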

Self-consistency (Wang et al., 2022):
A refinement of few-shot CoT: sample multiple independent reasoning chains for the same question (using temperature > 0), then take a majority vote over their final answers. This works because complex problems admit multiple valid reasoning paths to the same correct answer. If many independent paths converge on the same result, that result is more likely to be correct. Wang et al. reported absolute accuracy gains of +17.9% on GSM8K, +11.0% on SVAMP, and +12.2% on AQuA over single-chain CoT.
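
The voting step of self-consistency can be sketched in a few lines. This is an illustration under stated assumptions: the chains are hard-coded stand-ins for sampled model outputs, and the regular expression assumes every chain ends with "The answer is <number>", as in the few-shot exemplar format. The helper names are hypothetical.

```python
import re
from collections import Counter
from typing import List, Optional

# Assumes chains end with "The answer is <number>", as in the exemplar format.
ANSWER_PATTERN = re.compile(r"The answer is\s+(-?\d+(?:\.\d+)?)")


def extract_answer(chain: str) -> Optional[str]:
    """Return the last stated numeric answer in a chain, or None if absent."""
    matches = ANSWER_PATTERN.findall(chain)
    return matches[-1] if matches else None


def majority_vote(chains: List[str]) -> Optional[str]:
    """Pick the most common final answer across sampled chains."""
    answers = [a for a in map(extract_answer, chains) if a is not None]
    if not answers:
        return None
    return Counter(answers).most_common(1)[0][0]


# Stand-ins for chains sampled at temperature > 0 for the same question.
sampled_chains = [
    "2 cans of 3 balls is 6 balls. 5 + 6 = 11. The answer is 11.",
    "Two cans of 3 is 6, and 5 + 6 = 11. The answer is 11.",
    "5 + 2 = 7, then 7 * 3 = 21. The answer is 21.",
]
voted = majority_vote(sampled_chains)
print(voted)
```

Chains without a parseable answer are dropped rather than counted, so a malformed sample cannot win the vote.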

Auto-CoT (Zhang et al., 2022):
Eliminates manual example creation. Questions are clustered so that the selected examples are diverse, and a reasoning chain is generated automatically via zero-shot CoT for a representative question from each cluster. Simple heuristics (question length, number of reasoning steps) decide which chains are kept as few-shot examples.
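
The selection stage of Auto-CoT can be sketched as follows. This is a simplified illustration, not the authors' implementation: the clustering step is omitted and clusters are assumed to be given, `generate_chain` is a hypothetical callable standing in for a zero-shot CoT model call, the thresholds are assumed values, and counting non-empty lines is only a rough proxy for reasoning steps.

```python
from typing import Callable, Dict, List, Tuple

MAX_QUESTION_WORDS = 60   # assumed threshold, for illustration
MAX_REASONING_STEPS = 5   # assumed threshold, for illustration


def count_steps(chain: str) -> int:
    """Approximate the step count as the number of non-empty lines."""
    return sum(1 for line in chain.splitlines() if line.strip())


def passes_heuristics(question: str, chain: str) -> bool:
    """Keep short questions whose chains are non-empty and not too long."""
    return (
        len(question.split()) <= MAX_QUESTION_WORDS
        and 0 < count_steps(chain) <= MAX_REASONING_STEPS
    )


def select_demos(
    clusters: Dict[int, List[str]],
    generate_chain: Callable[[str], str],
) -> List[Tuple[str, str]]:
    """Pick at most one (question, chain) demo per cluster, in cluster-id order."""
    demos = []
    for cluster_id in sorted(clusters):
        for question in clusters[cluster_id]:
            chain = generate_chain(question)
            if passes_heuristics(question, chain):
                demos.append((question, chain))
                break  # one demo per cluster keeps the set diverse
    return demos


def fake_generate_chain(question: str) -> str:
    """Canned zero-shot CoT output; a real system would call a model here."""
    canned = {
        "What is 2 + 3?": "2 + 3 = 5.\nThe answer is 5.",
        "What is 30% of 180?": "\n".join(f"Step {i}." for i in range(1, 9)),
        "What is 10% of 50?": "10% is 0.1.\n0.1 * 50 = 5.\nThe answer is 5.",
    }
    return canned[question]


clusters = {
    0: ["What is 2 + 3?"],
    1: ["What is 30% of 180?", "What is 10% of 50?"],
}
demos = select_demos(clusters, fake_generate_chain)
print([q for q, _ in demos])
```

In this toy run the eight-step chain in cluster 1 is rejected by the step limit, so the next question in that cluster supplies the demonstration instead.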

Emergent behavior:
Wei et al. found that CoT is an emergent ability. Small models (under ~10B parameters) show no benefit or even degradation from CoT prompting. The gains appear primarily in models with 100B+ parameters (at 2022 scales), though post-training alignment in smaller models has since shifted this threshold downward.

04. Key Terms / Benchmarks

GSM8K:
8,500 grade-school math word problems. Wei et al. used 8 CoT exemplars with a 540B model to reach state-of-the-art, outperforming fine-tuned GPT-3 with a verifier.

SVAMP:
Arithmetic word problems designed to test robustness. Self-consistency raised accuracy by 11.0% (absolute) over single-chain CoT.

AQuA-RAT:
Algebraic word problems. Self-consistency raised accuracy by 12.2% (absolute).

StrategyQA:
Commonsense multi-hop questions. Self-consistency raised accuracy by 6.4% (absolute).

Tree-of-Thought (ToT):
A 2023 extension that generates multiple reasoning paths as a tree, explores branches, and prunes dead ends. Goes beyond linear CoT chains.

Scratchpad:
General term for any intermediate computation generated before a final answer, whether in CoT prompting or in reasoning models.
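
To make the Tree-of-Thought entry above more concrete, here is a toy sketch of the search pattern it describes: propose several next steps from each partial path, prune dead ends, and keep only the most promising branches at each level. Everything here is an assumption for illustration. In a real Tree-of-Thought setup a language model would propose and evaluate the steps; in this sketch integers stand in for reasoning states, and `propose`, `is_dead_end`, and `score` are hypothetical plain functions.

```python
from typing import Callable, List, Optional

Path = List[int]


def tree_search(
    start: int,
    is_goal: Callable[[int], bool],
    propose: Callable[[int], List[int]],
    is_dead_end: Callable[[int], bool],
    score: Callable[[int], float],
    beam_width: int = 2,
    max_depth: int = 4,
) -> Optional[Path]:
    """Breadth-first search over partial paths, keeping the best few per level."""
    frontier: List[Path] = [[start]]
    for _ in range(max_depth):
        candidates: List[Path] = []
        for path in frontier:
            for nxt in propose(path[-1]):
                if is_dead_end(nxt):
                    continue  # prune branches that cannot succeed
                new_path = path + [nxt]
                if is_goal(nxt):
                    return new_path
                candidates.append(new_path)
        # Keep only the highest-scoring partial paths for the next level.
        candidates.sort(key=lambda p: score(p[-1]), reverse=True)
        frontier = candidates[:beam_width]
        if not frontier:
            return None
    return None


# Toy problem: reach 10 from 1 using "+3" or "*2" moves.
TARGET = 10
found = tree_search(
    start=1,
    is_goal=lambda v: v == TARGET,
    propose=lambda v: [v + 3, v * 2],
    is_dead_end=lambda v: v > TARGET,  # overshooting can never recover
    score=lambda v: -abs(TARGET - v),  # closer to the target is better
)
print(found)
```

A linear CoT chain corresponds to a beam width of 1 with no pruning; the tree version trades extra model calls for the ability to abandon a bad branch.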

05. Examples

Zero-shot CoT (Kojima et al. style):

Q: A bat and a ball cost $1.10 in total. The bat costs $1.00 more than the ball. How much does the ball cost?
Let's think step by step.

A: Let the ball cost x cents. The bat costs x + 100 cents. Together: x + (x + 100) = 110. So 2x = 10, x = 5. The ball costs 5 cents.

Without "Let's think step by step," many models answered $0.10 (the intuitive wrong answer).

Few-shot CoT on commonsense:

Q: Would a pear sink in water?
A: The density of a pear is about 0.6 g/cm³. Water has a density of 1 g/cm³. Objects less dense than water float. Since 0.6 < 1, a pear would float, not sink.

06. Common Pitfalls / Misconceptions

CoT is not the same as reasoning models:
CoT prompting is a prompting technique applied to a standard language model. Reasoning models (OpenAI o-series, DeepSeek R1, Claude with extended thinking) are trained via reinforcement learning to perform extended internal computation. They generate thinking tokens because of how they were trained, not because a human typed "let's think step by step."

CoT has diminishing value on modern models:
The 2025 Wharton GAIL tech report by Meincke, Mollick, Mollick, and Shapiro (arXiv:2506.07142) found that explicit CoT prompting provides modest average improvement on non-reasoning models (+4.4% to +13.5%) but negligible gains on reasoning models (+2.9% to +3.1%), with one model showing a -3.3% drop. The time cost (35-600% longer responses) often does not justify the gain. Many current models perform CoT-like reasoning implicitly by default.

CoT can introduce inconsistency:
On questions a model already answers correctly, forcing CoT can cause it to talk itself into the wrong answer by over-reasoning.

Longer chains are not always better:
More reasoning steps can drift from the question or amplify early errors. Quality and direction of reasoning steps matter more than quantity.

The visible reasoning may not reflect actual computation:
Anthropic research found that when reasoning models were given inserted hints, the visible reasoning chain mentioned the hint only 25-39% of the time, suggesting displayed reasoning can be post-hoc rationalization rather than the actual computation path.
