Chain-of-Thought Prompting: What the Research Actually Shows
Chain-of-thought prompting has become a standard technique for improving LLM reasoning on complex tasks. We review the key findings, explain where the gains come from, and identify the limits that the headline results do not always surface.
AI with human touch
12 April 2025
TL;DR
- Chain-of-thought (CoT) prompting — asking a model to "think step by step" — reliably improves performance on multi-step reasoning tasks in larger models
- The technique works by inducing intermediate reasoning steps that allow the model to decompose complex problems before producing an answer
- CoT gains are most reliable on arithmetic, commonsense reasoning, and symbolic manipulation tasks; they are more variable on knowledge-intensive and factual tasks
- The technique is less effective on smaller models and shows limited benefit for simple classification tasks where no reasoning chain is needed
- Self-consistency — sampling multiple reasoning chains and majority-voting on answers — typically improves CoT performance further at the cost of additional inference compute
Chain-of-thought (CoT) prompting is one of the most widely replicated techniques in applied LLM research. The core finding, introduced by Wei et al. in 2022, is that prompting a model to produce intermediate reasoning steps before its final answer — rather than answering directly — substantially improves performance on tasks requiring multi-step reasoning. The technique has become a default in both research benchmarks and production systems. Understanding what it actually does, and where its limits are, matters for anyone applying it in practice.
The Core Mechanism
When a model produces a chain of thought, it generates a sequence of intermediate steps — effectively working through a problem incrementally in its own output space. The hypothesis is that this externalised reasoning process allows the model to decompose a problem into tractable sub-problems before committing to a final answer, in a way that a direct, single-step response does not.
This is particularly valuable for arithmetic and multi-step reasoning problems, where errors compound: a wrong intermediate step in a direct answer is invisible, while a wrong intermediate step in a chain of thought is visible and partially correctable. The model's own generated context also serves as a form of working memory for complex reasoning chains.
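To make the mechanism concrete, a few-shot CoT prompt simply prepends a worked example whose answer spells out the intermediate steps, so the model imitates that format before committing to its own answer. The sketch below builds such a prompt as a plain string; the exemplar problem is invented for illustration, and the resulting string would be passed to whatever LLM client you use (not shown here).

```python
# Sketch of few-shot chain-of-thought prompt construction.
# The worked exemplar is illustrative, not taken from any benchmark.

COT_EXEMPLAR = (
    "Q: A pack holds 12 pens. Ana buys 3 packs and gives away 7 pens. "
    "How many pens does she have left?\n"
    "A: 3 packs hold 3 * 12 = 36 pens. After giving away 7, "
    "she has 36 - 7 = 29 pens. The answer is 29.\n"
)

def build_cot_prompt(question: str) -> str:
    """Prepend a worked, step-by-step exemplar so the model reproduces
    the intermediate-reasoning format before its final answer."""
    return f"{COT_EXEMPLAR}\nQ: {question}\nA:"

def build_direct_prompt(question: str) -> str:
    """Baseline: same question, no exemplar, no induced reasoning steps."""
    return f"Q: {question}\nA:"

print(build_cot_prompt("A box holds 8 apples. How many apples are in 5 boxes?"))
```

The only difference between the two prompts is the exemplar: the direct variant asks for an answer cold, while the CoT variant demonstrates the decomposition it wants the model to copy.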
What the Research Demonstrates
The original Wei et al. paper showed CoT prompting improved performance on arithmetic word problems, commonsense reasoning benchmarks, and symbolic manipulation tasks. Crucially, the gains were most pronounced in larger models — the technique provided minimal benefit on models below roughly 100B parameters. This suggested that the capacity to produce coherent reasoning chains is itself an emergent property of scale.
Subsequent work has confirmed the core findings while identifying nuances. Kojima et al. (2022) showed that zero-shot CoT — simply appending "Let's think step by step" without few-shot examples — also produces meaningful gains, making the technique more accessible. Self-consistency (Wang et al., 2022), which samples multiple CoT paths and selects the most common final answer, consistently improves results further, at the cost of additional inference.
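The zero-shot variant needs no exemplars at all. A minimal sketch of the two-stage pattern described by Kojima et al. — first elicit a chain with the trigger phrase, then feed the chain back to extract a clean final answer — is below; the exact prompt wording is an assumption, and both strings would go to an LLM client that is not shown.

```python
# Sketch of zero-shot chain-of-thought prompting (two-stage form).
# Prompt wording is illustrative; no model client is included.

COT_TRIGGER = "Let's think step by step."

def zero_shot_cot_prompt(question: str) -> str:
    """Stage 1: elicit a reasoning chain using only the trigger phrase,
    with no few-shot exemplars."""
    return f"Q: {question}\nA: {COT_TRIGGER}"

def answer_extraction_prompt(question: str, reasoning: str) -> str:
    """Stage 2: append the model's generated chain and ask for just
    the final answer, so it can be parsed reliably."""
    return (f"Q: {question}\nA: {COT_TRIGGER} {reasoning}\n"
            f"Therefore, the answer is")

print(zero_shot_cot_prompt("If 4 pencils cost 2 dollars, what do 10 cost?"))
```

In practice many applications skip stage 2 and parse the answer out of the chain directly; the two-stage form trades an extra model call for easier extraction.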
Where the Limits Appear
CoT is not uniformly beneficial. On knowledge-intensive tasks — questions where the answer depends on factual recall rather than reasoning — CoT can introduce errors by giving the model more surface area to confabulate plausible-sounding but incorrect intermediate steps. The reasoning chain looks correct; the facts in it are wrong.
Additionally, CoT performance on benchmarks may not transfer cleanly to novel reasoning tasks. There is evidence that models can learn to mimic the form of a reasoning chain without performing genuine step-by-step reasoning — producing answers that look like they were derived from the chain, but were not. Distinguishing genuine reasoning decomposition from pattern matching on chain-of-thought format remains an open research question.
Practical Implications
For practitioners, the reliable guidance from the literature is: use CoT when the task involves multiple explicit steps, when intermediate steps can be verified, and when you are working with a model large enough to produce coherent chains. Avoid expecting CoT to compensate for factual gaps — if the model doesn't know something, a reasoning chain won't supply the missing knowledge. Self-consistency is a useful reliability improvement when accuracy is critical and inference cost is secondary.
Key Takeaways
- CoT prompting reliably improves LLM performance on multi-step arithmetic, commonsense reasoning, and symbolic tasks
- The technique works by externalising intermediate reasoning, allowing problem decomposition before the final answer
- Gains are most pronounced in larger models; smaller models show limited benefit
- CoT can introduce errors on knowledge-intensive tasks by creating more surface area for confabulation
- Self-consistency (multiple chain sampling + majority vote) improves CoT reliability at the cost of additional inference compute