Skip to content

Reasoning Models and Test-Time Compute

Making AI Useful 6 min read

In Short

Reasoning models are language models trained via reinforcement learning to generate extended internal "thinking" before producing an answer, spending additional compute at inference time rather than just during training. This approach, pioneered by OpenAI's o-series and replicated by DeepSeek R1, Claude, and Gemini, dramatically improves accuracy on hard math, coding, and science tasks at the cost of higher latency and token usage.

100%

Scroll to pan · Ctrl/Cmd + scroll to zoom · drag to pan · double-click to fit

A reasoning model adds an internal thinking step between the prompt and the answer, spending test-time compute to self-correct before responding while a standard model answers in one forward pass. The self-correction behaviors are learned during reinforcement-learning training, not added at inference.

01. What It Is

Reasoning models are a class of large language models that allocate significant compute at inference time by generating intermediate thinking steps, called thinking tokens or reasoning tokens, before producing a final visible response. This is distinct from standard language models, which produce output tokens in a single forward pass with no explicit internal deliberation.

The key innovation is where computation happens. Standard model improvements come from scaling training compute (bigger models, more data). Reasoning models add a third axis: test-time compute, meaning the model "thinks longer" on hard problems at the moment of inference. This paradigm was popularized by OpenAI's o1 model, released September 2024, and has since become mainstream across all major AI labs.

By 2026, inference workloads are projected to account for two-thirds of all AI compute, driven substantially by reasoning model adoption.

02. Why It Matters

Chain-of-thought prompting (CoT) can elicit step-by-step reasoning from a standard model, but it is fragile: the reasoning quality depends on prompting, the model is not specifically trained to reason, and it can fail in inconsistent ways.

Reasoning models bake the reasoning process into training. The model learns, through reinforcement learning, that thinking carefully before answering leads to higher rewards. This produces qualitatively different behavior: the model self-corrects, backtracks, explores alternative approaches, and verifies its own work, all within the generation process.

For tasks with verifiable correct answers (math proofs, code that passes tests, chess moves), this produces massive accuracy gains. ARC-AGI-2 illustrates the gap: at launch, standard LLMs scored 0% while reasoning models scored in the low single digits, with scores improving substantially as labs iterated on their systems.

03. How It Works

Training via reinforcement learning: Reasoning models are trained with RL algorithms that reward correct final answers, not next-token likelihood. DeepSeek-R1 used Group Relative Policy Optimization (GRPO), which optimizes the policy without a separate critic model, reducing memory use by 40-60% compared to PPO. The model receives a reward signal only when its final answer is verified correct, which forces it to develop strategies for arriving at correct answers reliably.

Emergent reasoning behaviors: DeepSeek-R1-Zero, trained via pure RL with no supervised fine-tuning on reasoning examples, spontaneously developed self-reflection, backtracking, and strategy switching. These behaviors were not explicitly programmed but emerged from the reward signal alone. AIME 2024 pass@1 improved from 15.6% to 71.0% during training.

Thinking tokens: During inference, the model generates a scratchpad of internal reasoning before the visible response. These tokens are consumed during generation and either hidden from the user (OpenAI's private chain of thought) or exposed in tagged blocks (DeepSeek's <think> tags, Claude's thinking content blocks). You are billed for thinking tokens even when they are not shown.

Logarithmic scaling: Test-time compute follows a logarithmic scaling law. Performance improves as log(thinking tokens). The first 1,000 thinking tokens provide the largest accuracy gain. Going from 10,000 to 11,000 tokens helps much less. A smaller model that thinks for 10 seconds can outperform a larger model answering instantly.

Process vs. outcome rewards: Two reward approaches exist. Outcome Reward Models (ORMs) give a single reward based on whether the final answer is correct. Process Reward Models (PRMs) evaluate each reasoning step, which is 1.5-5x more compute-efficient for training but requires step-level annotation.

04. Key Terms / Benchmarks

AIME (American Invitational Mathematics Examination): Competitive high-school math. DeepSeek-R1 scored 79.8% pass@1 on AIME 2024 (DeepSeek-R1-Zero, the pure-RL model without SFT, scored 71.0%). OpenAI o3 scored 88.9% on AIME 2025.

MATH-500: 500 competition mathematics problems. DeepSeek-R1 achieved 97.3%.

GPQA Diamond: 198 PhD-level questions in hard sciences, where non-expert humans score 34%. DeepSeek-R1 scored 71.5%. Leading frontier models in 2025-2026 have pushed scores above 90%.

ARC-AGI-2: Abstract reasoning tasks requiring novel generalization. At launch (March 2025), standard frontier models scored near 0-4%. Reasoning models have improved substantially since; the leaderboard at arcprize.org tracks current scores.

SWE-bench: Real GitHub issue resolution. o3 leads coding benchmarks; o4-mini optimized for cost-efficient coding tasks.

Thinking budget / budget_tokens: The parameter in Claude's API controlling how many tokens Claude can use for internal reasoning before responding. Deprecated in favor of adaptive thinking on Opus 4.6 and Sonnet 4.6.

Adaptive thinking: Claude's current approach (Opus 4.6+, Sonnet 4.6+) where the model automatically decides when and how deeply to think based on request complexity, rather than requiring a manually specified token budget.

GRPO (Group Relative Policy Optimization): DeepSeek's RL algorithm for training R1. Removes the need for a separate critic model, cutting training cost dramatically.

MoE (Mixture of Experts): Architecture used by DeepSeek R1. The full model has 671B parameters but activates only 37B per token, keeping inference cost low despite large total parameter count.

05. Examples

OpenAI o-series:

  • o1 (September 2024): First publicly released reasoning model with hidden chain of thought.
  • o3 (April 2025 GA): Major capability leap. Scored 87.5% (high-compute) and 75.7% (high-efficiency) on ARC-AGI-1, and 96.7% on AIME 2024. 88.9% on AIME 2025.
  • o4-mini (April 2025): Cost-optimized reasoning, strong on coding and math.
  • o3-pro (June 2025): Extended thinking budget for the most demanding tasks.

DeepSeek R1 (January 2025): Open-weight 671B MoE model (37B active parameters). 79.8% pass@1 on AIME 2024, 97.3% on MATH-500. Matched OpenAI o1 performance at a fraction of the API cost. MIT license, free for commercial use. Transparent <think> reasoning blocks. R1-0528 (May 2025) pushed capabilities further with more training compute.

Claude extended thinking / adaptive thinking: Claude 3.7 Sonnet introduced developer-controlled budget_tokens. Claude Opus 4.6 and Sonnet 4.6 moved to adaptive thinking, where the model auto-calibrates reasoning depth. Thinking blocks are visible in the API response. Developers can set display: "omitted" to suppress thinking in output while still paying for thinking tokens billed.

Gemini thinking: Gemini 2.5 and 3 series use dynamic thinking mode that auto-adjusts to task complexity. Gemini 3 Flash provides pro-grade reasoning at fast speeds. A thinkingLevel parameter allows explicit depth control for developers.

06. Common Pitfalls / Misconceptions

Reasoning models are not always better:
They are best for verifiable, multi-step problems. For simple factual retrieval, translation, summarization, or any task with sub-second latency requirements, standard models are faster, cheaper, and equally accurate. Approximately 80% of typical production queries do not benefit from extended thinking.

The visible reasoning is not necessarily the real reasoning:
Anthropic research found that hints inserted into the context appeared in visible reasoning chains only 25-39% of the time. The displayed thinking can function as post-hoc rationalization rather than a faithful transcript of the model's computation.

Test-time compute is not free scaling:
Costs are variable and hard to predict. A complex math query may generate 10,000-50,000 thinking tokens. At reasoning model pricing, that adds up quickly at scale. Latency ranges from 5 to 60+ seconds for complex problems. Check your provider's current pricing page for up-to-date token costs.

CoT prompting is not the same thing:
Adding "let's think step by step" to a prompt instructs a standard model to mimic reasoning in its output. Reasoning models have reasoning trained in via RL. The difference is the training objective and the reliability of the resulting behavior.

Reasoning models can overthink:
A 2025 Wharton study found that applying explicit CoT prompting to reasoning models (which already reason internally) added 2.9-3.1% accuracy at best, and caused -3.3% degradation in one case, while adding 20-80% more latency.

Open reasoning does not guarantee open training recipes:
DeepSeek R1 is open-weight (you can download and run the model), but OpenAI's o-series is closed. Neither has fully published training data or complete training code.