Reasoning model


A reasoning model, also known as a reasoning language model or a large reasoning model, is a type of large language model that has been specifically trained to solve complex tasks requiring multiple steps of logical reasoning. These models perform better than standard LLMs on logic, mathematics, and programming tasks. They can revisit and revise earlier reasoning steps and use additional computation at inference time as a further way to scale performance, complementing traditional scaling based on training data size, model parameters, and training compute.

Overview

Unlike traditional language models that generate responses immediately, reasoning models allocate additional computation, often described as "thinking" time, before producing an answer to a multi-step problem. OpenAI introduced the terminology in September 2024 when it released the o1 series, describing the models as designed to "spend more time thinking" before responding. The company framed o1 as a reset in model naming aimed at complex tasks in science, coding, and mathematics, and it contrasted o1's performance with GPT-4o on benchmarks such as AIME and Codeforces. Independent reporting the same week summarized the launch and highlighted OpenAI's claim that o1 automates chain-of-thought-style reasoning to achieve large gains on difficult exams.
In operation, reasoning models generate internal chains of intermediate steps, then select and refine a final answer. OpenAI reported that o1's accuracy improves as the model is given more reinforcement learning during training and more test-time compute at inference. The company initially chose to hide raw chains and instead return a model-written summary, stating that it "decided not to show" the underlying thoughts so researchers could monitor them without exposing unaligned content to end users. Commercial deployments document separate "reasoning tokens" that meter hidden thinking and a control for "reasoning effort" that tunes how much compute the model uses. These features make the models slower than ordinary chat systems while enabling stronger performance on difficult problems.
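The interaction pattern can be illustrated with a short sketch. The snippet below assumes an OpenAI-style Python client in which a reasoning_effort parameter controls how much hidden thinking is allocated and the usage object reports reasoning tokens separately; exact parameter and field names vary across providers and API versions, so treat the details as illustrative rather than definitive.

    # Illustrative sketch of calling a reasoning model with an effort control.
    # Parameter and field names are assumptions and may differ by provider/version.
    from openai import OpenAI

    client = OpenAI()  # reads the API key from the environment

    response = client.chat.completions.create(
        model="o1",                   # a reasoning model
        reasoning_effort="high",      # tune how much hidden thinking to allocate
        messages=[
            {"role": "user", "content": "Prove that the square root of 2 is irrational."}
        ],
    )

    print(response.choices[0].message.content)
    # Hidden reasoning is metered separately from visible output tokens.
    print(response.usage.completion_tokens_details.reasoning_tokens)

Because the hidden chain is billed and time-bound, raising the effort setting trades latency and cost for accuracy on harder problems.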

History

The research trajectory toward reasoning models combined advances in supervision, prompting, and search-style inference.
Early alignment work showed that models can be fine-tuned to follow instructions using reinforcement learning from human feedback and preference-based rewards. In 2022, Google Research scientists Jason Wei and Denny Zhou showed that chain-of-thought prompting "significantly improves the ability" of large models on complex reasoning tasks.
A companion result demonstrated that the simple instruction "Let's think step by step" can elicit zero-shot reasoning. Follow-up work introduced self-consistency decoding, which "boosts the performance" of chain-of-thought by sampling diverse solution paths and choosing the consensus, and tool-augmented methods such as ReAct, a portmanteau of Reason and Act, that prompt models to "generate both reasoning traces" and actions. Research then generalized chain-of-thought into search over multiple candidate plans. The Tree-of-Thoughts framework from Princeton computer scientist Shunyu Yao proposes that models "perform deliberate decision making" by exploring and backtracking over a tree of intermediate thoughts.
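Self-consistency can be summarized in a few lines: sample several chains of thought at a nonzero temperature, extract each chain's final answer, and return the most common one. The sketch below is schematic; sample_chain_of_thought and extract_answer are placeholders for a model call and an answer parser, not parts of any particular library.

    from collections import Counter

    def self_consistency(prompt, sample_chain_of_thought, extract_answer, n=10):
        """Sample n reasoning chains and return the majority-vote answer.

        sample_chain_of_thought(prompt) -> str  # one sampled chain (temperature > 0)
        extract_answer(chain) -> str            # parse the final answer from a chain
        """
        answers = []
        for _ in range(n):
            chain = sample_chain_of_thought(prompt)  # a diverse reasoning path
            answers.append(extract_answer(chain))
        # The consensus answer across sampled paths is usually more reliable
        # than any single greedy chain.
        return Counter(answers).most_common(1)[0][0]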
OpenAI's reported breakthrough focused on supervising reasoning processes rather than only outcomes, with Lightman et al.'s "Let's Verify Step by Step" reporting that rewarding each correct step "significantly outperforms outcome supervision" on challenging math problems and improves interpretability by aligning the chain-of-thought with human judgment. OpenAI's o1 announcement ties these strands together with a large-scale reinforcement learning algorithm that trains the model to refine its own chain of thought, and it reports that accuracy rises with more training compute and more time spent thinking at inference.
Together, these developments define the core of reasoning models. They use supervision signals that evaluate the quality of intermediate steps, they exploit inference-time exploration such as consensus or tree search, and they expose controls for how much internal thinking compute to allocate. OpenAI's o1 family made this approach available at scale in September 2024 and popularized the label "reasoning model" for LLMs that deliberately think before they answer.
The development of reasoning models illustrates Richard S. Sutton's "bitter lesson" that scaling compute typically outperforms methods based on human-designed insights. This principle was demonstrated by researchers at the Generative AI Research Lab, who initially attempted to replicate o1's capabilities using sophisticated methods including tree search and reinforcement learning in late 2024. Their findings, published in the "o1 Replication Journey" series, revealed that knowledge distillation, a comparatively straightforward technique that trains a smaller model to mimic o1's outputs, produced unexpectedly strong performance. This outcome illustrated how direct scaling approaches can, at times, outperform more complex engineering solutions.

Drawbacks

Reasoning models require significantly more computational resources during inference compared to non-reasoning models. Research on the American Invitational Mathematics Examination benchmark found that reasoning models were 10 to 74 times more expensive to operate than their non-reasoning counterparts. The extended inference time is attributed to the detailed, step-by-step reasoning outputs that these models generate, which are typically much longer than responses from standard large language models that provide direct answers without showing their reasoning process.
In early 2025, one researcher argued that reasoning models may face additional denial-of-service risks from "overthinking attacks," in which crafted inputs cause the model to expend excessive reasoning compute.

Releases

2024

In September 2024, OpenAI released o1-preview, a large language model with enhanced reasoning capabilities. The full version, o1, was released in December 2024. OpenAI initially shared preliminary results on its successor model, o3, in December 2024, with the full o3 model becoming available in 2025.
Alibaba released reasoning versions of its Qwen large language models in November 2024. In December 2024, the company introduced QvQ-72B-Preview, an experimental visual reasoning model.
In December 2024, Google introduced Deep Research in Gemini, a feature designed to conduct multi-step research tasks.
On December 16, 2024, researchers demonstrated that by scaling test-time compute, a relatively small Llama 3B model could outperform a much larger Llama 70B model on challenging reasoning tasks. This experiment suggested that improved inference strategies can unlock reasoning capabilities even in smaller models.
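A common way to spend the extra test-time compute in such experiments is best-of-N selection: sample many candidate solutions from the small model and keep the one that a separate scorer ranks highest. The sketch below is schematic; generate_candidate and score_solution are placeholders for a sampled model call and a verifier or reward model, not a specific published interface.

    def best_of_n(prompt, generate_candidate, score_solution, n=32):
        """Spend more inference compute by sampling n candidate solutions and
        returning the one the scorer (e.g., a verifier or reward model) rates
        highest. Larger n trades compute for accuracy."""
        candidates = [generate_candidate(prompt) for _ in range(n)]
        return max(candidates, key=score_solution)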

2025

In January 2025, DeepSeek released R1, a reasoning model that achieved performance comparable to OpenAI's o1 at significantly lower computational cost. The release demonstrated the effectiveness of Group Relative Policy Optimization, a reinforcement learning technique used to train the model.
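The central idea of Group Relative Policy Optimization is that, instead of training a separate value network, the baseline for each sampled answer comes from the other answers in its group: rewards are normalized by the group's mean and standard deviation and used as advantages. A minimal sketch of that computation, assuming one scalar reward per sampled output, is:

    import statistics

    def group_relative_advantages(rewards):
        """GRPO-style advantages: normalize each sampled output's reward by the
        mean and standard deviation of its group, removing the need for a critic."""
        mean = statistics.mean(rewards)
        std = statistics.pstdev(rewards) or 1.0  # guard against a zero-variance group
        return [(r - mean) / std for r in rewards]

    # Example: rewards for a group of answers sampled for one prompt.
    print(group_relative_advantages([1.0, 0.0, 0.0, 1.0, 1.0]))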
On January 25, 2025, DeepSeek enhanced R1 with web search capabilities, allowing the model to retrieve information from the internet while performing reasoning tasks.
Research during this period further validated the effectiveness of knowledge distillation for creating reasoning models. The s1-32B model achieved strong performance by fine-tuning on a small distilled dataset and applying "budget forcing", a test-time technique for controlling how long the model reasons, reinforcing findings that simpler training approaches can be highly effective for reasoning capabilities.
On February 2, 2025, OpenAI released Deep Research, a feature powered by their o3 model that enables users to conduct comprehensive research tasks. The system generates detailed reports by automatically gathering and synthesizing information from multiple web sources.
OpenAI called GPT-4.5 its "last non-chain-of-thought model" and, with GPT-5, introduced a router that selects between reasoning and non-reasoning models based on the difficulty of the task.

Training

Reasoning models follow the familiar large-scale pretraining used for frontier language models, then diverge in post-training and optimization. OpenAI reports that o1 is trained with a large-scale reinforcement learning algorithm that teaches the model to use and refine a chain of thought before answering. The company emphasizes two coupled levers: more reinforcement learning during training and more time spent thinking at inference, and it documents smooth gains as each increases. OpenAI also states that it decided not to show raw chains to end users and instead returns a model-written summary, a product choice tied to safety monitoring and competitive concerns.
A central ingredient is process supervision, which rewards intermediate steps rather than only the final answer. OpenAI's study introduced a process reward model trained on step-level labels and found that process supervision significantly outperforms outcome-only supervision on challenging math problems. The project also released the PRM800K step-level feedback dataset and argued that process-level rewards improve interpretability because humans can check each step. These results supplied a practical recipe for supervising chains of thought that was later scaled into production training.
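In practice, a process reward model assigns a score to each step of a candidate solution, and the per-step scores are aggregated to rank whole solutions. The sketch below is illustrative only; score_step stands in for a trained step-level reward model and is not a published interface.

    import math

    def solution_score(steps, score_step):
        """Aggregate per-step scores from a process reward model. The product of
        step-correctness probabilities is used here; taking the minimum (the
        weakest step) is another common choice.
        score_step(step) -> estimated probability that the step is correct."""
        return math.prod(score_step(s) for s in steps)

    def rerank(solutions, score_step):
        """Return the candidate solution whose intermediate steps score highest."""
        return max(solutions, key=lambda steps: solution_score(steps, score_step))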
This training differs in important ways from traditional frontier models that do not target reasoning. Standard systems are pretrained on internet-scale corpora with a next-token prediction objective, then aligned through instruction tuning and preference optimization. The canonical InstructGPT recipe first uses supervised fine-tuning on human demonstrations, then trains a reward model from pairwise preferences, and finally optimizes the policy with reinforcement learning, typically PPO with a KL penalty. Variants such as direct preference optimization remove the explicit RL step and optimize the model directly on preference data, but the supervision target is still the final outcome judged by raters rather than the quality of internal steps. Technical reports for GPT-4 summarize this conventional pipeline as next-token pretraining followed by RLHF-style post-training to shape behavior.
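The optimization target of this conventional pipeline can be written compactly. With a reward model r_\phi, the supervised fine-tuned reference policy \pi_ref, and KL coefficient \beta, the RLHF objective is (in standard notation, not tied to any single paper):

    \max_{\pi_\theta}\; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot\mid x)}\big[ r_\phi(x, y) \big] \;-\; \beta\, D_{\mathrm{KL}}\big( \pi_\theta(\cdot\mid x)\,\|\,\pi_{\mathrm{ref}}(\cdot\mid x) \big)

Direct preference optimization instead minimizes a classification-style loss over preference pairs (y_w, y_l):

    \mathcal{L}_{\mathrm{DPO}}(\theta) = -\,\mathbb{E}_{(x,\, y_w,\, y_l)}\Big[ \log \sigma\Big( \beta \log \tfrac{\pi_\theta(y_w\mid x)}{\pi_{\mathrm{ref}}(y_w\mid x)} - \beta \log \tfrac{\pi_\theta(y_l\mid x)}{\pi_{\mathrm{ref}}(y_l\mid x)} \Big) \Big]

In both formulations the reward or preference is attached to the complete output y, not to individual reasoning steps.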
In contrast, reasoning models are optimized to produce, critique, and revise multi-step chains during training. OpenAI states that reinforcement learning is applied to the chain itself, which teaches the model to recognize mistakes, break problems into simpler steps, and switch strategies when the current approach fails. OpenAI also documents that it hides chains at inference and returns an answer that summarizes useful ideas from the internal trace. These design choices reflect the model's training objective and its intended monitoring.
Zelikman et al. introduced STaR, which bootstraps rationales by iteratively generating candidate chains, filtering for those that reach the correct answer, and fine-tuning on the surviving traces; they reported gains over outcome-only fine-tuning. A variant of the method generates rationales with the correct answer provided as a hint, supplying an additional training signal that targets intermediate reasoning rather than only final answers.
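The STaR loop is simple enough to sketch. In the code below, generate_rationale and fine_tune are placeholders for a sampled model call and a supervised training step; only chains that end in the known correct answer are kept as training data.

    def star_iteration(model, problems, answers, generate_rationale, fine_tune):
        """One round of STaR-style bootstrapping: generate rationales, keep those
        whose final answer matches the reference, and fine-tune on the survivors.
        generate_rationale(model, problem) -> (chain_of_thought, final_answer)
        fine_tune(model, examples) -> updated model
        Both callables are placeholders rather than a published API."""
        kept = []
        for problem, correct in zip(problems, answers):
            chain, predicted = generate_rationale(model, problem)
            if predicted == correct:              # filter on the known answer
                kept.append((problem, chain, predicted))
        return fine_tune(model, kept)             # train on the filtered rationales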
DeepSeek reported R1 and R1-Zero systems trained with pure RL to elicit long chains, self-verification, and reflection, arguing that explicit chain-level rewards can induce general reasoning behaviors. These results indicate that post-training focused on chain quality has become a distinct regime separate from outcome-only alignment.