Reinforcement Learning

In Short

Reinforcement learning (RL) is the branch of machine learning where an agent learns by taking actions in an environment and receiving reward or penalty signals, with no labeled dataset. It underpins game-playing AI (AlphaGo, AlphaStar), robot control, and, most relevantly for language models, RLHF: the training phase that turns a raw language model into a helpful assistant.

01. What It Is

Reinforcement learning is one of the three core machine learning paradigms alongside supervised and unsupervised learning. The key distinction: there is no labeled training set mapping inputs to correct outputs. Instead, an agent interacts with an environment over a sequence of time steps. At each step the agent observes the current state, chooses an action, receives a reward signal, and transitions to a new state. The goal is to learn a policy: a strategy for selecting actions that maximizes cumulative reward over time.
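
The observe-act-reward loop described above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: `CorridorEnv`, `run_episode`, and the two policies are hypothetical names invented for this example, and the environment is a toy five-cell corridor with a reward of 1 for reaching the right end.

```python
import random


class CorridorEnv:
    """Hypothetical toy environment: positions 0..length-1, goal at the right end."""

    def __init__(self, length=5):
        self.length = length
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # action 1 moves right, any other action moves left (clamped to the corridor)
        move = 1 if action == 1 else -1
        self.state = min(max(self.state + move, 0), self.length - 1)
        done = self.state == self.length - 1
        reward = 1.0 if done else 0.0
        return self.state, reward, done


def run_episode(env, policy, max_steps=100):
    """Run one episode and return a list of (state, action, reward) tuples."""
    state = env.reset()
    trajectory = []
    for _ in range(max_steps):
        action = policy(state)                        # agent chooses an action
        next_state, reward, done = env.step(action)   # environment responds
        trajectory.append((state, action, reward))
        state = next_state
        if done:
            break
    return trajectory


def random_policy(state):
    return random.choice([0, 1])


def always_right(state):
    return 1
```

Calling `run_episode(CorridorEnv(), always_right)` walks straight to the goal, while `random_policy` wanders before reaching it. Learning algorithms differ in how they turn such trajectories into a better policy.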

The mathematical framework underlying most RL is the Markov decision process (MDP), formalized in the 1950s by Richard Bellman. An MDP is defined by a set of states, a set of actions, a transition function (probability of reaching state s' from state s after action a), and a reward function. "Markov" means the next state depends only on the current state and action, not on the full history.
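
To make the four components concrete, a small MDP can be written down as plain data. The two-state example below is hypothetical (the state names, action names, probabilities, and rewards are all invented for illustration). The transition function and reward function are stored together, keyed by (state, action).

```python
import random

# Hypothetical two-state MDP, invented for illustration.
STATES = ["cool", "hot"]
ACTIONS = ["slow", "fast"]

# TRANSITIONS[(state, action)] -> list of (probability, next_state, reward)
TRANSITIONS = {
    ("cool", "slow"): [(1.0, "cool", 1.0)],
    ("cool", "fast"): [(0.5, "cool", 2.0), (0.5, "hot", 2.0)],
    ("hot", "slow"): [(0.5, "cool", 1.0), (0.5, "hot", 1.0)],
    ("hot", "fast"): [(1.0, "hot", -10.0)],
}


def sample_transition(state, action, rng=random):
    """Sample (next_state, reward) using only the current state and action."""
    outcomes = TRANSITIONS[(state, action)]
    weights = [p for p, _, _ in outcomes]
    _, next_state, reward = rng.choices(outcomes, weights=weights, k=1)[0]
    return next_state, reward
```

The Markov property shows up in the signature of `sample_transition`: it takes the current state and action and nothing else, so no history is needed to determine the distribution over next states.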

02. Why It Matters

Supervised learning requires a labeled example for every input. For many tasks, that is impossible or insufficient. No human can label every chess position with the move that optimal play would require. No human can label every robot joint angle as correct or incorrect fast enough to train manipulation skills. And for language model alignment, the most useful feedback is not a correct next-token label but a human judgment about whether the full response was helpful, honest, and harmless.

RL addresses this class of problem by learning from an evaluative signal rather than from correct answers: win/lose, reward/penalty, human preference rating. This makes it applicable wherever the quality of a decision sequence can be evaluated, even if the correct decision at each individual step cannot be labeled in advance.

03. How It Works

States, actions, rewards, and policies

At time step t, the agent observes state s_t and chooses action a_t according to its policy. The environment responds with a reward r_t and new state s_{t+1}. The agent's objective is to maximize the expected cumulative discounted reward: the sum of future rewards, with a reward k steps ahead weighted by gamma^k, where the discount factor gamma lies between 0 and 1. Discounting reflects the uncertainty of distant rewards and, when gamma is below 1 and rewards are bounded, keeps the sum finite in tasks that never end.
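
As a small worked example, the discounted return of a finite reward sequence can be computed by folding from the last reward backwards. The function name `discounted_return` and the default gamma of 0.99 are choices made for this sketch.

```python
def discounted_return(rewards, gamma=0.99):
    """Return r_0 + gamma * r_1 + gamma^2 * r_2 + ... for a finite reward list."""
    g = 0.0
    # Work backwards so each step applies G_t = r_t + gamma * G_{t+1}.
    for r in reversed(rewards):
        g = r + gamma * g
    return g
```

With rewards [1, 1, 1] and gamma = 0.5, this gives 1 + 0.5 + 0.25 = 1.75. A smaller gamma makes the agent more short-sighted, because later rewards contribute less to the total.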

A policy maps states to actions (or to probability distributions over actions for stochastic policies). A deterministic policy always picks the same action in a given state. A stochastic policy samples from a probability distribution, which supports exploration.
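
The difference between the two kinds of policy is easy to see in code. In this sketch the state names, action names, and probabilities are hypothetical values chosen only for illustration.

```python
import random

# Hypothetical lookup tables, invented for illustration.
DETERMINISTIC_TABLE = {"cool": "fast", "hot": "slow"}
STOCHASTIC_TABLE = {
    "cool": {"slow": 0.2, "fast": 0.8},
    "hot": {"slow": 0.9, "fast": 0.1},
}


def deterministic_policy(state):
    """Always returns the same action for a given state."""
    return DETERMINISTIC_TABLE[state]


def stochastic_policy(state, rng=random):
    """Samples an action from the state's probability distribution."""
    distribution = STOCHASTIC_TABLE[state]
    actions = list(distribution)
    weights = [distribution[a] for a in actions]
    return rng.choices(actions, weights=weights, k=1)[0]
```

Repeated calls to `stochastic_policy("cool")` return "fast" most of the time but occasionally "slow", which is how a stochastic policy keeps trying alternatives.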

Value functions

A value function estimates how good a state is in terms of expected future reward, following the current policy. The state-value function V(s) is the expected return starting from state s. The action-value function Q(s, a) is the expected return starting from state s, taking action a, and then following the policy. Q-values are the foundation of Q-learning and related algorithms.
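
Written out, the two value functions are expectations of the discounted return. The superscript for the policy and the symbol G_t for the return are notation introduced here for illustration, following the same indexing as above (reward r_t is received for action a_t).

```latex
\begin{align}
G_t &= \sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k} \\
V^{\pi}(s) &= \mathbb{E}_{\pi}\left[\, G_t \mid s_t = s \,\right] \\
Q^{\pi}(s, a) &= \mathbb{E}_{\pi}\left[\, G_t \mid s_t = s,\ a_t = a \,\right] \\
V^{\pi}(s) &= \sum_{a} \pi(a \mid s)\, Q^{\pi}(s, a)
\end{align}
```

The last line connects the two: the value of a state is the policy-weighted average of the values of the actions available in it.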

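The Bellman equation expresses a recursive relationship: V(s) equals the expected immediate reward plus the discounted expected value of the next state. In this form it describes the value of the current policy. The Bellman optimality equation replaces the average over the policy's actions with a maximum over actions; solving it gives the optimal value function and, from it, the optimal policy.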
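
One standard way to solve the Bellman optimality equation on a small, fully known MDP is value iteration: apply the recursive backup repeatedly until the values stop changing. The sketch below assumes a hypothetical two-state MDP (all names, probabilities, and rewards are invented) and a discount factor of 0.9.

```python
GAMMA = 0.9  # assumed discount factor

# Hypothetical two-state MDP: (state, action) -> [(probability, next_state, reward)]
STATES = ["cool", "hot"]
ACTIONS = ["slow", "fast"]
TRANSITIONS = {
    ("cool", "slow"): [(1.0, "cool", 1.0)],
    ("cool", "fast"): [(0.5, "cool", 2.0), (0.5, "hot", 2.0)],
    ("hot", "slow"): [(0.5, "cool", 1.0), (0.5, "hot", 1.0)],
    ("hot", "fast"): [(1.0, "hot", -10.0)],
}


def q_value(values, state, action):
    """Expected immediate reward plus discounted value of the next state."""
    return sum(p * (r + GAMMA * values[s2])
               for p, s2, r in TRANSITIONS[(state, action)])


def value_iteration(tol=1e-9):
    """Repeat the Bellman optimality backup until values change by less than tol."""
    values = {s: 0.0 for s in STATES}
    while True:
        new_values = {s: max(q_value(values, s, a) for a in ACTIONS)
                      for s in STATES}
        delta = max(abs(new_values[s] - values[s]) for s in STATES)
        values = new_values
        if delta < tol:
            return values


def greedy_policy(values):
    """Pick, in each state, the action with the highest backed-up value."""
    return {s: max(ACTIONS, key=lambda a: q_value(values, s, a)) for s in STATES}
```

For this toy MDP the values converge to about 15.5 for "cool" and 14.5 for "hot", and the greedy policy goes fast when cool and slow when hot. Value iteration needs the transition and reward functions in advance, which is exactly what model-free methods do without.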

Q-learning

Q-learning (Watkins, 1989) estimates the Q-value function directly from experience without requiring a model of the environment. The agent maintains a table (or neural network approximation) of Q-values for each (state, action) pair. After each transition, it updates the Q-value using the Bellman backup: move the current Q-value toward the reward plus the discounted maximum Q-value of the next state. In the tabular case, provided every state-action pair keeps being visited and the learning rate decays appropriately, Q-values converge to their optimal values. That guarantee does not carry over to neural network approximations in general.
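
A minimal sketch of tabular Q-learning follows, under stated assumptions: the environment is a hypothetical five-cell corridor with a reward of 1 at the right end, exploration is epsilon-greedy, and the learning rate is held constant for simplicity. The hyperparameter values are illustrative, not recommendations.

```python
import random

N_STATES = 5      # hypothetical corridor: positions 0..4, goal at position 4
ACTIONS = [0, 1]  # 0 = move left, 1 = move right


def step(state, action):
    """Toy dynamics: move one cell, clamp to the corridor, reward 1 at the goal."""
    move = 1 if action == 1 else -1
    next_state = min(max(state + move, 0), N_STATES - 1)
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done


def q_learning(episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    q = [[0.0 for _ in ACTIONS] for _ in range(N_STATES)]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            if rng.random() < epsilon:
                action = rng.choice(ACTIONS)  # explore
            else:
                best = max(q[state])          # exploit, breaking ties randomly
                action = rng.choice([a for a in ACTIONS if q[state][a] == best])
            next_state, reward, done = step(state, action)
            # Bellman backup target: reward plus discounted best next Q-value
            # (no bootstrap term when the episode has ended).
            target = reward if done else reward + gamma * max(q[next_state])
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
    return q
```

After training, the "move right" entry is the larger one in every non-goal state, and the Q-value for stepping into the goal approaches 1. The agent never sees the `step` function's internals, only the transitions it returns.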

Deep Q-Networks (DQN, DeepMind 2013-2015) replaced the Q-table with a convolutional neural network, allowing Q-learning to scale to high-dimensional inputs like raw game pixels. Evaluated on 49 Atari games using only pixel inputs and game scores, DQN reached performance comparable to a professional human games tester on more than half of them.

Policy gradients

Rather than estimating value functions and deriving a policy, policy gradient methods directly optimize the policy itself. The policy is parameterized (e.g., by a neural network) and gradient ascent is used to increase the probability of actions that led to high rewards. The REINFORCE algorithm (Williams, 1992) is the basic form.
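
The core update can be shown on the simplest possible problem. The sketch below applies a REINFORCE-style update to a hypothetical two-armed bandit, where every episode is a single action, with a softmax policy and no baseline. The arm rewards, learning rate, and step count are illustrative assumptions.

```python
import math
import random


def softmax(prefs):
    """Convert action preferences into probabilities."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_bandit(arm_rewards=(0.0, 1.0), steps=2000, lr=0.1, seed=0):
    """Gradient ascent on expected reward for a softmax policy over arms."""
    rng = random.Random(seed)
    theta = [0.0] * len(arm_rewards)  # policy parameters, one per arm
    for _ in range(steps):
        probs = softmax(theta)
        action = rng.choices(range(len(theta)), weights=probs, k=1)[0]
        reward = arm_rewards[action]
        for k in range(len(theta)):
            # Gradient of log pi(action) with respect to theta[k] for a softmax.
            grad_log_pi = (1.0 if k == action else 0.0) - probs[k]
            theta[k] += lr * reward * grad_log_pi
    return softmax(theta)
```

Running `reinforce_bandit()` returns action probabilities that heavily favor the rewarded arm. Multi-step versions use the return following each action in place of the single reward, and usually subtract a baseline to reduce variance.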

Proximal Policy Optimization (PPO), introduced by OpenAI in 2017, is one of the most widely used policy gradient algorithms. PPO avoids the instability of naive policy gradient updates by clipping the probability ratio between the new and old policies to a small interval around 1. This approximates a trust region and removes the incentive for excessively large updates that would destabilize training. PPO has been used in the RLHF pipelines behind ChatGPT, Claude, and similar systems.
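
The clipping step is small enough to show for a single sample. This sketch computes the per-sample clipped surrogate objective; the function names are invented for this example, and the clip range of 0.2 is a commonly used value that is assumed here, not taken from the text above.

```python
import math


def probability_ratio(logp_new, logp_old):
    """Ratio pi_new(a|s) / pi_old(a|s), computed from log-probabilities."""
    return math.exp(logp_new - logp_old)


def ppo_clip_objective(ratio, advantage, clip_eps=0.2):
    """Per-sample clipped surrogate objective (to be maximized)."""
    clipped_ratio = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
    # Taking the minimum makes the objective a pessimistic bound: the policy
    # gains nothing from pushing the ratio beyond the clip range.
    return min(ratio * advantage, clipped_ratio * advantage)
```

For a positive advantage, a ratio of 1.5 is credited only as 1.2, so there is no incentive to move further in one update. For a negative advantage, a ratio of 0.5 is scored as if it were 0.8, with the same effect in the other direction. In practice this quantity is averaged over a batch and maximized by gradient ascent.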

Exploration vs. exploitation

The core tension in RL is the exploration-exploitation tradeoff. An agent that always exploits its current best strategy never discovers better strategies it has not tried. An agent that always explores randomly never uses what it has learned. Common strategies include epsilon-greedy (take a random action with probability epsilon, otherwise exploit), upper confidence bound (UCB, choose the action whose estimated value plus an uncertainty bonus is highest, so rarely tried actions still get sampled), and entropy bonuses in policy gradient methods (reward the policy for maintaining diversity).
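
The first two strategies can be sketched as action-selection functions over a list of value estimates. The function names and the exploration constant `c` are assumptions made for this example; the UCB variant shown follows the common UCB1 form, in which the bonus shrinks as an action is tried more often.

```python
import math
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """With probability epsilon pick a random action, otherwise the best one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


def ucb_action(q_values, counts, total_steps, c=2.0):
    """Pick the action with the highest value estimate plus uncertainty bonus."""
    # Try every action once before comparing bonuses.
    for action, n in enumerate(counts):
        if n == 0:
            return action
    scores = [q + c * math.sqrt(math.log(total_steps) / n)
              for q, n in zip(q_values, counts)]
    return max(range(len(scores)), key=lambda a: scores[a])
```

With equal value estimates, `ucb_action` prefers the action that has been tried less often. Epsilon-greedy explores uniformly at random, regardless of how much is already known about each action.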

04. Key Terms

Agent: The learner and decision-maker.

Environment: Everything the agent interacts with.

State: A description of the environment at a given moment.

Action: A choice the agent makes in a state.

Reward: A scalar signal the environment returns after each action, indicating how good or bad the action was.

Policy: The agent's strategy: a mapping from states to actions.

MDP (Markov decision process): The mathematical framework defining states, actions, transitions, and rewards.

Value function: Expected cumulative future reward from a state (V) or state-action pair (Q).

Q-value: The expected return starting from state s and taking action a: Q(s, a).

Bellman equation: A recursive equation that defines value functions in terms of immediate reward plus future value.

PPO: Proximal Policy Optimization. A policy gradient algorithm used in many RLHF pipelines.

RLHF: Reinforcement learning from human feedback. Uses human preference ratings as the reward signal.

05. Examples

DeepMind's AlphaGo (2016) combined supervised learning on human games, self-play RL, and tree search to defeat Lee Sedol, one of the world's top professional Go players. The game of Go has more legal board positions than there are atoms in the observable universe, making exhaustive search impossible. RL allowed the agent to improve beyond the human game records it started from through self-play.

OpenAI's Dota 2 agent (OpenAI Five, 2019) learned complex team strategy through large-scale self-play RL, trained with PPO on a shaped reward rather than a bare win/loss signal. In April 2019 it defeated OG, the reigning world champion team.

For LLMs, RLHF works as follows. A reward model is trained on human comparisons (annotators choose which of two model responses is better). The language model is then fine-tuned using PPO with the reward model's score as the reward signal, usually with a penalty for drifting too far from the starting model. This step is what moves a next-token predictor toward an assistant that follows instructions more reliably and produces fewer harmful outputs.
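
The reward model's training signal can be sketched as a pairwise loss. The text above does not say which loss is used; a common choice, assumed here, is a Bradley-Terry style objective that pushes the preferred response's score above the other's. The function name and arguments are hypothetical.

```python
import math


def pairwise_preference_loss(score_chosen, score_rejected):
    """Negative log-probability that the chosen response outranks the rejected one.

    Equivalent to -log(sigmoid(score_chosen - score_rejected)), written in a
    form that avoids overflow for large score gaps.
    """
    margin = score_chosen - score_rejected
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

When the two scores are equal the loss is log 2, about 0.693. It shrinks toward zero as the preferred response is scored higher, and grows roughly linearly when the model ranks the pair the wrong way round.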

06. Common Pitfalls and Misconceptions

"RL just means trial and error."
RL is mathematically principled. The Bellman equations, value function estimation, and policy gradient theorems are rigorous. "Trial and error" describes the data collection process, not the algorithm.

"RL always needs a simulator."
Model-free RL (Q-learning, PPO) learns directly from environment interactions without a model. Many real-world RL deployments, including RLHF, use actual environment rollouts (or human raters) rather than simulators.

"RLHF makes the model safe."
RLHF aligns the model to human rater preferences, which reduces certain failure modes. It does not guarantee safety. Reward hacking (exploiting loopholes in the reward model that do not reflect actual human preferences) is a documented failure mode.

"Exploration is solved."
Finding the right balance between exploration and exploitation in sparse-reward and high-dimensional environments remains an open research problem.