03. How It Works
States, actions, rewards, and policies
At each time step t, the agent observes state s_t and chooses action a_t according to its policy. The environment responds with a reward r_t and a new state s_{t+1}. The agent's objective is to maximize the expected cumulative discounted reward: the sum of future rewards, where a reward k steps ahead is weighted by gamma^k. The discount factor gamma (between 0 and 1) makes near-term rewards count more than distant, less certain ones, and for gamma below 1 it keeps the sum finite in tasks that never terminate.
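The discounted sum described above can be computed for a finite episode as follows. This is a minimal sketch: the function name discounted_return and the sample rewards are illustrative assumptions, not part of any library.

```python
def discounted_return(rewards, gamma):
    """Return sum over k of gamma**k * rewards[k] for a finite reward sequence."""
    g = 0.0
    # Work backwards so each step applies G_t = r_t + gamma * G_{t+1}.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


# Three rewards of 1 with gamma = 0.5: 1 + 0.5 + 0.25
print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))  # 1.75
```

Accumulating from the last reward backwards avoids computing powers of gamma explicitly and mirrors the recursive form used by the Bellman equation.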
A policy maps states to actions (or to probability distributions over actions for stochastic policies). A deterministic policy always picks the same action in a given state. A stochastic policy samples from a probability distribution, which supports exploration.
Value functions
A value function estimates how good a state is in terms of expected future reward, following the current policy. The state-value function V(s) is the expected return starting from state s. The action-value function Q(s, a) is the expected return starting from state s, taking action a, and then following the policy. Q-values are the foundation of Q-learning and related algorithms.
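The Bellman equation expresses a recursive relationship: V(s) equals the expected immediate reward plus the discounted value of the next state, averaged over the policy's action choices and the environment's transitions. Solving this equation gives the value function of the current policy. The Bellman optimality equation replaces the average over actions with a maximum over actions; solving it gives the optimal value function and, from it, the optimal policy.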
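One way to see the recursion at work is to apply the Bellman backup repeatedly until the values stop changing (iterative policy evaluation). The sketch below assumes a tiny made-up problem with a fixed policy already folded into the transition table; the state names, probabilities, and rewards are hypothetical.

```python
def evaluate_policy(transitions, gamma, tol=1e-10):
    """Iteratively solve V(s) = sum over outcomes of p * (r + gamma * V(s')).

    transitions maps each state to a list of (probability, next_state, reward)
    tuples under a fixed policy. A terminal state has an empty list.
    """
    values = {s: 0.0 for s in transitions}
    while True:
        delta = 0.0
        for s, outcomes in transitions.items():
            new_v = sum(p * (r + gamma * values[s2]) for p, s2, r in outcomes)
            delta = max(delta, abs(new_v - values[s]))
            values[s] = new_v
        if delta < tol:
            return values


# Hypothetical two-state problem: from "start", half the time we stay put
# (reward 0) and half the time we reach the terminal state "end" (reward 1).
transitions = {
    "start": [(0.5, "start", 0.0), (0.5, "end", 1.0)],
    "end": [],
}
values = evaluate_policy(transitions, gamma=0.9)
print(values["start"])  # close to 0.5 / (1 - 0.45), about 0.909
```

For this example the equation can also be solved by hand: V(start) = 0.5 * 0.9 * V(start) + 0.5, which gives V(start) = 0.5 / 0.55.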
Q-learning
Q-learning (Watkins, 1989) estimates the Q-value function directly from experience without requiring a model of the environment. The agent maintains a table (or neural network approximation) of Q-values for each (state, action) pair. After each transition, it updates the Q-value using the Bellman backup: move the current Q-value a fraction of the way (set by the learning rate) toward the reward plus the discounted maximum Q-value of the next state. In the tabular case, if every state-action pair keeps being visited and the learning rate decays appropriately, Q-values converge to their optimal values.
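The update rule can be sketched for the tabular case as follows. The function name, the dictionary-based table, and the state and action labels are assumptions made for illustration; alpha is the learning rate.

```python
from collections import defaultdict


def q_learning_update(q, state, action, reward, next_state, actions,
                      alpha=0.1, gamma=0.99, done=False):
    """Move Q(state, action) toward reward + gamma * max over a' of Q(next_state, a')."""
    # A terminal transition has no future value to bootstrap from.
    best_next = 0.0 if done else max(q[(next_state, a)] for a in actions)
    target = reward + gamma * best_next
    q[(state, action)] += alpha * (target - q[(state, action)])
    return q[(state, action)]


# Hypothetical table: unseen pairs default to 0.
q = defaultdict(float)
q[("s1", "right")] = 2.0

# Target is 1 + 0.9 * 2 = 2.8; the old value 0 moves halfway there.
new_value = q_learning_update(q, "s0", "right", 1.0, "s1",
                              ["left", "right"], alpha=0.5, gamma=0.9)
print(new_value)  # approximately 1.4
```

The maximum over next actions is taken regardless of which action the agent actually takes next, which is what lets Q-learning learn about the greedy policy while following an exploratory one.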
Deep Q-Networks (DQN, DeepMind 2013-2015) replaced the Q-table with a convolutional neural network, allowing Q-learning to scale to high-dimensional inputs like raw game pixels. Using only pixel inputs and game scores, DQN was evaluated on 49 Atari games and reached performance comparable to a professional human tester (more than 75% of the human score) on 29 of them.
Policy gradients
Rather than estimating value functions and deriving a policy, policy gradient methods directly optimize the policy itself. The policy is parameterized (e.g., by a neural network) and gradient ascent is used to increase the probability of actions that led to high rewards. The REINFORCE algorithm (Williams, 1992) is the basic form.
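A minimal sketch of the REINFORCE idea, assuming a two-armed bandit with a softmax policy over one preference per action. The arm rewards, learning rate, and function names are hypothetical. For a softmax policy, the gradient of log pi(action) with respect to preference k is (1 if k == action else 0) - pi(k).

```python
import math
import random


def softmax(prefs):
    """Convert action preferences into probabilities."""
    m = max(prefs)  # subtract the max for numerical stability
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_step(prefs, action, ret, lr):
    """One gradient ascent step on ret * log pi(action)."""
    probs = softmax(prefs)
    return [
        p + lr * ret * ((1.0 if k == action else 0.0) - probs[k])
        for k, p in enumerate(prefs)
    ]


rng = random.Random(0)
prefs = [0.0, 0.0]
arm_rewards = [0.0, 1.0]  # hypothetical: arm 1 is the better arm

for _ in range(500):
    probs = softmax(prefs)
    action = rng.choices([0, 1], weights=probs)[0]
    prefs = reinforce_step(prefs, action, arm_rewards[action], lr=0.1)

print(softmax(prefs))  # probability mass shifts toward arm 1
```

Practical implementations usually subtract a baseline from the return to reduce the variance of the gradient estimate; that step is omitted here to keep the sketch short.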
Proximal Policy Optimization (PPO), introduced by OpenAI in 2017, is one of the most widely used policy gradient algorithms. PPO avoids the instability of naive policy gradient updates by clipping the probability ratio between the new policy and the old policy to a small interval around 1 (for example, 0.8 to 1.2). This approximates a trust-region constraint and removes the incentive for excessively large updates that would destabilize training. PPO has been a standard choice in the RLHF pipelines used to train assistants such as ChatGPT and Claude.
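The clipping step can be illustrated for a single sample. In this sketch, ratio stands for the new policy's probability of the action divided by the old policy's, advantage measures how much better the action was than expected, and clip_eps = 0.2 is an assumed setting. The function name is illustrative.

```python
def ppo_clip_objective(ratio, advantage, clip_eps=0.2):
    """Clipped surrogate objective for one sample (to be maximized)."""
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    # Taking the minimum makes the objective a pessimistic bound: the policy
    # gains nothing by pushing the ratio further outside the clip interval
    # in the direction the advantage favors.
    return min(ratio * advantage, clipped_ratio * advantage)


# Good action (advantage > 0): the gain is capped once the ratio exceeds 1.2.
print(ppo_clip_objective(1.5, 1.0))
# Bad action (advantage < 0): lowering the ratio below 0.8 earns nothing extra.
print(ppo_clip_objective(0.5, -1.0))
# Moves in the harmful direction are not clipped, so they are fully penalized.
print(ppo_clip_objective(1.5, -1.0))
```

In training, this quantity is averaged over a batch of sampled transitions and maximized with several gradient steps before new data is collected.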
Exploration vs. exploitation
A core tension in RL is the exploration-exploitation tradeoff. An agent that always exploits its current best strategy never discovers better strategies it has not tried. An agent that always explores randomly never uses what it has learned. Common strategies include epsilon-greedy (take a random action with probability epsilon, otherwise exploit), upper confidence bound (UCB, pick the action whose estimated value plus an uncertainty bonus is highest, so rarely tried actions still get sampled), and entropy bonuses in policy gradient methods (reward the policy for maintaining diversity).
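The first two strategies can be sketched as action-selection functions over a list of value estimates. The function names, the exploration constant c, and the sample numbers are assumptions for illustration. The UCB form shown is one common variant, with a bonus that shrinks as an action is tried more often.

```python
import math
import random


def epsilon_greedy(q_values, epsilon, rng):
    """With probability epsilon pick a random action, otherwise the best one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


def ucb_select(q_values, counts, c=2.0):
    """Pick the action with the highest estimate plus uncertainty bonus."""
    # Try every action once before comparing bonuses.
    for a, n in enumerate(counts):
        if n == 0:
            return a
    total = sum(counts)
    scores = [
        q + c * math.sqrt(math.log(total) / n)
        for q, n in zip(q_values, counts)
    ]
    return max(range(len(scores)), key=lambda a: scores[a])


rng = random.Random(0)
# With epsilon = 0 the choice is purely greedy: action 1 has the top estimate.
print(epsilon_greedy([0.2, 0.9, 0.5], epsilon=0.0, rng=rng))  # 1
# Action 1 has a lower estimate but was tried only once, so its bonus wins.
print(ucb_select([0.5, 0.4], counts=[100, 1]))  # 1
```

Epsilon-greedy explores uniformly at random, while UCB directs exploration toward actions whose estimates are least reliable.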