03. How It Works
Neural networks
A neural network is a mathematical function built from layers of simple units called neurons. Each neuron takes several numeric inputs, multiplies each input by a weight (a number representing how much that input matters), sums the results, adds a bias term, and then passes the sum through an activation function that introduces non-linearity. Without activation functions, any stack of layers would collapse into a single linear transformation, and the network would have no more expressive power than linear regression.
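The neuron computation described above can be sketched in a few lines of plain Python. The function names and all numeric values here are hypothetical, chosen only for illustration. The second half shows why the activation matters: with an identity "activation", two stacked units behave exactly like one linear unit.

```python
def neuron(inputs, weights, bias, activation):
    """Weighted sum of the inputs plus a bias, passed through an activation."""
    weighted_sum = sum(x * w for x, w in zip(inputs, weights)) + bias
    return activation(weighted_sum)


def relu(z):
    return max(0.0, z)


def identity(z):
    # Stand-in for "no activation function at all".
    return z


# A single neuron with two inputs (illustrative values).
output = neuron(inputs=[1.0, 2.0], weights=[0.5, 0.25], bias=0.25, activation=relu)

# Two stacked one-input units without a non-linearity...
x, w1, b1, w2, b2 = 2.0, 3.0, 1.0, 0.5, -1.0
stacked = neuron([neuron([x], [w1], b1, identity)], [w2], b2, identity)

# ...equal one linear unit with weight w2*w1 and bias w2*b1 + b2.
single = neuron([x], [w2 * w1], w2 * b1 + b2, identity)
```

Here `stacked` and `single` hold the same value, which is the collapse the paragraph describes. Swapping `identity` for `relu` in the inner unit breaks that equivalence for inputs where the inner sum is negative.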
Common activation functions include ReLU (which outputs zero for negative inputs and the input value for positive ones), sigmoid (which squashes any input into the 0-1 range, useful for probability outputs), and softmax (which converts a vector of raw scores into a probability distribution over multiple classes).
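As a rough sketch, the three activation functions can be written with only the standard library. The branching in `sigmoid` and the max-subtraction in `softmax` are common numerical-stability tricks added here as an assumption; they do not change the mathematical result.

```python
import math


def relu(z):
    """Zero for negative inputs, the input itself otherwise."""
    return max(0.0, z)


def sigmoid(z):
    """Map any real number into the open interval (0, 1)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # Equivalent form that avoids overflow for large negative z.
    e = math.exp(z)
    return e / (1.0 + e)


def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([1.0, 2.0, 3.0])  # the highest score gets the highest probability
```

Note that softmax preserves the ordering of the scores: the largest raw score always receives the largest probability.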
A network has an input layer, one or more hidden layers, and an output layer. Information flows forward from input to output in a forward pass. The depth of a network refers to its number of layers. Deep learning simply means the network has many hidden layers, typically tens to hundreds in modern architectures.
Training as optimization
Training is the process of finding the weight values that make the network's outputs as accurate as possible across a labeled dataset. The quality of a prediction is measured by a loss function (also called a cost function). For a classification problem this is typically cross-entropy loss. For regression it is often mean squared error. Lower loss means better predictions.
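The two loss functions named above can be sketched as follows. The function names are hypothetical, and the cross-entropy shown is the single-example form, where the model outputs one probability per class and exactly one class is correct.

```python
import math


def mean_squared_error(targets, predictions):
    """Average of the squared differences between targets and predictions."""
    squared_errors = [(t - p) ** 2 for t, p in zip(targets, predictions)]
    return sum(squared_errors) / len(targets)


def cross_entropy(true_class, predicted_probs):
    """Negative log of the probability the model assigned to the correct class."""
    return -math.log(predicted_probs[true_class])


# Regression: one prediction is exact, the other is off by 2.
mse = mean_squared_error(targets=[1.0, 2.0], predictions=[1.0, 4.0])

# Classification: the correct class is index 0 in both cases.
confident_loss = cross_entropy(0, [0.9, 0.1])
hesitant_loss = cross_entropy(0, [0.6, 0.4])
```

The confident, correct prediction yields the lower cross-entropy, matching the rule that lower loss means better predictions.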
The goal is to minimize the loss. Gradient descent is the standard algorithm: compute the gradient of the loss with respect to every weight (the gradient points in the direction of steepest increase), then nudge each weight a small step in the opposite direction. The size of that step is controlled by the learning rate. In practice each step uses a small batch of examples rather than the whole dataset (mini-batch stochastic gradient descent), and the process repeats across many batches until the loss plateaus.
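A minimal sketch of the update rule, using a toy one-weight loss instead of a real network. The loss function `(w - 3)^2`, the starting point, the learning rate, and the step count are all assumptions picked so the behaviour is easy to follow.

```python
def loss(w):
    # Toy loss with its minimum at w = 3.
    return (w - 3.0) ** 2


def gradient(w):
    # Derivative of the toy loss with respect to w.
    return 2.0 * (w - 3.0)


def gradient_descent(w, learning_rate=0.1, steps=100):
    """Repeatedly step against the gradient, scaled by the learning rate."""
    for _ in range(steps):
        w = w - learning_rate * gradient(w)
    return w


w_final = gradient_descent(w=0.0)
```

After the loop `w_final` sits very close to 3, the minimizer of the toy loss. With a learning rate above 1.0 the same loop would overshoot further on every step and diverge, which is why the step size needs care.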
Backpropagation is the algorithm that computes those gradients efficiently. After a forward pass produces a prediction and the loss is computed, backpropagation uses the chain rule of calculus to propagate the error signal backward through each layer, computing each weight's contribution to the overall loss. Modern frameworks like PyTorch and TensorFlow handle this automatically via automatic differentiation.
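To make the chain-rule step concrete, here is a hand-written backward pass for a deliberately tiny, hypothetical network: one sigmoid hidden unit followed by one linear output unit, with squared-error loss. Frameworks derive these expressions automatically; the sketch only illustrates the idea and checks the result against a finite-difference estimate.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(w1, w2, x, y):
    """Forward pass: input -> hidden unit -> output -> loss."""
    h = sigmoid(w1 * x)        # hidden layer
    y_hat = w2 * h             # output layer
    loss = (y_hat - y) ** 2    # squared error
    return h, y_hat, loss


def backward(w1, w2, x, y):
    """Backward pass: apply the chain rule from the loss back to each weight."""
    h, y_hat, _ = forward(w1, w2, x, y)
    dloss_dyhat = 2.0 * (y_hat - y)
    dloss_dw2 = dloss_dyhat * h            # since y_hat = w2 * h
    dloss_dh = dloss_dyhat * w2            # error signal passed to the hidden layer
    dloss_dz1 = dloss_dh * h * (1.0 - h)   # derivative of sigmoid is h * (1 - h)
    dloss_dw1 = dloss_dz1 * x              # since z1 = w1 * x
    return dloss_dw1, dloss_dw2


def numerical_gradients(w1, w2, x, y, eps=1e-6):
    """Central finite differences, used only as a sanity check."""
    g1 = (forward(w1 + eps, w2, x, y)[2] - forward(w1 - eps, w2, x, y)[2]) / (2 * eps)
    g2 = (forward(w1, w2 + eps, x, y)[2] - forward(w1, w2 - eps, x, y)[2]) / (2 * eps)
    return g1, g2


# Illustrative values only.
analytic = backward(w1=0.5, w2=-0.3, x=1.5, y=1.0)
numeric = numerical_gradients(w1=0.5, w2=-0.3, x=1.5, y=1.0)
```

The two gradient estimates agree to several decimal places. The backward pass reuses `dloss_dyhat` for both weights, which is the efficiency gain: each intermediate derivative is computed once and shared by everything upstream of it.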
Supervised, unsupervised, and reinforcement learning
Supervised learning trains on labeled examples: input-output pairs where a human has annotated the correct answer. The model learns a mapping from inputs to outputs. Image classifiers, spam detectors, and most LLM fine-tuning fall here.
Unsupervised learning finds structure in data without labels. Clustering algorithms group similar items. Autoencoders learn compressed representations of inputs. Generative models learn the underlying distribution of the data. Much of LLM pre-training is technically self-supervised: the model predicts the next token from the preceding ones. This is a form of unsupervised learning that manufactures its own labels from raw data.
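The way next-token prediction manufactures labels from raw text can be sketched like this. Splitting on whitespace is a simplifying assumption; production LLMs use subword tokenizers, and the function name is hypothetical.

```python
def next_token_pairs(tokens):
    """Turn one token sequence into (context, target) training pairs."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


tokens = "the cat sat down".split()
pairs = next_token_pairs(tokens)

for context, target in pairs:
    print(context, "->", target)
```

Each pair uses the preceding tokens as input and the following token as the label, so a four-token sentence yields three training examples with no human annotation.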
Reinforcement learning trains an agent that takes actions in an environment and receives reward or penalty signals over time. There is no labeled dataset. The agent learns a policy that maximizes cumulative reward through trial and error. AlphaGo used RL. RLHF (reinforcement learning from human feedback) is used to align LLMs to human preferences after initial pre-training.
Overfitting vs. generalization
A model that fits its training data too closely memorizes noise and specific examples rather than learning the underlying pattern. It scores well on training data but fails on new inputs. This is overfitting. The opposite failure, underfitting, is a model too simple to capture the pattern even in the training data.
Generalization is the ability to perform well on new data not seen during training. Techniques to improve generalization include dropout (randomly deactivating neurons during training to prevent co-adaptation), weight regularization (penalizing large weight values), data augmentation (artificially expanding the training set), and early stopping (halting training before the model over-memorizes).
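Two of these techniques are simple enough to sketch directly. The version of early stopping below watches a validation loss and uses a `patience` counter, which is a common convention assumed here rather than something the text specifies. The loss values and function names are made up for illustration.

```python
def l2_penalty(weights, strength):
    """Weight regularization term: grows with the squared size of the weights."""
    return strength * sum(w * w for w in weights)


def early_stopping_epoch(val_losses, patience=2):
    """Return the epoch with the best validation loss seen before stopping.

    Training halts once the loss has failed to improve for `patience`
    consecutive epochs.
    """
    best_epoch = 0
    best_loss = float("inf")
    epochs_without_improvement = 0
    for epoch, val_loss in enumerate(val_losses):
        if val_loss < best_loss:
            best_loss = val_loss
            best_epoch = epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break
    return best_epoch


# Hypothetical per-epoch validation losses: improving, then getting worse.
val_losses = [0.90, 0.70, 0.60, 0.65, 0.72, 0.55]
keep_epoch = early_stopping_epoch(val_losses)

# The penalty would be added to the data loss during training.
penalty = l2_penalty(weights=[1.0, -2.0], strength=0.5)
```

With `patience=2`, training stops after epoch 4 and the weights from epoch 2 are kept, so the later value of 0.55 is never reached. That is the trade-off of a small patience value: it saves compute but can stop before a late improvement.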