Machine Learning Basics

In Short

Machine learning is the branch of AI where systems learn patterns from data rather than following hand-coded rules. Deep learning is a subset of machine learning that uses layered neural networks, and that layered approach, scaled up and trained on massive text corpora, is what modern LLMs are built on.

01. What It Is

Artificial intelligence is the broad goal: build machines that perform tasks normally requiring human cognition. Machine learning is one strategy for achieving that goal, one where you supply data and a learning algorithm rather than explicit rules. Deep learning is a specific class of machine learning that uses multi-layer neural networks and, crucially, learns its own features instead of relying on hand-designed ones.

The nesting looks like this: every deep learning system is a machine learning system, and every machine learning system is an AI system, but not every AI system uses machine learning and not every machine learning system uses deep learning. A 1990s chess engine was AI but not machine learning. A random forest spam classifier is machine learning but not deep learning. GPT-4 is all three.

02. Why It Matters

Before machine learning, writing software to recognize a handwritten digit, translate a sentence, or detect a tumor required experts to specify every rule explicitly. That approach hits a ceiling: human understanding of these tasks is tacit, hard to articulate, and enormously context-dependent. Machine learning shifts the bottleneck from rule-writing to data collection, and for many tasks data is easier to gather than rules are to write. Deep learning in particular made image recognition, speech synthesis, and natural language understanding work well enough for everyday use, and those breakthroughs directly enabled modern AI products.

03. How It Works

Neural networks

A neural network is a mathematical function built from layers of simple units called neurons. Each neuron takes several numeric inputs, multiplies each input by a weight (a number representing how much that input matters), sums the results, adds a bias term, and then passes the sum through an activation function that introduces non-linearity. Without activation functions, any stack of layers would collapse into a single linear transformation, and the network would have no more expressive power than linear regression.
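The neuron described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a framework API: the function names and the example numbers are assumptions chosen for the demonstration, and ReLU is used as the activation. The second half shows, for scalar weights, why two layers without an activation reduce to one linear function.

```python
def neuron(inputs, weights, bias):
    """Weighted sum of the inputs plus a bias, passed through ReLU."""
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return max(0.0, z)

# Illustrative values only.
output = neuron([1.0, 2.0], [0.5, -0.25], 0.1)

# Two linear "layers" with no activation between them (scalar case).
w1, b1 = 2.0, 1.0
w2, b2 = 3.0, -1.0

def two_linear_layers(x):
    return w2 * (w1 * x + b1) + b2

def single_linear_layer(x):
    # Same function, rewritten with one weight and one bias.
    return (w2 * w1) * x + (w2 * b1 + b2)
```

For any input, `two_linear_layers` and `single_linear_layer` return the same value, which is the collapse the paragraph describes.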

Common activation functions include ReLU (which outputs zero for negative inputs and the input value for positive ones), sigmoid (which squashes outputs to the 0-1 range, useful for probability outputs), and softmax (which converts a vector of raw scores into a probability distribution over multiple classes).
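The three activation functions named above are short enough to write out directly. The sketch below uses only the standard library; subtracting the maximum score inside softmax is a common numerical-stability habit rather than part of the definition.

```python
import math

def relu(x):
    """Zero for negative inputs, the input itself otherwise."""
    return max(0.0, x)

def sigmoid(x):
    """Squashes any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def softmax(scores):
    """Turns a list of raw scores into probabilities that sum to 1."""
    m = max(scores)  # shift for numerical stability; the result is unchanged
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probabilities = softmax([1.0, 2.0, 3.0])
```

The largest raw score receives the largest probability, and the ordering of the scores is preserved.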

A network has an input layer, one or more hidden layers, and an output layer. Information flows from input to output in a forward pass. The depth of a network is its number of layers. "Deep" learning refers to networks with many hidden layers; modern architectures typically have tens to hundreds.

Training as optimization

Training is the process of finding the weight values that make the network's outputs as accurate as possible across a labeled dataset. The quality of a prediction is measured by a loss function (also called a cost function). For a classification problem this is typically cross-entropy loss. For regression it is often mean squared error. Lower loss means better predictions.
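Both loss functions mentioned above fit in a few lines. This is a simplified sketch: cross-entropy is shown for a single example whose correct class is given as an index, and the example numbers are made up.

```python
import math

def mean_squared_error(targets, predictions):
    """Average squared difference between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(targets, predictions)) / len(targets)

def cross_entropy(true_class, probabilities):
    """Negative log of the probability the model gave to the correct class."""
    return -math.log(probabilities[true_class])

mse = mean_squared_error([1.0, 2.0], [1.5, 2.0])   # (0.25 + 0.0) / 2
confident = cross_entropy(0, [0.9, 0.1])           # correct class got 0.9
unsure = cross_entropy(0, [0.5, 0.5])              # correct class got 0.5
```

The confident, correct prediction produces the lower loss, which is the behavior an optimizer can exploit.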

The goal is to minimize the loss. Gradient descent is the standard algorithm: compute the gradient of the loss with respect to every weight (the gradient points in the direction of steepest increase), then nudge each weight a small step in the opposite direction. The size of that step is set by the learning rate. In practice each step uses a small batch of examples rather than the whole dataset (mini-batch stochastic gradient descent), and the process repeats across many batches until the loss plateaus.
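To make the update rule concrete, here is gradient descent on a toy one-weight loss, (w - 3)^2, whose minimum is known to be at w = 3. The loss, the starting point, and the learning rate are all assumptions picked for illustration; a real network applies the same step to millions of weights at once.

```python
def loss(w):
    return (w - 3.0) ** 2

def gradient(w):
    # Derivative of (w - 3)^2 with respect to w.
    return 2.0 * (w - 3.0)

def gradient_descent(w, learning_rate=0.1, steps=100):
    for _ in range(steps):
        # Step against the gradient, scaled by the learning rate.
        w = w - learning_rate * gradient(w)
    return w

w_final = gradient_descent(w=0.0)
```

With this learning rate each step shrinks the distance to the minimum by a constant factor, so `w_final` ends up very close to 3.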

Backpropagation is the algorithm that computes those gradients efficiently. After a forward pass produces a prediction and the loss is computed, backpropagation uses the chain rule of calculus to propagate the error signal backward through each layer, computing each weight's contribution to the overall loss. Modern frameworks like PyTorch and TensorFlow handle this automatically via automatic differentiation.
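The chain-rule bookkeeping can be seen on a deliberately tiny network: one input, one tanh hidden unit, one linear output, and a squared-error loss. This is a hand-written sketch under those assumptions, not how PyTorch or TensorFlow implement automatic differentiation. A finite-difference estimate is included as a sanity check on the analytic gradient.

```python
import math

def forward(w1, w2, x, target):
    h = math.tanh(w1 * x)        # hidden activation
    y = w2 * h                   # network output
    loss = (y - target) ** 2     # squared error
    return h, y, loss

def backward(w1, w2, x, target):
    """Gradients of the loss with respect to w1 and w2 via the chain rule."""
    h, y, _ = forward(w1, w2, x, target)
    dloss_dy = 2.0 * (y - target)
    dloss_dw2 = dloss_dy * h                 # y = w2 * h
    dloss_dh = dloss_dy * w2                 # error flowing back to the hidden unit
    dloss_dz = dloss_dh * (1.0 - h ** 2)     # derivative of tanh
    dloss_dw1 = dloss_dz * x                 # z = w1 * x
    return dloss_dw1, dloss_dw2

def numerical_grad_w1(w1, w2, x, target, eps=1e-6):
    """Central finite difference, used only to check the analytic result."""
    up = forward(w1 + eps, w2, x, target)[2]
    down = forward(w1 - eps, w2, x, target)[2]
    return (up - down) / (2.0 * eps)

grad_w1, grad_w2 = backward(0.5, -1.2, 2.0, 1.0)
check_w1 = numerical_grad_w1(0.5, -1.2, 2.0, 1.0)
```

The analytic and numerical values for the w1 gradient agree to several decimal places, which is the usual way to test a hand-derived backward pass.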

Supervised, unsupervised, and reinforcement learning

Supervised learning trains on labeled examples: input-output pairs where a human has annotated the correct answer. The model learns a mapping from inputs to outputs. Image classifiers, spam detectors, and most LLM fine-tuning fall here.

Unsupervised learning finds structure in data without labels. Clustering algorithms group similar items. Autoencoders learn compressed representations of inputs. Generative models learn the underlying distribution of the data. LLM pre-training is, strictly speaking, self-supervised: the model predicts the next token from the preceding ones, so the labels are manufactured from the raw data itself. Self-supervision is usually treated as a form of unsupervised learning.
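The way self-supervision manufactures labels from raw data can be shown directly. In this sketch the "tokens" are whole words for readability, which is an assumption; real LLMs use subword tokenizers.

```python
def next_token_pairs(tokens):
    """Each prefix becomes an input; the token that follows it becomes the label."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

pairs = next_token_pairs(["the", "cat", "sat"])
# pairs == [(["the"], "cat"), (["the", "cat"], "sat")]
```

No human annotated anything here: every training label is simply the next item in the text.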

Reinforcement learning trains an agent that takes actions in an environment and receives reward or penalty signals over time. There is no labeled dataset. The agent learns a policy that maximizes cumulative reward through trial and error. AlphaGo used RL. RLHF (reinforcement learning from human feedback) is used to align LLMs to human preferences after initial pre-training.
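Trial-and-error learning from reward can be illustrated with a two-armed bandit, which is the simplest reinforcement learning setting (there is no environment state, only a choice of action). The epsilon-greedy strategy, the reward probabilities, and the function name below are assumptions chosen for the sketch; systems like AlphaGo use far more elaborate methods.

```python
import random

def run_bandit(reward_probs, steps=2000, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    estimates = [0.0] * len(reward_probs)   # running average reward per action
    counts = [0] * len(reward_probs)
    for _ in range(steps):
        if rng.random() < epsilon:
            # Explore: try a random action.
            action = rng.randrange(len(reward_probs))
        else:
            # Exploit: take the action currently believed to be best.
            action = max(range(len(reward_probs)), key=lambda a: estimates[a])
        reward = 1.0 if rng.random() < reward_probs[action] else 0.0
        counts[action] += 1
        # Incremental update of the average reward for this action.
        estimates[action] += (reward - estimates[action]) / counts[action]
    return estimates

estimates = run_bandit([0.2, 0.8])
```

The agent is never told which action is better. It discovers that the second action pays off more often purely from the rewards it receives.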

Overfitting vs. generalization

A model that fits its training data too closely memorizes noise and specific examples rather than learning the underlying pattern. It scores well on training data but fails on new inputs. This is overfitting. The opposite failure, underfitting, is a model too simple to capture the pattern even in the training data.

Generalization is the ability to perform well on new data not seen during training. Techniques to improve generalization include dropout (randomly deactivating neurons during training to prevent co-adaptation), weight regularization (penalizing large weight values), data augmentation (artificially expanding the training set), and early stopping (halting training before the model over-memorizes).
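Of the techniques listed, early stopping is the easiest to sketch. The version below assumes a list of per-epoch validation losses is already available and uses a hypothetical `patience` parameter (how many non-improving epochs to tolerate); the loss values are invented for illustration.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return the index of the epoch whose weights would be kept."""
    best_epoch, best_loss, waited = 0, float("inf"), 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_epoch, best_loss, waited = epoch, loss, 0
        else:
            waited += 1
            if waited >= patience:
                break   # validation loss has stopped improving
    return best_epoch

val_losses = [0.90, 0.70, 0.60, 0.62, 0.65, 0.50]
stop = early_stopping_epoch(val_losses)
```

Here training halts after two consecutive epochs without improvement and keeps the weights from epoch 2. The later value of 0.50 is never reached, which is the trade-off a small patience setting makes.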

04. Key Terms and Model Types

Features are the input variables the model learns from. In classical machine learning, features are often hand-engineered (pixel intensities, word counts, age, income). Deep learning learns features automatically from raw data.

Regression predicts a continuous value. Linear regression is the simplest case: a weighted sum of inputs. Neural networks can fit non-linear regression problems.
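For a single input feature, linear regression has a closed-form least-squares solution, sketched below. The data points are made up and lie exactly on the line y = 2x + 1 so the result is easy to verify by eye.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for one input feature."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept

def predict(x, slope, intercept):
    # A weighted input plus a bias: the same shape as one neuron without an activation.
    return slope * x + intercept

slope, intercept = fit_line([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
```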

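Classification predicts a discrete category. Logistic regression is a classic binary classifier. Support vector machines find the decision boundary that maximizes the margin between classes. Neural networks extend these ideas: a network with a sigmoid or softmax output layer is, in effect, logistic regression applied to features the network has learned itself.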

CNNs (convolutional neural networks) were the dominant architecture for image tasks before transformers. They apply learned filters across local regions of an input image, detecting edges, textures, and eventually complex shapes in a hierarchical way. Each convolutional layer detects progressively more abstract features. Because the same filter weights are reused at every position (parameter sharing), CNNs need far fewer parameters than fully connected networks on grid-structured data.
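The sliding-filter operation can be written out for a tiny grayscale image. This sketch makes several simplifying assumptions: no padding, a stride of 1, a single channel, and a filter set by hand rather than learned. Like deep learning frameworks, it does not flip the kernel, so strictly speaking it computes a cross-correlation.

```python
def conv2d(image, kernel):
    """Slide the kernel over the image and sum the elementwise products."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + di][j + dj] * kernel[di][dj]
                for di in range(kh) for dj in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]

# Dark left half, bright right half.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
# Responds where brightness increases from left to right.
vertical_edge = [
    [-1, 1],
    [-1, 1],
]
feature_map = conv2d(image, vertical_edge)
# feature_map == [[0, 2, 0], [0, 2, 0]]
```

The same four filter weights are reused at every position, and the output is non-zero only in the column where the edge sits.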

RNNs (recurrent neural networks) and their variant LSTMs (long short-term memory networks) process sequential data like text by maintaining a hidden state that carries information from one timestep to the next. They were the dominant NLP architecture until around 2017. Their critical weakness is that the hidden state compresses all prior context into a fixed-size vector, making it hard to retain information from far back in a sequence (LSTMs ease this problem but do not remove it). They also cannot be parallelized across timesteps, which makes training slow.
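Both weaknesses show up even in a scalar toy RNN. The sketch below is a plain (non-LSTM) recurrence with hand-picked weights, which are assumptions for illustration only.

```python
import math

def rnn_step(hidden, x, w_h, w_x, b):
    """One timestep: the new state mixes the old state with the new input."""
    return math.tanh(w_h * hidden + w_x * x + b)

def run_rnn(sequence, w_h=0.5, w_x=1.0, b=0.0):
    hidden = 0.0
    for x in sequence:
        # Inherently sequential: step t needs the state produced by step t-1.
        hidden = rnn_step(hidden, x, w_h, w_x, b)
    return hidden

signal_early = run_rnn([1.0, 0.0, 0.0, 0.0])
signal_late = run_rnn([0.0, 0.0, 0.0, 1.0])
```

The same input leaves a much weaker trace in the final state when it arrives early in the sequence than when it arrives last, and the loop cannot be split across timesteps because each step depends on the previous one.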

05. Examples and Analogies

Gradient descent is often compared to descending a foggy mountain by always taking a step in the downhill direction. The loss landscape is the mountain. The goal is to reach the lowest valley. The learning rate controls how large each step is. Too large and you overshoot the valley. Too small and you take forever.

Overfitting is like a student who memorizes every question on past exams word-for-word but cannot answer a rephrased version of the same concept.

Supervised learning is like studying with an answer key. Unsupervised learning is like trying to organize a pile of documents into topics without being told what topics exist. Reinforcement learning is like learning to ride a bicycle: no one annotates each micro-movement as correct or incorrect, but you feel the reward of staying upright.

06. How This Leads to LLMs

The transformer architecture (introduced in "Attention Is All You Need," Vaswani et al., 2017) solved the RNN's two core problems. Self-attention allows a model to relate every token in a sequence to every other token directly, without compressing past context into a single state. And because attention computations are independent across positions, they can run in parallel, making training on massive datasets tractable.
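A stripped-down version of self-attention shows how every token is related to every other token directly. This sketch makes strong simplifying assumptions: the queries, keys, and values are the input vectors themselves (a real transformer applies learned projection matrices to each), there is a single head, and no causal mask is applied.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Scaled dot-product self-attention with queries = keys = values = inputs."""
    d = len(vectors[0])
    outputs, all_weights = [], []
    for query in vectors:
        # Compare this token against every token in the sequence.
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
                  for key in vectors]
        weights = softmax(scores)
        # The output is a weighted mix of all token vectors.
        mixed = [sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(d)]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(tokens)
```

Each iteration of the outer loop depends only on the inputs, not on any other iteration, which is why the per-position computations can run in parallel.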

LLMs are transformer-based networks pre-trained on enormous text corpora using the self-supervised next-token prediction objective. The network learns to assign high probability to the actual next token given all preceding tokens, which requires learning grammar, facts, reasoning patterns, and world knowledge to do well. Scale (more parameters, more data, more compute) consistently produces more capable models, an empirical observation formalized in the neural scaling laws (Kaplan et al., 2020), which found that loss falls predictably, roughly as a power law, as each of the three is increased.

Everything covered above (weights, gradient descent, backpropagation, overfitting, supervised learning, the RNN) is the foundation on which transformers and LLMs stand. The architecture changed. The optimization loop, the loss function, the basic neuron math, and the need for generalization did not.

07. Common Pitfalls and Misconceptions

"More data always fixes everything."
More data helps, but if the data is mislabeled, biased, or non-representative of the target distribution, more of it compounds the problem.

"Deep learning models understand what they're doing."
They compute weighted sums and activations. The impressive outputs emerge from scale and optimization, not comprehension in the human sense.

"Training and evaluation on the same data is fine."
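This is the most blatant form of data leakage. Performance measured on the training data almost always looks better than real-world performance, because the model may simply have memorized those examples. A held-out test set that the model never sees during training is mandatory.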

"Bigger networks always generalize better."
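Beyond a certain point, bigger networks without enough data or regularization overfit more severely. Modern LLMs avoid this largely by scaling the training dataset along with the model. RLHF serves a different purpose, aligning outputs with human preferences, and is not a remedy for overfitting.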
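The held-out test set called for in the pitfall about evaluating on training data can be carved out with a simple shuffled split. This is a minimal sketch using only the standard library; the function name, the 20% test fraction, and the fixed seed are assumptions, and most ML libraries ship their own version of this utility.

```python
import random

def train_test_split(examples, test_fraction=0.2, seed=0):
    """Shuffle a copy of the data and reserve a fraction for evaluation only."""
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

train, test = train_test_split(list(range(10)))
```

The two sets share no examples, so a score measured on `test` reflects performance on data the model has not seen.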