03. How It Works
Linear regression
Fits a straight line (or hyperplane in multiple dimensions) through the training data by minimizing the sum of squared residuals between predicted and actual values. The result is a set of coefficients, one per input feature, each telling you how much the output changes per unit increase in that feature. Ordinary least squares has a closed-form analytical solution; no iterative training is required for small datasets.
Logistic regression
Extends linear regression to binary classification by squashing the linear output through a sigmoid function, which maps any real number to the 0-1 range and can be interpreted as a probability. The decision boundary is still linear (a hyperplane), which makes logistic regression fast, auditable, and a natural baseline for any binary classification problem.
Decision trees
Recursively split the feature space into axis-aligned rectangular regions by choosing at each node the feature and threshold that maximizes information gain (or minimizes Gini impurity). The result is a tree of if-then rules. A single deep decision tree overfits severely because it memorizes idiosyncratic patterns in the training data.
Random forests
An ensemble of decision trees, each trained on a bootstrap sample of the data (sampling with replacement) and considering only a random subset of features at each split. The final prediction is the majority vote (classification) or average (regression) across all trees. Randomness decorrelates the individual trees, so their errors partially cancel. Random forests were introduced by Leo Breiman in 2001 and correct for the individual tree's tendency to overfit.
Gradient boosting (XGBoost)
Builds an ensemble of shallow trees sequentially rather than in parallel. Each new tree is trained to predict the residual errors of the current ensemble. This additive process progressively reduces the training error. XGBoost (Chen and Guestrin, 2016) added L1/L2 regularization, column subsampling, and efficient handling of sparse data, making it the algorithm that dominated Kaggle tabular competitions for several years. Gradient boosting is generally more accurate than random forests on tabular data but more prone to overfitting if hyperparameters are not tuned.
Support vector machines (SVM)
Find the hyperplane that maximally separates two classes by maximizing the margin: the distance between the decision boundary and the nearest training examples from each class. Those nearest examples are called support vectors. The kernel trick allows SVMs to fit nonlinear decision boundaries by implicitly mapping the input into a higher-dimensional space. SVMs were state of the art for text classification, image recognition, and bioinformatics problems throughout the 2000s before deep learning superseded them on image and language tasks.
K-nearest neighbors (KNN)
A non-parametric algorithm that stores all training examples and classifies a new input by finding its k most similar training examples and returning their majority label. Similarity is typically measured by Euclidean distance. KNN has no explicit training phase (it is lazy), but prediction is slow for large datasets because every query requires scanning all stored examples.
K-means clustering
Partitions n data points into k clusters by iteratively assigning each point to the nearest centroid and recomputing centroids as the mean of assigned points. Repeats until assignments stabilize. K must be chosen in advance. K-means is sensitive to initialization; the k-means++ initialization algorithm (Arthur and Vassilvitskii, 2007) selects initial centroids to be far apart, reducing the chance of poor local minima.
PCA (Principal Component Analysis)
Computes the directions of maximum variance in the data and projects the data onto the top k of those directions, called principal components. The result is a lower-dimensional representation that preserves as much information as possible. PCA is used to reduce feature dimensionality before training other models, to visualize high-dimensional data in 2D or 3D, and to remove correlated features that confuse linear models.