Classic ML Algorithms

In Short

Before deep learning dominated the headlines, a small set of algorithms solved most practical prediction problems and still do. Linear regression, decision trees, random forests, gradient boosting, SVMs, k-nearest neighbors, k-means, and PCA remain the workhorses of production ML because they are fast, interpretable, and require far less data and compute than neural networks for many tasks.

01. What It Is

Classic machine learning algorithms are statistical techniques that learn a mapping from input features to an output by optimizing an objective function over labeled (or unlabeled) data. They predate deep learning and were the primary tools of applied ML from the 1950s through the 2010s.

Supervised learning uses labeled training pairs: each example has an input (features) and a known output (label). The model learns to predict the label from the features. Linear regression, logistic regression, decision trees, random forests, gradient boosting, SVMs, and k-nearest neighbors are all supervised algorithms.

Unsupervised learning has no labels. The goal is to find structure in the data itself: cluster similar examples together, or reduce the number of dimensions while preserving the most information. K-means clustering and PCA are the canonical unsupervised algorithms.

02. Why It Matters

Two reasons these algorithms still power most production ML systems.

First, interpretability. A bank denying a loan must be able to explain the decision. A linear model or a shallow decision tree can be audited. A 70-billion-parameter neural network cannot.

Second, data efficiency. Deep learning typically requires tens of thousands to millions of labeled examples to generalize. Many real-world problems have hundreds or thousands of examples. A random forest or gradient-boosted tree will dramatically outperform a neural network in that regime.

According to a 2022 Kaggle survey, gradient boosting (XGBoost, LightGBM, CatBoost) and logistic regression were cited more often than neural networks among the most regularly used algorithms.

03. How It Works

Linear regression

Fits a straight line (or hyperplane in multiple dimensions) through the training data by minimizing the sum of squared residuals between predicted and actual values. The result is a set of coefficients, one per input feature, each telling you how much the output changes per unit increase in that feature. Ordinary least squares has a closed-form analytical solution. No iterative training is required for small datasets.

Logistic regression

Extends linear regression to binary classification by squashing the linear output through a sigmoid function, which maps any real number to the 0-1 range and can be interpreted as a probability. The decision boundary is still linear (a hyperplane), which makes logistic regression fast, auditable, and a natural baseline for any binary classification problem.

Decision trees

Recursively split the feature space into axis-aligned rectangular regions by choosing at each node the feature and threshold that maximizes information gain (or minimizes Gini impurity). The result is a tree of if-then rules. A single deep decision tree overfits severely because it memorizes idiosyncratic patterns in the training data.

Random forests

An ensemble of decision trees, each trained on a bootstrap sample of the data (sampling with replacement) and considering only a random subset of features at each split. The final prediction is the majority vote (classification) or average (regression) across all trees. Randomness decorrelates the individual trees, so their errors partially cancel. Random forests were introduced by Leo Breiman in 2001 and correct for the individual tree's tendency to overfit.

Gradient boosting (XGBoost)

Builds an ensemble of shallow trees sequentially rather than in parallel. Each new tree is trained to predict the residual errors of the current ensemble. This additive process progressively reduces the training error. XGBoost (Chen and Guestrin, 2016) added L1/L2 regularization, column subsampling, and efficient handling of sparse data, making it the algorithm that dominated Kaggle tabular competitions for several years. Gradient boosting is generally more accurate than random forests on tabular data but more prone to overfitting if hyperparameters are not tuned.

Support vector machines (SVM)

Find the hyperplane that maximally separates two classes by maximizing the margin: the distance between the decision boundary and the nearest training examples from each class. Those nearest examples are called support vectors. The kernel trick allows SVMs to fit nonlinear decision boundaries by implicitly mapping the input into a higher-dimensional space. SVMs were state of the art for text classification, image recognition, and bioinformatics problems throughout the 2000s before deep learning superseded them on image and language tasks.

K-nearest neighbors (KNN)

A non-parametric algorithm that stores all training examples and classifies a new input by finding its k most similar training examples and returning their majority label. Similarity is typically measured by Euclidean distance. KNN has no explicit training phase (it is lazy), but prediction is slow for large datasets because every query requires scanning all stored examples.

K-means clustering

Partitions n data points into k clusters by iteratively assigning each point to the nearest centroid and recomputing centroids as the mean of assigned points. Repeats until assignments stabilize. K must be chosen in advance. K-means is sensitive to initialization. The k-means++ initialization algorithm (Arthur and Vassilvitskii, 2007) selects initial centroids to be far apart, reducing the chance of poor local minima.

PCA (Principal Component Analysis)

Computes the directions of maximum variance in the data and projects the data onto the top k of those directions, called principal components. The result is a lower-dimensional representation that preserves as much information as possible. PCA is used to reduce feature dimensionality before training other models, to visualize high-dimensional data in 2D or 3D, and to remove correlated features that confuse linear models.

04. The Bias-Variance Tradeoff

Every model makes two types of error. Bias is error from wrong assumptions in the model: a linear model fitting a curved relationship has high bias. Variance is error from sensitivity to fluctuations in the training data: a deep tree that perfectly memorizes the training set has high variance and poor generalization.

Increasing model complexity reduces bias but increases variance. The goal is to find the sweet spot. Cross-validation is the standard technique for finding that point: partition the data into k folds, train on k-1 folds, evaluate on the held-out fold, rotate, and average the results. K-fold cross-validation gives a reliable estimate of generalization error without setting aside a separate validation set, though a held-out test set is still used for final evaluation.

05. Key Terms

Overfitting:
The model memorizes noise in the training data and performs poorly on new data. Regularization: Adding a penalty term (L1 = Lasso, L2 = Ridge) to the loss function to discourage large coefficients and reduce overfitting. Feature importance: In tree-based models, a measure of how much each feature contributes to reducing prediction error across all splits. Ensemble: A collection of models whose predictions are combined. Random forests and gradient boosting are ensembles of decision trees. Hyperparameter: A parameter set before training (number of trees, learning rate, max depth) as opposed to a parameter learned during training.

06. Examples

A fraud detection model at a bank uses logistic regression as a fast, interpretable baseline and gradient boosting as the production model. The logistic model is used to explain rejections to regulators. The gradient-boosted model is used for actual decisions.

A manufacturing plant clusters sensor readings using k-means to identify normal operating modes, then flags readings that fall outside any cluster as anomalies requiring inspection.

A genetics researcher uses PCA to reduce 20,000 gene expression features to 50 principal components before feeding the result to a random forest classifier, because the random forest would overfit massively on 20,000 features with only 500 training samples.

07. Common Pitfalls and Misconceptions

"Deep learning is always better."
On structured tabular data with moderate dataset sizes, gradient boosting consistently matches or outperforms neural networks with a fraction of the compute. Choosing the right algorithm for the data type and scale matters more than defaulting to the most recent architecture.

"K-fold cross-validation prevents overfitting."
Cross-validation measures generalization. It does not prevent it. The model still overfits to the training folds. Cross-validation just tells you how severe that overfitting is so you can compare hyperparameter settings fairly.

"More features always help."
Adding irrelevant or noisy features hurts linear models and KNN. Dimensionality reduction (PCA, feature selection) is often necessary before applying these algorithms to high-dimensional data.

"SVMs are obsolete."
SVMs with tuned kernels still match or beat neural networks on small biological and medical datasets where sample sizes are in the hundreds.