Data and Datasets

In Short

Machine learning models are only as good as the data they are trained on. Data collection, labeling quality, proper train/validation/test splits, and careful handling of class imbalance and leakage determine whether a model generalizes in the real world. The data-centric AI movement formalizes what practitioners have known for decades: for many problems, improving data quality yields bigger gains than improving model architecture.

01. What It Is

A dataset in machine learning is a structured collection of examples used to train, tune, and evaluate a model. Each example typically consists of one or more input features and, for supervised tasks, a label (the correct output). Datasets exist in every modality: text, images, audio, tabular records, time series, and multi-modal combinations.

Famous benchmark datasets include:

MNIST:
70,000 grayscale images of handwritten digits (0-9), 28x28 pixels each, split into 60,000 training and 10,000 test images (LeCun et al., 1998). The standard "hello world" benchmark for image classification. Even a small convolutional network should exceed 99% test accuracy, so a deep model that falls well short of that is almost certainly broken.

ImageNet:
More than 14 million labeled images across over 20,000 categories, with a competition subset (ILSVRC) of 1.2 million training images across 1,000 categories. The 2012 ImageNet Large Scale Visual Recognition Challenge was won by a deep CNN (AlexNet) with a top-5 error of about 15%, against roughly 26% for the runner-up. That gap is widely cited as the moment that triggered the deep learning revolution.

Common Crawl:
A petabyte-scale archive of raw web data, released by the non-profit organization of the same name, which has been crawling the web since 2008. Filtered Common Crawl text is the primary pre-training corpus (or a major component of it) for many large language models, including GPT-3 and LLaMA.

SQuAD, GLUE, SuperGLUE, MMLU:
Benchmark datasets for evaluating language model capabilities: reading comprehension, natural language inference, and broad knowledge across academic subjects.

02. Why It Matters

A model trained on biased, mislabeled, or unrepresentative data will reflect those flaws in its predictions, regardless of architectural sophistication. Andrew Ng's data-centric AI initiative (2021) argued that for many production ML problems, the bottleneck is data quality, not model architecture. In those settings, fixing the data often yields larger accuracy improvements than swapping out the model.

The phrase "garbage in, garbage out" predates machine learning, but it applies with particular force here: unlike a human programmer who might notice that input data looks wrong, a model will confidently learn to replicate whatever pattern is in the data, even if that pattern is noise, bias, or annotation error.

03. How It Works

Data collection

Data comes from web scraping, user interactions, sensors, existing databases, purchased datasets, or manual creation. The source determines the biases. Web-scraped text reflects the demographics, languages, and viewpoints overrepresented on the internet. Medical imaging data collected at a single hospital reflects that hospital's patient population, equipment, and radiologist annotation style.

Labeling and annotation

Supervised learning requires labeled data. Labeling is done by human annotators, crowdsourcing platforms (such as Amazon Mechanical Turk), domain experts, or programmatic heuristics (weak supervision). Label quality caps the accuracy a model can reach: a model cannot reliably learn distinctions that the labels themselves get wrong. Disagreements between annotators reveal ambiguous cases where even humans cannot agree on the correct label. Most production annotation pipelines define explicit labeling guidelines and measure inter-annotator agreement with chance-corrected metrics such as Cohen's kappa.
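As a rough illustration of how agreement is measured, the sketch below computes Cohen's kappa for two annotators: observed agreement, corrected for the agreement expected by chance from each annotator's label frequencies. The function name and the toy "cat"/"dog" labels are made up for this example.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty set of items")
    n = len(labels_a)

    # Observed agreement: fraction of items given the same label by both.
    p_observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: probability that both pick the same label if each
    # annotator labeled independently according to their own frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    p_expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)

    if p_expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (p_observed - p_expected) / (1.0 - p_expected)


# Toy data (hypothetical): the annotators agree on 8 of 10 items.
annotator_1 = ["cat", "cat", "dog", "dog", "cat", "dog", "cat", "dog", "cat", "cat"]
annotator_2 = ["cat", "cat", "dog", "cat", "cat", "dog", "dog", "dog", "cat", "cat"]

kappa = cohens_kappa(annotator_1, annotator_2)
print(f"kappa = {kappa:.3f}")  # 80% raw agreement, but kappa is about 0.583
```

Raw agreement here is 80%, yet kappa is noticeably lower because two annotators who each say "cat" 60% of the time would already agree on about half the items by chance.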

Train, validation, and test splits

The training set is used to fit model parameters. The validation set (also called the development set) is used to tune hyperparameters and compare model architectures. The test set is held out until final evaluation and touched exactly once. This three-way split prevents the model from being optimized, even indirectly, against the test set.

Typical splits for medium-sized datasets are 60-70% train, 10-20% validation, 10-20% test. For large datasets (millions of examples), validation and test sets can be smaller fractions because even 1% of a million examples is 10,000 samples, which is sufficient for reliable evaluation.
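A minimal sketch of the three-way split described above, assuming the examples fit in memory and can be shuffled freely (no time ordering and no groups that must stay together). The function name `train_val_test_split` and the 70/15/15 defaults are illustrative choices, not a library API.

```python
import random


def train_val_test_split(examples, val_fraction=0.15, test_fraction=0.15, seed=0):
    """Shuffle once with a fixed seed, then cut into three disjoint parts."""
    if not 0 < val_fraction + test_fraction < 1:
        raise ValueError("validation and test fractions must leave room for training data")

    shuffled = list(examples)              # copy so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)  # fixed seed makes the split reproducible

    n_test = round(len(shuffled) * test_fraction)
    n_val = round(len(shuffled) * val_fraction)

    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test


examples = list(range(1000))  # stand-in for real (features, label) pairs
train, val, test = train_val_test_split(examples)
print(len(train), len(val), len(test))  # 700 150 150
```

Fixing the seed matters more than it looks: if the split changes between runs, examples drift between the training and test sets and the test set is no longer truly held out.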

Data quality and class imbalance

Real-world datasets are rarely balanced. Fraud might account for 0.1% of transactions; a rare disease might appear in 1 in 10,000 patients. A naive model that always predicts the majority class achieves 99.9% accuracy on such a fraud dataset while detecting zero fraud cases. Class imbalance calls for strategies such as oversampling the minority class (for example with SMOTE, which synthesizes new minority examples), undersampling the majority class, adjusting class weights in the loss function, and evaluating with precision and recall instead of accuracy.
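The accuracy trap can be reproduced in a few lines. The sketch below builds a hypothetical dataset of 10,000 transactions with 10 fraud cases, scores an "always legitimate" classifier, and computes inverse-frequency class weights (one common heuristic; the helper names are invented for this example).

```python
from collections import Counter


def precision_recall(y_true, y_pred, positive=1):
    """Precision and recall for the positive class; 0.0 when undefined."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def inverse_frequency_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * count) for cls, count in counts.items()}


# Hypothetical data: 10 fraud cases (label 1) among 10,000 transactions.
y_true = [1] * 10 + [0] * 9990
y_pred = [0] * len(y_true)  # a "model" that always predicts the majority class

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
precision, recall = precision_recall(y_true, y_pred)
weights = inverse_frequency_weights(y_true)

print(f"accuracy={accuracy:.3f} precision={precision:.1f} recall={recall:.1f}")
print(weights)  # the fraud class gets a weight of 500.0
```

Accuracy is 0.999 while recall is 0.0, which is why the metric has to change along with the training strategy.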

Data leakage

Leakage occurs when information that would not be available at prediction time is inadvertently included in the training data. A classic example: a model meant to flag readmission risk while the patient is still in hospital that includes the discharge summary as a feature, even though the summary is written only after that prediction is needed. Leakage inflates training and validation performance, producing a model that looks strong offline and then fails in production. Temporal leakage (training on information from after the moment the prediction is made) is one of the most common forms.
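One simple guard against temporal leakage is to split by time instead of at random, and to keep only features that exist at prediction time. The sketch below assumes records are dictionaries with a `date` field; the field names, the cutoff, and the allow-list are all hypothetical.

```python
from datetime import date


def temporal_split(records, cutoff):
    """Train on records strictly before the cutoff, evaluate on the rest."""
    train = [r for r in records if r["date"] < cutoff]
    evaluation = [r for r in records if r["date"] >= cutoff]
    return train, evaluation


def keep_available_features(record, available_at_prediction):
    """Drop every field that would not exist yet when the prediction is made."""
    return {k: v for k, v in record.items() if k in available_at_prediction}


# Hypothetical records; "chargeback_filed" is only known after the outcome.
records = [
    {"date": date(2023, 1, 5), "amount": 20.0, "chargeback_filed": False, "label": 0},
    {"date": date(2023, 2, 9), "amount": 950.0, "chargeback_filed": True, "label": 1},
    {"date": date(2023, 3, 14), "amount": 35.5, "chargeback_filed": False, "label": 0},
    {"date": date(2023, 4, 2), "amount": 410.0, "chargeback_filed": True, "label": 1},
]

train, evaluation = temporal_split(records, cutoff=date(2023, 3, 1))
allowed = {"date", "amount", "label"}
train_clean = [keep_available_features(r, allowed) for r in train]

print(len(train), len(evaluation))  # 2 2
print(sorted(train_clean[0]))       # ['amount', 'date', 'label']
```

A random split of the same records could place April transactions in training and January ones in evaluation, which quietly lets the model learn from the future.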

Data augmentation

Data augmentation artificially expands the training set by applying transformations that preserve the label. For images: random crops, flips, rotations, color jitter, Gaussian noise. For text: synonym replacement, back-translation, random insertion and deletion. For audio: time stretching, pitch shifting, adding background noise. Augmentation reduces overfitting by showing the model multiple variations of each training example.
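To make the image case concrete, here is a small sketch of three of the transformations listed above, written against a plain list-of-rows grayscale image so it needs no external libraries. The function names, the crop size, and the noise level are illustrative assumptions; a real pipeline would normally use an image library.

```python
import random


def horizontal_flip(image):
    """Mirror each row left to right."""
    return [row[::-1] for row in image]


def random_crop(image, crop_h, crop_w, rng):
    """Cut a crop_h x crop_w window at a random position."""
    top = rng.randint(0, len(image) - crop_h)
    left = rng.randint(0, len(image[0]) - crop_w)
    return [row[left:left + crop_w] for row in image[top:top + crop_h]]


def add_gaussian_noise(image, sigma, rng):
    """Add zero-mean Gaussian noise and clamp pixels to [0, 255]."""
    return [[min(255.0, max(0.0, px + rng.gauss(0.0, sigma))) for px in row]
            for row in image]


def augment(image, label, rng, allow_flip=True):
    """Return a randomly transformed copy of the image with the same label."""
    out = random_crop(image, 3, 3, rng)
    # Flip only when mirroring keeps the label valid for this domain.
    if allow_flip and rng.random() < 0.5:
        out = horizontal_flip(out)
    out = add_gaussian_noise(out, sigma=5.0, rng=rng)
    return out, label


rng = random.Random(0)
image = [[float(16 * r + c) for c in range(4)] for r in range(4)]  # toy 4x4 image
label = 7

augmented, augmented_label = augment(image, label, rng)
print(len(augmented), len(augmented[0]), augmented_label)  # 3 3 7
```

The label is passed through untouched, which is the whole contract of augmentation: the transformation may change the pixels but must not change what the example means.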

04. Key Terms

Label: The correct output for a supervised learning example. Also called annotation or ground truth.
Features: The input variables. Raw features are the original data; engineered features are derived from it.
Train set: Data used to fit model parameters.
Validation set: Data used to tune hyperparameters and compare models. Also called dev set.
Test set: Data held out for final evaluation. Touched only once.
Data leakage: The inclusion of information in training data that would not be available at prediction time, causing inflated apparent performance.
Class imbalance: A distribution of labels where one class appears far more frequently than others.
Augmentation: Generating new training examples by applying label-preserving transformations to existing ones.
Annotation agreement (Cohen's kappa): A chance-corrected measure of label consistency between two human annotators.
Data-centric AI: The approach of improving model performance by improving data quality rather than model architecture.

05. Examples

ImageNet demonstrated that scale matters: going from tens of thousands to over a million labeled images enabled the deep CNN revolution of 2012. The quality and diversity of the labeling (each image verified by multiple annotators) was as important as the scale.

Common Crawl illustrates the trade-off between scale and quality. The raw crawl contains spam, boilerplate, and toxic content. GPT-3 training filtered Common Crawl with a quality classifier trained on curated sources and then deduplicated it, which discarded the large majority of the raw data: the GPT-3 paper reports about 45 TB of compressed text before filtering and 570 GB after. Careful filtering, not just raw scale, produced a useful pre-training corpus.

A medical imaging startup training a tumor classifier discovered after deployment that images from their partner hospital were 30% larger than training images due to a scanner resolution difference. The model's accuracy dropped 15 percentage points because the "important texture" it learned was a function of resolution, not actual pathology. This is distributional shift: a data quality problem that no architecture improvement could have fixed.

06. Common Pitfalls and Misconceptions

"Bigger datasets always win."
Dataset size matters, but unfiltered scale can hurt. A billion noisy or mislabeled examples can produce a worse model than 100 million carefully curated ones.

"Evaluating on the validation set is fine for final reporting."
Repeatedly tuning against the validation set causes the model to indirectly overfit to it. The test set must be touched only at the end, once, after all decisions are locked in.
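The effect is easy to simulate. In the sketch below every candidate "model" is a coin flip with no real skill, yet picking the best of 200 candidates by validation accuracy produces a validation score well above chance, while the same candidate's test accuracy stays near 50%. All sizes and seeds are arbitrary choices for illustration.

```python
import random


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


data_rng = random.Random(0)
val_labels = [data_rng.randint(0, 1) for _ in range(100)]
test_labels = [data_rng.randint(0, 1) for _ in range(1000)]

best_val_acc = -1.0
best_test_acc = None
for candidate in range(200):
    # Each candidate predicts by coin flip: none of them has learned anything.
    model_rng = random.Random(1000 + candidate)
    val_preds = [model_rng.randint(0, 1) for _ in val_labels]
    test_preds = [model_rng.randint(0, 1) for _ in test_labels]

    # Model selection looks only at validation accuracy.
    val_acc = accuracy(val_labels, val_preds)
    if val_acc > best_val_acc:
        best_val_acc = val_acc
        best_test_acc = accuracy(test_labels, test_preds)

print(f"validation accuracy of selected candidate: {best_val_acc:.3f}")
print(f"test accuracy of the same candidate:       {best_test_acc:.3f}")
```

The gap between the two printed numbers is pure selection bias, which is exactly what an untouched test set is there to expose.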

"Augmentation is free improvement."
Augmentation that introduces a domain mismatch (e.g., flipping chest X-rays horizontally, which puts the heart on the wrong side and is not a normal anatomical variation) can introduce artifacts and hurt performance.

"Balanced datasets are always better."
Artificially balancing a heavily imbalanced dataset by massive oversampling of the minority class can lead to overfitting on rare examples. The right strategy depends on the specific cost asymmetry between false positives and false negatives.
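One way to act on that cost asymmetry without resampling at all is to choose the decision threshold that minimizes total misclassification cost. The sketch below uses made-up scores and costs; `total_cost` and `best_threshold` are hypothetical helper names.

```python
def total_cost(y_true, scores, threshold, cost_fp, cost_fn):
    """Sum the cost of every false positive and false negative at a threshold."""
    cost = 0.0
    for label, score in zip(y_true, scores):
        predicted = 1 if score >= threshold else 0
        if predicted == 1 and label == 0:
            cost += cost_fp
        elif predicted == 0 and label == 1:
            cost += cost_fn
    return cost


def best_threshold(y_true, scores, candidates, cost_fp, cost_fn):
    """Pick the candidate threshold with the lowest total cost."""
    return min(candidates,
               key=lambda t: total_cost(y_true, scores, t, cost_fp, cost_fn))


# Hypothetical model scores for eight examples (1 = positive class).
y_true = [0, 0, 0, 0, 1, 0, 1, 1]
scores = [0.1, 0.2, 0.3, 0.4, 0.45, 0.6, 0.7, 0.9]
candidates = [0.25, 0.5, 0.75]

# Missing a positive is 10x worse than a false alarm: prefer a low threshold.
miss_is_expensive = best_threshold(y_true, scores, candidates, cost_fp=1, cost_fn=10)
# A false alarm is 10x worse than a miss: prefer a high threshold.
alarm_is_expensive = best_threshold(y_true, scores, candidates, cost_fp=10, cost_fn=1)

print(miss_is_expensive, alarm_is_expensive)  # 0.25 0.75
```

The same scores lead to opposite operating points depending on the costs, which is why no single balancing recipe is right for every problem.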
