03. How It Works
Data collection
Data comes from web scraping, user interactions, sensors, existing databases, purchased datasets, or manual creation. The source determines the biases. Web-scraped text reflects the demographics, languages, and viewpoints overrepresented on the internet. Medical imaging data collected at a single hospital reflects that hospital's patient population, equipment, and radiologist annotation style.
Labeling and annotation
Supervised learning requires labeled data. Labeling is done by human annotators, crowdsourcing platforms (such as Amazon Mechanical Turk), domain experts, or programmatic heuristics (weak supervision). Label quality sets a ceiling on the accuracy a model can achieve and on how reliably that accuracy can be measured. Disagreements between annotators (inter-annotator disagreement) reveal ambiguous cases where even humans cannot agree on the correct label. Many production annotation pipelines define explicit labeling guidelines and measure inter-annotator agreement using metrics like Cohen's kappa, which corrects raw agreement for the agreement expected by chance.
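As a rough illustration of the agreement metric mentioned above, the sketch below computes Cohen's kappa for two annotators from first principles: kappa = (p_observed - p_expected) / (1 - p_expected). The function name cohens_kappa and the two annotation lists are hypothetical and exist only for this example.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items both annotators labeled identically.
    p_observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: probability of a match if each annotator assigned
    # labels independently according to their own label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    p_expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if p_expected == 1.0:
        # Kappa is undefined (0/0) when both annotators used one identical
        # label throughout; this sketch treats that as perfect agreement.
        return 1.0
    return (p_observed - p_expected) / (1 - p_expected)

# Hypothetical annotations from two annotators on the same ten items.
annotator_1 = ["spam", "spam", "ham", "ham", "spam", "ham", "ham", "spam", "ham", "ham"]
annotator_2 = ["spam", "ham", "ham", "ham", "spam", "ham", "spam", "spam", "ham", "ham"]

kappa = cohens_kappa(annotator_1, annotator_2)
print(f"kappa = {kappa:.3f}")
```

Here the annotators agree on 8 of 10 items, but because half of that agreement would be expected by chance, kappa comes out well below the raw 80% figure.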
Train, validation, and test splits
The training set is used to fit model parameters. The validation set (also called the development set) is used to tune hyperparameters and compare model architectures. The test set is held out until final evaluation and touched exactly once. This three-way split prevents the model from being optimized, even indirectly, against the test set.
Typical splits for medium-sized datasets are 60-70% train, 10-20% validation, 10-20% test. For large datasets (millions of examples), validation and test sets can be smaller fractions because even 1% of a million examples is 10,000 samples, which is sufficient for reliable evaluation.
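A three-way split of this kind might be sketched as follows. The helper three_way_split, its default fractions (70/15/15), and the fixed seed are assumptions chosen for illustration; they are one point inside the ranges quoted above, not a recommendation.

```python
import random

def three_way_split(examples, val_fraction=0.15, test_fraction=0.15, seed=0):
    """Shuffle once, then cut into train / validation / test partitions."""
    shuffled = list(examples)
    # A seeded generator keeps the split reproducible across runs.
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    n_val = round(len(shuffled) * val_fraction)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test

# Hypothetical dataset of 1,000 example IDs.
train, val, test = three_way_split(range(1000))
print(len(train), len(val), len(test))
```

Because the partitions are slices of a single shuffled list, no example can land in more than one of them. A purely random shuffle is only appropriate when examples are independent; grouped or time-ordered data needs a split that respects that structure.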
Data quality and class imbalance
Real-world datasets are rarely balanced. Fraud may account for only 0.1% of transactions. A rare disease may appear in 1 in 10,000 patients. A naive model that always predicts the majority class achieves 99.9% accuracy on such a fraud dataset while detecting zero fraud cases. Class imbalance calls for training strategies such as oversampling the minority class (for example with SMOTE, which synthesizes new minority examples), undersampling the majority class, or adjusting class weights in the loss function. It also calls for evaluating with precision and recall rather than accuracy, since accuracy hides failures on the minority class.
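The arithmetic behind the majority-class baseline, and one common way of deriving class weights, can be sketched as below. The label counts are made up to match the 0.1% fraud rate above. The weighting formula, n_samples / (n_classes * class_count), is an inverse-frequency heuristic assumed here for illustration (it is the one scikit-learn documents for its "balanced" class weights).

```python
from collections import Counter

# Hypothetical fraud labels: 1 = fraud, 0 = legitimate (0.1% positive rate).
y_true = [1] * 10 + [0] * 9990

# Naive baseline: always predict the majority class.
y_pred = [0] * len(y_true)

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
# Recall: fraction of actual fraud cases that the model caught.
recall = true_positives / sum(y_true)

# Inverse-frequency class weights: rarer classes get larger weights,
# so each class contributes equally to a weighted loss in aggregate.
counts = Counter(y_true)
class_weights = {c: len(y_true) / (len(counts) * k) for c, k in counts.items()}

print(f"accuracy = {accuracy:.3f}, recall = {recall:.3f}")
print(f"class weights = {class_weights}")
```

The baseline's accuracy is 0.999 while its recall is 0.0, which is why accuracy alone is misleading here. With these counts, an error on a fraud example is weighted 500.0, against roughly 0.5 for a legitimate one.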
Data leakage
Leakage occurs when information that would not be available at prediction time is inadvertently included in the training data. A classic example is a model predicting hospital readmission that includes the discharge summary as a feature when that summary is only written after the outcome is known. Because the leaked signal is usually present in the training and validation data alike, leakage inflates both training and validation performance, producing a model that fails catastrophically in production, where the signal is missing. Temporal leakage, in which a model is trained on data from later than the moment it is asked to predict, is one of the most common forms.
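One common guard against temporal leakage is to split by time rather than at random, so that every training record predates every evaluation record. The sketch below assumes each record carries an event date; the record layout, the temporal_split helper, and the cutoff date are all hypothetical.

```python
from datetime import date

# Hypothetical records: (event_date, features, label).
records = [
    (date(2023, 1, 5), {"amount": 20.0}, 0),
    (date(2023, 1, 19), {"amount": 310.0}, 1),
    (date(2023, 2, 2), {"amount": 45.5}, 0),
    (date(2023, 2, 27), {"amount": 12.0}, 0),
    (date(2023, 3, 8), {"amount": 980.0}, 1),
    (date(2023, 3, 21), {"amount": 33.0}, 0),
]

def temporal_split(rows, cutoff):
    """Train on rows strictly before the cutoff; hold out the rest."""
    train_rows = [r for r in rows if r[0] < cutoff]
    holdout_rows = [r for r in rows if r[0] >= cutoff]
    return train_rows, holdout_rows

train_rows, holdout_rows = temporal_split(records, cutoff=date(2023, 3, 1))

# Sanity check: no training record is dated on or after any held-out record.
latest_train = max(r[0] for r in train_rows)
earliest_holdout = min(r[0] for r in holdout_rows)
print(latest_train < earliest_holdout)
```

A time-based split only addresses leakage through the rows. Features computed from future information, such as aggregates that include later events, still need to be checked separately.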
Data augmentation
Data augmentation artificially expands the training set by applying transformations that preserve the label. For images: random crops, flips, rotations, color jitter, Gaussian noise. For text: synonym replacement, back-translation, random insertion and deletion. For audio: time stretching, pitch shifting, adding background noise. Augmentation reduces overfitting by showing the model multiple variations of each training example. A transformation only helps if it actually preserves the label for the task at hand; flipping an image of a handwritten digit, for instance, can change which digit it shows.
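A minimal sketch of label-preserving image augmentation follows, using plain nested lists as a stand-in for a grayscale image so that it runs without any imaging library. The function names, the flip probability, the noise level, and the tiny 2x3 "image" are illustrative assumptions.

```python
import random

def horizontal_flip(image):
    """Mirror each row of a 2D grayscale image (a list of pixel rows)."""
    return [row[::-1] for row in image]

def add_gaussian_noise(image, sigma, rng):
    """Add zero-mean Gaussian noise, clamping pixel values to [0, 255]."""
    return [
        [min(255.0, max(0.0, px + rng.gauss(0.0, sigma))) for px in row]
        for row in image
    ]

def augment(image, label, rng, flip_prob=0.5, sigma=5.0):
    """Return a randomly transformed copy of the image; the label is unchanged."""
    out = [list(row) for row in image]  # copy so the original stays intact
    if rng.random() < flip_prob:
        out = horizontal_flip(out)
    out = add_gaussian_noise(out, sigma, rng)
    return out, label

rng = random.Random(0)  # seeded for reproducibility
image = [[0, 64, 128], [32, 96, 255]]

# Generate four augmented variants of one labeled training example.
augmented = [augment(image, "cat", rng) for _ in range(4)]
```

In practice augmentation is usually applied on the fly during training, so the model sees a different variant of each example in every epoch rather than a fixed enlarged dataset.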