Synthetic Data

In Short

Synthetic data is machine-generated data that mimics the statistical properties of real data without containing actual records. It solves problems of data scarcity, privacy exposure, and class imbalance in ML training, but carries serious risks if a model is trained recursively on its own outputs.

01. What It Is

Synthetic data is any data produced by a computational process rather than collected from real-world events or people. The generated examples are designed to match the distribution, structure, and statistical properties of a target dataset, but the records themselves refer to no real individual or event.

In machine learning, "synthetic data" spans a range of techniques. At one end, simple augmentation transforms existing samples (flipping an image, adding noise to audio). At the other end, a large language model generates entirely new question-answer pairs, reasoning chains, or instruction examples that have no real-world counterpart at all.

02. Why It Matters

Four pressures push practitioners toward synthetic data.

Privacy and regulation:
Medical records, financial transactions, and mobile-device logs contain personally identifiable information. GDPR, HIPAA, and similar regulations make sharing raw data across organizational boundaries costly or illegal. Synthetic records that preserve statistical patterns but contain no real individuals can be shared or published without the same legal exposure. This is distinct from, but related to, de-identification (removing or masking direct identifiers in real data). Synthetic generation goes further by constructing new records from scratch.

Scarcity:
Rare diseases, uncommon fraud patterns, edge-case driving scenarios, and low-resource languages all suffer from the same problem: there is not enough real data to train a reliable model. Simulation engines and generative models can manufacture thousands of examples of a rare class where real-world collection would take years.

Class imbalance:
When one class appears far more often than another, a model trained on the real distribution tends to ignore the rare class, because predicting the majority class almost always minimizes average error. Synthetic oversampling techniques such as SMOTE (Synthetic Minority Over-sampling Technique) generate interpolated examples for the underrepresented class, which can improve recall without collecting more real data.
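
The interpolation idea behind SMOTE can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the function name `smote_sample`, its parameters, and the toy points are assumptions made for this example, and it uses only the Python standard library.

```python
import math
import random

def smote_sample(minority, k=3, n_new=5, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority-class neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority points by Euclidean distance to `base`.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        # Pick a random point on the line segment from base to neighbour.
        gap = rng.random()
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic

# Toy minority class with two features (hypothetical values).
minority = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.3), (1.1, 1.1)]
new_points = smote_sample(minority)
```

Every synthetic point lies on a segment between two real minority points, so it can never leave the region the minority class already occupies. In practice a maintained implementation, such as the SMOTE class in the imbalanced-learn package, would normally be used instead.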

Cost:
Labeling real data requires human annotators. Generating labeled synthetic examples from a teacher model or a simulator can be orders of magnitude cheaper. A case study from Hugging Face (2024) reported that a custom RoBERTa model trained on synthetic labels from an open-source LLM could analyze a large news corpus for approximately $2.70, compared to roughly $3,061 for GPT-4 API calls on the same task, at comparable accuracy. That is a cost difference of more than a thousandfold.

03. How It Works

Simulation:
Physics-based simulators produce labeled training data for robotics, autonomous vehicles, and game AI. Unreal Engine, Isaac Sim, and similar tools render photorealistic scenes with automatically generated ground-truth annotations (bounding boxes, depth maps, semantic segmentation).

Generative Adversarial Networks (GANs):
A generator network produces candidate samples and a discriminator network tries to distinguish them from real data. The two networks train adversarially until the generator reliably fools the discriminator. GANs are widely used for image synthesis and tabular data generation.
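
The adversarial game described above is commonly written as the minimax objective from the original GAN formulation (Goodfellow et al., 2014). The notation is not from this article: G is the generator, D the discriminator, p_data the real data distribution, and p_z the noise prior the generator samples from.

```latex
\min_{G} \max_{D} \; V(D, G)
  = \mathbb{E}_{x \sim p_{\text{data}}}\big[\log D(x)\big]
  + \mathbb{E}_{z \sim p_{z}}\big[\log\big(1 - D(G(z))\big)\big]
```

The discriminator pushes D(x) toward 1 on real samples and D(G(z)) toward 0 on generated ones, while the generator pushes D(G(z)) toward 1. At the theoretical optimum the discriminator can do no better than output 1/2 everywhere.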

Variational Autoencoders (VAEs):
An encoder compresses real samples into a latent distribution, and a decoder generates samples from points drawn from that distribution. VAEs are typically more stable to train than GANs and offer a smooth latent space that makes generation easier to control, though their samples are often less sharp.
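
For reference, the standard VAE training objective (the evidence lower bound, from Kingma and Welling, 2014) can be written as follows. The symbols are assumptions introduced here, not the article's notation: q_phi is the encoder, p_theta the decoder, and p(z) the latent prior, usually a standard normal.

```latex
\mathcal{L}(\theta, \phi; x)
  = \mathbb{E}_{q_{\phi}(z \mid x)}\big[\log p_{\theta}(x \mid z)\big]
  - D_{\mathrm{KL}}\big(q_{\phi}(z \mid x) \,\|\, p(z)\big)
```

The first term rewards accurate reconstruction and the second keeps the encoder's latent distribution close to the prior. Synthetic samples are then produced by drawing z from p(z) and passing it through the decoder.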

LLM-generated synthetic instruction data:
Large language models are prompted to generate new training examples, including question-answer pairs, reasoning traces, and task instructions. This is now a primary mechanism for building instruction-following datasets. The Flan Collection (Longpre et al., 2023, arXiv:2301.13688) showed that training with mixed prompt settings on carefully designed templates substantially improves instruction-following performance. The TinyStories dataset (Eldan and Li, 2023, arXiv:2305.07759) demonstrated that GPT-3.5 and GPT-4 can generate a synthetic corpus of short stories sufficient to train coherent small language models below 10 million parameters.
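
A generation pipeline of this kind usually has three parts: a prompt template, a model call, and a filter for malformed or duplicate outputs. The sketch below illustrates that shape only. `fake_llm` is a hypothetical stand-in that returns canned JSON so the example runs offline, and the template wording and the `instruction`/`response` field names are assumptions, not taken from any of the cited papers.

```python
import json

PROMPT_TEMPLATE = (
    "Write one question and its answer about the topic '{topic}'. "
    'Reply as JSON: {{"question": "...", "answer": "..."}}'
)

def fake_llm(prompt):
    """Hypothetical stand-in for a real model call; returns canned JSON."""
    topic = prompt.split("'")[1]
    return json.dumps({
        "question": f"What is {topic}?",
        "answer": f"A short explanation of {topic}.",
    })

def build_dataset(topics, generate=fake_llm):
    """Prompt the generator once per topic and keep only clean, unique pairs."""
    seen, dataset = set(), []
    for topic in topics:
        raw = generate(PROMPT_TEMPLATE.format(topic=topic))
        try:
            record = json.loads(raw)
        except json.JSONDecodeError:
            continue  # drop malformed generations
        if not isinstance(record, dict):
            continue  # drop outputs that are valid JSON but the wrong shape
        question = str(record.get("question", "")).strip()
        answer = str(record.get("answer", "")).strip()
        if not question or not answer or question in seen:
            continue  # drop empty or duplicate examples
        seen.add(question)
        dataset.append({"instruction": question, "response": answer})
    return dataset

examples = build_dataset(["SMOTE", "model collapse", "SMOTE"])
```

Passing the generator in as a parameter keeps the filtering logic testable without network access; a real pipeline would swap `fake_llm` for an actual model client and add stronger quality checks.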

Distillation via synthetic data:
A large "teacher" model generates outputs that a smaller "student" model is trained to reproduce. This is not strictly the same as knowledge distillation (which operates on soft probability distributions), but the overlap is significant.
See Knowledge Distillation for a full treatment.
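
The teacher-student flow can be illustrated with a deliberately tiny example. Everything here is a stand-in: `teacher_label` is a hypothetical keyword rule playing the role of a large model, and the student is a simple word-count classifier rather than a neural network. The point is only the data flow, in which the student never sees a human-written label.

```python
from collections import Counter, defaultdict

def teacher_label(text):
    """Hypothetical teacher; a real pipeline would call a large model here."""
    return "positive" if any(w in text for w in ("great", "good", "love")) else "negative"

def train_student(texts, labels):
    """Student: per-label word counts, a deliberately tiny model."""
    counts = defaultdict(Counter)
    for text, label in zip(texts, labels):
        counts[label].update(text.split())
    return counts

def student_predict(counts, text):
    # Score each label by how often it has seen the words in `text`.
    scores = {label: sum(c[w] for w in text.split()) for label, c in counts.items()}
    return max(scores, key=scores.get)

unlabeled = ["great phone love it", "good battery", "bad screen", "terrible awful support"]
synthetic_labels = [teacher_label(t) for t in unlabeled]  # teacher produces the labels
student = train_student(unlabeled, synthetic_labels)      # student learns from them
prediction = student_predict(student, "awful screen")
```

Because the student trains on hard labels rather than the teacher's probability distribution, this corresponds to the synthetic-data variant described above, not to classical soft-target distillation.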

Persona-driven synthesis at scale:
Ge et al. (2024, arXiv:2406.20094) introduced Persona Hub, a collection of one billion diverse personas automatically curated from web data. By conditioning generation on diverse personas, synthetic data inherits greater diversity in vocabulary, style, and subject matter.
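
Conditioning on personas amounts to crossing a set of persona descriptions with a set of tasks before prompting the model. The sketch below shows only that prompt-construction step. The personas, tasks, and prompt wording are invented for illustration and are not drawn from Persona Hub.

```python
import itertools

# Hypothetical personas and tasks, invented for this example.
personas = [
    "a pediatric nurse on night shifts",
    "a high-school physics teacher",
    "a long-haul truck driver",
]
tasks = ["Write a math word problem", "Ask a question about sleep"]

def persona_prompts(personas, tasks):
    """Cross every persona with every task to get varied generation prompts."""
    return [
        f"{task} that would naturally come from {persona}."
        for persona, task in itertools.product(personas, tasks)
    ]

prompts = persona_prompts(personas, tasks)
```

The number of distinct prompts grows as the product of the two lists, which is why a large persona collection can diversify even a small set of task templates.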

Data augmentation:
Augmentation is the least expensive form of synthetic data. Existing real samples are transformed (random crop, rotation, translation, color jitter for images; back-translation, synonym substitution for text) to produce new training examples. Augmentation does not generate content from scratch but multiplies the effective dataset size.
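
Three of the image transforms mentioned above can be sketched without any dependencies. To keep the example self-contained, images are represented as nested lists of grayscale pixel values, which is an assumption made for this illustration; real pipelines typically operate on array or tensor types through an image library.

```python
import random

def horizontal_flip(image):
    """Mirror each row of a 2-D grayscale image (a list of rows)."""
    return [row[::-1] for row in image]

def random_crop(image, size, rng):
    """Cut a size x size window at a random position."""
    top = rng.randrange(len(image) - size + 1)
    left = rng.randrange(len(image[0]) - size + 1)
    return [row[left:left + size] for row in image[top:top + size]]

def add_noise(image, sigma, rng):
    """Add Gaussian pixel noise, clamped to the valid 0-255 range."""
    return [
        [min(255, max(0, round(p + rng.gauss(0, sigma)))) for p in row]
        for row in image
    ]

rng = random.Random(0)
image = [[(r * 4 + c) * 16 for c in range(4)] for r in range(4)]  # toy 4x4 image
augmented = [
    horizontal_flip(image),
    random_crop(image, 3, rng),
    add_noise(image, 5.0, rng),
]
```

Each output is derived entirely from the one input image, which is why augmentation enlarges the dataset without adding new underlying content.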

04. Key Terms / Methods

  • SMOTE. Synthetic Minority Over-sampling Technique. Creates new samples by linear interpolation between existing minority-class examples in feature space.
  • Differential privacy (for synthetic generation). Adding calibrated noise while training or querying the generator so that its outputs change very little whether or not any single individual's record is in the training set. This bounds what the synthetic data can reveal about that individual.
  • De-identification. Removing or replacing direct identifiers (names, dates, IDs) in real data. Weaker than full synthesis because each output row still corresponds to a real person and may be re-identified from the remaining attributes.
  • Instruction tuning data. Prompts and completions, human-written or synthetic, used to fine-tune a pre-trained LLM to follow instructions. LIMA (Zhou et al., 2023, arXiv:2305.11206) showed that only 1,000 carefully curated examples could produce strong instruction-following behavior, suggesting quality matters more than volume.
  • Distribution drift. When synthetic data has a different statistical distribution than real deployment data, the model trained on it will generalize poorly in production.
  • Model collapse. A feedback loop in which a model is trained on synthetic outputs from a previous model generation, causing rare patterns from the original distribution to vanish over iterations.
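
As an illustration of the differential privacy entry above, one simple approach for categorical data is to add Laplace noise to a histogram of the real records and then sample synthetic records from the noisy histogram. The sketch below makes several assumptions that are not from this article: neighbouring datasets differ by adding or removing one record (so each count has sensitivity 1), the category domain is fixed in advance, and the function names and toy data are hypothetical. It is not hardened against floating-point attacks and should not be used as a production mechanism.

```python
import random
from collections import Counter

def laplace_noise(scale, rng):
    # The difference of two i.i.d. exponentials is Laplace-distributed.
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)

def dp_synthetic_sample(records, domain, epsilon, n_out, seed=0):
    """Sample synthetic categorical records from a noisy histogram.

    Assumes add/remove-one-record neighbours, so Laplace noise of scale
    1/epsilon is applied to each count. `domain` must be fixed in advance,
    not derived from the data.
    """
    rng = random.Random(seed)
    counts = Counter(records)
    noisy = [max(0.0, counts[v] + laplace_noise(1 / epsilon, rng)) for v in domain]
    if sum(noisy) == 0:
        noisy = [1.0] * len(domain)  # fall back to uniform if every count clips to zero
    return rng.choices(domain, weights=noisy, k=n_out)

# Toy "real" data: one categorical attribute per record (hypothetical values).
real = ["A"] * 80 + ["B"] * 15 + ["O"] * 5
synthetic = dp_synthetic_sample(real, domain=["A", "B", "AB", "O"], epsilon=1.0, n_out=100)
```

Clipping negative counts and sampling from the result are post-processing steps, so they do not weaken the guarantee provided by the noisy histogram. Smaller values of `epsilon` add more noise, giving stronger privacy and lower fidelity.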

05. Examples

  • Gboard (Google Keyboard). Synthetic next-word prediction data generated on-device supplements the keyboard's language model without exposing user keystrokes.
  • Medical imaging. Hospitals use GANs to synthesize labeled radiology images for rare pathologies, supplementing real annotated scans.
  • Autonomous vehicles. Waymo and NVIDIA generate billions of synthetic driving miles in simulation to cover edge cases (ice, fog, pedestrian occlusion) that rarely occur in recorded real-world drives.
  • phi-1 (Microsoft Research). A 1.3B-parameter code model trained on a mix of about 6B tokens of filtered web data plus roughly 1B tokens of "textbook-quality" synthetic code examples generated by GPT-3.5, outperforming much larger models on coding benchmarks.
  • Financial fraud detection. Banks synthesize minority-class fraud transactions to train classifiers without sharing actual transaction records across institutions.

06. Common Pitfalls / Misconceptions

Synthetic data is not automatically private:
A GAN or LLM trained on private data can memorize and reproduce individual records from the training set. Differential privacy guarantees during synthesis are required to provide formal privacy protection.
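
A basic sanity check for memorization is to look for synthetic rows that exactly reproduce a training row. The sketch below is a hypothetical example with invented records. It catches only verbatim copies, so passing it says nothing about near-duplicates or formal privacy.

```python
def copied_records(synthetic, training):
    """Return synthetic rows that exactly match a training row."""
    training_set = set(training)
    return [row for row in synthetic if row in training_set]

# Invented records: (sex, age, postal code).
training = [("F", 34, "10115"), ("M", 58, "80331"), ("F", 41, "20095")]
synthetic = [("F", 36, "10117"), ("M", 58, "80331"), ("M", 29, "50667")]
leaks = copied_records(synthetic, training)
```

Here the second synthetic row is an exact copy of a training record and would be flagged for removal or investigation.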

Model collapse is a real and accumulating risk:
Shumailov et al. (2023, arXiv:2305.17493) demonstrated theoretically and empirically that training models recursively on their own generated outputs causes tails of the original distribution to disappear. Over multiple generations, the model converges to an impoverished representation of the true distribution. This is not a hypothetical concern: as LLM-generated text increasingly saturates the web, future pre-training crawls will contain ever-larger fractions of synthetic text. Mitigation requires preserving a fraction of verified human-generated data in each training run.
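
The loss of distribution tails can be reproduced in a toy setting. The simulation below is an illustration only and is not the experimental setup of Shumailov et al.: the "model" is just an empirical frequency table over 50 token types, and each generation is trained solely on samples from the previous one. The vocabulary size, sample size, and number of generations are arbitrary assumptions.

```python
import random
from collections import Counter

def next_generation(samples, rng):
    """'Train' on samples by estimating token frequencies, then resample
    a new dataset of the same size from that estimate."""
    counts = Counter(samples)
    tokens = list(counts)
    weights = [counts[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=len(samples))

rng = random.Random(0)
# Generation 0: a long-tailed "real" distribution over 50 token types.
vocab = list(range(50))
real_weights = [1 / (rank + 1) for rank in vocab]
data = rng.choices(vocab, weights=real_weights, k=200)

# Track how many distinct token types survive in each generation.
support_sizes = [len(set(data))]
for _ in range(30):
    data = next_generation(data, rng)
    support_sizes.append(len(set(data)))
```

The number of distinct token types can never increase from one generation to the next, because a token that fails to appear in one sample has zero estimated probability afterwards. Rare tokens are the most likely to be missed, so the tail erodes first. Mixing a fixed share of the original data into each generation's training set is one way to let lost tokens reappear.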

Distribution drift goes unnoticed until deployment:
Synthetic data can be internally consistent but systematically offset from real deployment conditions. For example, a model trained on clean, grammatical synthetic medical notes may perform worse on actual clinical notes full of abbreviations, typos, and regional terminology.
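
One way to catch such an offset before deployment is to compare a feature's distribution in the synthetic set against a sample of real data. The sketch below computes the two-sample Kolmogorov-Smirnov statistic using only the standard library. The feature (note length in tokens) and the simulated values are assumptions chosen for illustration.

```python
import bisect
import random

def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap between
    the two empirical CDFs (0 = identical, 1 = fully separated)."""
    a, b = sorted(sample_a), sorted(sample_b)
    gap = 0.0
    for x in a + b:
        cdf_a = bisect.bisect_right(a, x) / len(a)
        cdf_b = bisect.bisect_right(b, x) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap

rng = random.Random(0)
# Hypothetical feature: note length in tokens, simulated for both sources.
synthetic_lengths = [rng.gauss(120, 15) for _ in range(500)]
real_lengths = [rng.gauss(90, 30) for _ in range(500)]
drift = ks_statistic(synthetic_lengths, real_lengths)
```

A large statistic on any monitored feature is a signal to inspect the generator or to blend in more real data. The threshold for "large" depends on sample size and on how sensitive the downstream model is to that feature.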

Augmentation is not the same as generation:
Cropping and flipping images does not introduce genuinely new information into the training set. Heavy reliance on augmentation can create an illusion of large datasets while the effective diversity remains low.