03. How It Works
Simulation:
Physics-based simulators produce labeled training data for robotics, autonomous vehicles, and game AI. Unreal Engine, Isaac Sim, and similar tools render photorealistic scenes with automatically generated ground-truth annotations (bounding boxes, depth maps, semantic segmentation).
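The key property of simulation is that labels come for free, because the simulator already knows where every object is. The toy sketch below illustrates this with a hypothetical `render_scene` function that draws one rectangle on a small grid and emits the matching bounding box and segmentation mask. It is a stand-in for the idea only and does not reflect the API of Unreal Engine or Isaac Sim.

```python
import random

def render_scene(width=16, height=16, rng=None):
    """Toy 'simulator': place one rectangle and return image plus labels."""
    if rng is None:
        rng = random.Random()
    w = rng.randint(2, width // 2)
    h = rng.randint(2, height // 2)
    x0 = rng.randint(0, width - w)
    y0 = rng.randint(0, height - h)

    # Semantic mask: 1 on the object, 0 on the background.
    mask = [
        [1 if x0 <= x < x0 + w and y0 <= y < y0 + h else 0 for x in range(width)]
        for y in range(height)
    ]
    # Grayscale "render": a real simulator would produce photorealistic pixels.
    image = [[255 * v for v in row] for row in mask]
    # Inclusive (xmin, ymin, xmax, ymax), known exactly because we placed the object.
    bbox = (x0, y0, x0 + w - 1, y0 + h - 1)
    return {"image": image, "bbox": bbox, "segmentation": mask}

# One seed per scene keeps the dataset reproducible.
dataset = [render_scene(rng=random.Random(seed)) for seed in range(100)]
```

Every sample carries annotations that are exact by construction, which is what makes simulated data attractive when manual labeling is slow or expensive.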
Generative Adversarial Networks (GANs):
A generator network produces candidate samples and a discriminator network tries to distinguish them from real data. The two networks train adversarially until the generator reliably fools the discriminator. GANs are widely used for image synthesis and tabular data generation.
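The adversarial setup can be made concrete by writing out the two losses. The sketch below assumes the discriminator outputs a probability that a sample is real, and it uses the common non-saturating generator loss; both function names are illustrative and not taken from any particular library.

```python
import math

EPS = 1e-12  # guards against log(0)

def _mean(values):
    return sum(values) / len(values)

def discriminator_loss(d_real, d_fake):
    """d_real / d_fake: discriminator probabilities on real and generated samples."""
    real_term = _mean([math.log(max(p, EPS)) for p in d_real])
    fake_term = _mean([math.log(max(1.0 - p, EPS)) for p in d_fake])
    # The discriminator wants D(x) near 1 and D(G(z)) near 0.
    return -(real_term + fake_term)

def generator_loss(d_fake):
    """Non-saturating form: the generator wants D(G(z)) near 1."""
    return -_mean([math.log(max(p, EPS)) for p in d_fake])
```

When the discriminator can no longer tell the two sources apart it outputs 0.5 everywhere, and `discriminator_loss` settles at 2 ln 2 (about 1.386).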
Variational Autoencoders (VAEs):
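An encoder maps each real sample to a distribution over a latent space, and a decoder reconstructs samples from points drawn from that distribution. Because the latent space is regularized toward a simple prior, VAEs tend to give smoother, more controllable outputs than GANs on structured data, typically at some cost in sample sharpness.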
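Two pieces make the latent distribution trainable and well behaved: sampling via the reparameterization trick, and a KL penalty that pulls each encoded distribution toward a standard normal prior. The sketch below assumes a diagonal Gaussian latent parameterized by a mean and a log-variance per dimension; the function names are illustrative.

```python
import math
import random

def sample_latent(mu, log_var, rng):
    """Reparameterization: z = mu + sigma * eps with eps ~ N(0, 1)."""
    return [
        m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
        for m, lv in zip(mu, log_var)
    ]

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, sigma^2) || N(0, 1) ), summed over latent dimensions."""
    return 0.5 * sum(
        math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, log_var)
    )

rng = random.Random(0)
z = sample_latent([0.0, 1.0], [0.0, -2.0], rng)
penalty = kl_to_standard_normal([0.0, 1.0], [0.0, -2.0])
```

New synthetic samples are then obtained by drawing `z` from the prior and passing it through the decoder, which is omitted here.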
LLM-generated synthetic instruction data:
Large language models are prompted to generate new training examples, including question-answer pairs, reasoning traces, and task instructions. This is now a primary mechanism for building instruction-following datasets. The Flan Collection (Longpre et al., 2023, arXiv:2301.13688) showed that instruction tuning on a mix of prompt settings (zero-shot, few-shot, and chain-of-thought) built from carefully designed templates improves instruction-following performance across those settings. The TinyStories dataset (Eldan and Li, 2023, arXiv:2305.07759) demonstrated that a synthetic corpus of short stories generated by GPT-3.5 and GPT-4 is sufficient to train language models with fewer than 10 million parameters that still produce coherent English text.
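A generation pipeline of this kind usually amounts to a prompt template, a model call, and a filter for malformed outputs. The sketch below assumes the model is exposed as a plain callable from prompt string to response string; `complete`, `fake_llm`, and the template wording are all hypothetical placeholders, not the method of either paper cited above.

```python
import json

PROMPT_TEMPLATE = (
    "Write one question about {topic} and a correct answer.\n"
    'Respond as JSON: {{"question": "...", "answer": "..."}}'
)

def generate_examples(topics, complete):
    """`complete` is any callable str -> str that wraps an LLM (assumed)."""
    examples = []
    for topic in topics:
        raw = complete(PROMPT_TEMPLATE.format(topic=topic))
        try:
            record = json.loads(raw)
        except json.JSONDecodeError:
            continue  # drop generations that are not valid JSON
        # Keep only records that have both required fields filled in.
        if isinstance(record, dict) and record.get("question") and record.get("answer"):
            examples.append({"topic": topic, **record})
    return examples

def fake_llm(prompt):
    # Offline stand-in so the sketch runs; swap in a real model call.
    return json.dumps({"question": "What is 2 + 2?", "answer": "4"})

examples = generate_examples(["arithmetic", "fractions"], fake_llm)
```

The validation step matters in practice: a share of generations is typically malformed or off-task, and filtering them out is cheaper than training on them.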
Distillation via synthetic data:
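A large "teacher" model generates outputs that a smaller "student" model is trained to reproduce. The student learns from the teacher's generated text rather than from its output probabilities, so this is not strictly the same as knowledge distillation, which classically trains on the teacher's soft probability distributions. The two approaches share the same goal of transferring a large model's behavior to a smaller one.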
See Knowledge Distillation for a full treatment.
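The difference between the two training signals can be seen in the loss each one uses. In the sketch below, `hard_label_loss` is cross-entropy against the single output the teacher emitted (distillation via synthetic data), while `soft_label_loss` is a KL divergence between temperature-softened teacher and student distributions (classic knowledge distillation). The function names and the temperature default are illustrative assumptions.

```python
import math

def softmax(logits, temperature=1.0):
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def hard_label_loss(student_logits, teacher_choice):
    """Cross-entropy against the index of the output the teacher produced."""
    return -math.log(softmax(student_logits)[teacher_choice])

def soft_label_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) over temperature-softened distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

teacher_logits = [2.0, 0.5, -1.0]
student_logits = [0.0, 0.0, 0.0]
hard = hard_label_loss(student_logits, teacher_choice=0)
soft = soft_label_loss(student_logits, teacher_logits)
```

The hard-label version needs only the teacher's text, so it works with models reachable only through a generation API; the soft-label version requires access to the teacher's logits.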
Persona-driven synthesis at scale:
Ge et al. (2024, arXiv:2406.20094) introduced Persona Hub, a collection of one billion diverse personas automatically curated from web data. Conditioning an LLM's generation on these personas gives the resulting synthetic data greater diversity in vocabulary, style, and subject matter.
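In its simplest form, persona conditioning means prepending a persona description to an otherwise fixed task prompt. The sketch below is a minimal illustration; the personas, the prompt wording, and the `persona_prompts` helper are made up for this example and are not drawn from Persona Hub.

```python
def persona_prompts(personas, task):
    """Build one prompt per persona for the same underlying task."""
    return [
        f"Adopt the perspective of {persona}. {task}"
        for persona in personas
    ]

personas = [
    "a marine biologist studying coral reefs",
    "a high-school chess coach",
    "a long-haul truck driver",
]
prompts = persona_prompts(personas, "Write a word problem about ratios.")
```

A single task thus fans out into as many distinct prompts as there are personas, each of which would be sent to the generating model.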
Data augmentation:
Data augmentation is the least expensive form of synthetic data. Existing real samples are transformed (for images: random crop, rotation, translation, color jitter; for text: back-translation, synonym substitution) to produce new training examples. Augmentation does not generate content from scratch but multiplies the effective dataset size.
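As a minimal sketch of image augmentation, the code below applies a random crop and an optional horizontal flip to an image stored as a nested list of pixel values. The helper names are illustrative; production pipelines would normally rely on an image library instead.

```python
import random

def horizontal_flip(image):
    """Mirror each row left to right."""
    return [row[::-1] for row in image]

def random_crop(image, size, rng):
    """Cut a size x size window at a random position."""
    height, width = len(image), len(image[0])
    top = rng.randint(0, height - size)
    left = rng.randint(0, width - size)
    return [row[left:left + size] for row in image[top:top + size]]

def augment(image, n, size, seed=0):
    """Produce n augmented views of one source image."""
    rng = random.Random(seed)
    views = []
    for _ in range(n):
        view = random_crop(image, size, rng)
        if rng.random() < 0.5:  # flip about half of the views
            view = horizontal_flip(view)
        views.append(view)
    return views

image = [[r * 4 + c for c in range(4)] for r in range(4)]
views = augment(image, n=5, size=3)
```

One 4x4 source image yields five 3x3 training views here, all derived from real pixels, which is the sense in which augmentation multiplies the dataset without creating new content.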