03. How It Works: The Model Families
Generative models fall into several distinct technical families, each with different strengths.
Autoregressive transformers (LLMs and autoregressive image models)
These models predict the next token in a sequence, conditioned on all previous tokens. For text, a token is roughly a word fragment. The transformer architecture, introduced by Google researchers Vaswani et al. in 2017 in "Attention Is All You Need" (arXiv:1706.03762), replaced earlier recurrent networks. Because it processes all positions in a sequence in parallel during training, it made training on vastly more data practical. Generation is still sequential: each new token is produced only after the tokens before it. GPT-series models, LLaMA, Gemini, and Claude are all autoregressive transformers. The same principle applies to autoregressive image models such as the first version of DALL-E and other systems built on VQ-VAE-style image tokenizers, where pixels or image tokens are predicted sequentially.
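The generation loop described above can be illustrated with a deliberately tiny stand-in. The sketch below is not a transformer: it assumes a bigram count table that conditions only on the single previous token, whereas the models above condition on the whole preceding sequence. The names train_bigram and generate, and the toy corpus, are hypothetical and chosen only for illustration. What carries over is the loop itself: sample a token, append it, and feed the longer sequence back in.

```python
import random
from collections import Counter, defaultdict

def train_bigram(tokens):
    """Count how often each token follows each other token."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts

def generate(counts, start, max_new_tokens, rng):
    """Autoregressive loop: sample the next token, append it, repeat."""
    sequence = [start]
    for _ in range(max_new_tokens):
        followers = counts.get(sequence[-1])
        if not followers:  # no known continuation: stop early
            break
        candidates = list(followers.keys())
        weights = list(followers.values())
        # Sample in proportion to observed counts (a stand-in for model probabilities).
        sequence.append(rng.choices(candidates, weights=weights, k=1)[0])
    return sequence

corpus = "the cat sat on the mat and the cat ate".split()
model = train_bigram(corpus)
sample = generate(model, "the", 5, random.Random(0))
print(" ".join(sample))
```

A real language model replaces the count table with a neural network that outputs a probability for every token in its vocabulary, but the outer loop has the same shape.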
Diffusion models
Diffusion models are built around two processes. In the forward process, Gaussian noise is added to real data incrementally until the data becomes indistinguishable from random noise. The model is then trained to reverse this process, denoising the data step by step. At inference time, the model starts from pure noise and iteratively denoises it into a coherent image, audio clip, or video segment. Stable Diffusion, DALL-E 2 and 3, and Sora-style video models are diffusion-based, and Midjourney is widely understood to be as well. As of 2024, diffusion models dominate image and video generation due to their output quality and controllability. (See also: Diffusion and Image Generation.)
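The noising half of this process can be sketched in a few lines. The code below assumes one common formulation (a DDPM-style linear noise schedule, in which a noised sample can be drawn directly from the clean one at any step); the schedule values, the function names alpha_bar_schedule and add_noise, and the four-number "data" vector are illustrative assumptions, not details taken from any of the products named above. The learned denoising network is omitted entirely.

```python
import math
import random

def alpha_bar_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for a linear beta schedule.

    Assumes num_steps >= 2. The result is the fraction of signal variance
    that survives after each step.
    """
    alpha_bars = []
    running = 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / (num_steps - 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def add_noise(x0, alpha_bar, rng):
    """Draw a noised sample: shrink the signal and mix in Gaussian noise."""
    signal_scale = math.sqrt(alpha_bar)
    noise_scale = math.sqrt(1.0 - alpha_bar)
    return [signal_scale * x + noise_scale * rng.gauss(0.0, 1.0) for x in x0]

alpha_bars = alpha_bar_schedule(1000)
x0 = [1.0, -1.0, 0.5, 0.0]           # hypothetical clean data
rng = random.Random(0)
x_early = add_noise(x0, alpha_bars[10], rng)   # still close to x0
x_late = add_noise(x0, alpha_bars[-1], rng)    # almost pure noise
```

Under this schedule the surviving signal fraction falls from nearly 1 at the first step to nearly 0 at the last, which is why sampling can begin from pure noise. Training then amounts to showing the network noised samples at random steps and asking it to recover what was added.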
Generative Adversarial Networks (GANs)
Introduced by Ian Goodfellow and colleagues in 2014, GANs train two networks against each other: a generator that produces fake samples and a discriminator that tries to distinguish fake from real. The adversarial dynamic drives the generator toward increasingly realistic outputs. GANs produced the first photorealistic AI portraits and deepfakes, and remained dominant in image synthesis through roughly 2021, when diffusion models surpassed them on most image-quality benchmarks. GANs remain useful for real-time generation tasks where diffusion inference latency is prohibitive: a GAN produces a sample in a single pass through the generator, while a diffusion model typically needs many denoising steps.
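The opposing objectives can be made concrete with the two loss functions alone. The sketch below assumes the discriminator outputs a probability that a sample is real, and it uses the non-saturating generator loss, a common variant rather than the only option. The networks themselves are left out; the probability lists are hypothetical discriminator outputs chosen to show how each loss responds.

```python
import math

def discriminator_loss(d_real, d_fake):
    """Binary cross-entropy: push D(real) toward 1 and D(fake) toward 0."""
    real_term = -sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = -sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term

def generator_loss(d_fake):
    """Non-saturating generator loss: push D(fake) toward 1."""
    return -sum(math.log(p) for p in d_fake) / len(d_fake)

# Hypothetical discriminator outputs (probability that each sample is real).
confident_d = {"real": [0.9, 0.8], "fake": [0.1, 0.2]}   # discriminator is winning
fooled_d = {"real": [0.6, 0.5], "fake": [0.8, 0.9]}      # generator is winning

d_loss_confident = discriminator_loss(confident_d["real"], confident_d["fake"])
d_loss_fooled = discriminator_loss(fooled_d["real"], fooled_d["fake"])
g_loss_confident = generator_loss(confident_d["fake"])
g_loss_fooled = generator_loss(fooled_d["fake"])
```

When the discriminator is confident and correct, its loss is low and the generator's is high; when it is fooled, the situation reverses. Training alternates gradient steps on the two losses, which is the source of both the realism and the well-known instability of GAN training.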
Variational Autoencoders (VAEs)
VAEs, introduced by Kingma and Welling in 2013, were among the first deep generative models to scale to complex data. An encoder maps input data to a probability distribution over a latent space; a latent vector is sampled from that distribution, and a decoder maps it back to data space to reconstruct (and vary) the input. VAEs supported early work on image and speech synthesis and helped popularize the learned latent-space design that many modern architectures build on. They are often used today as a component inside larger systems rather than as standalone generators; latent diffusion models such as Stable Diffusion, for example, use a VAE-style autoencoder to compress images before diffusion is applied.
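Two small computations sit at the center of a VAE: drawing a latent vector from the encoder's distribution, and measuring how far that distribution has drifted from a standard normal prior. The sketch below assumes a diagonal Gaussian latent distribution parameterized by a mean and a log-variance per dimension. The encoder and decoder networks are omitted, and the values of mu and log_var are hypothetical stand-ins for encoder outputs.

```python
import math
import random

def sample_latent(mu, log_var, rng):
    """Reparameterization: z = mu + sigma * eps, with eps drawn from N(0, 1)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]

def kl_to_standard_normal(mu, log_var):
    """KL divergence from N(mu, sigma^2) to N(0, 1), summed over dimensions."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv
                     for m, lv in zip(mu, log_var))

# Hypothetical encoder outputs for one input, with a 2-dimensional latent space.
mu = [0.5, -1.0]
log_var = [-0.2, 0.1]

z = sample_latent(mu, log_var, random.Random(0))   # would be passed to the decoder
kl = kl_to_standard_normal(mu, log_var)            # regularization term in the loss
```

Writing the sample as the mean plus scaled noise keeps the randomness outside the parameters, so gradients can flow back through mu and log_var. The training loss adds this KL term to a reconstruction error computed from the decoder's output.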
Flow-based models
Normalizing flows learn an invertible mapping between data and a simple distribution (typically Gaussian). Because the mapping is invertible and its Jacobian determinant is cheap to compute, exact likelihoods can be obtained through the change-of-variables formula, which is not possible with GANs. Flow models are less common in end-user products but are used in specialized scientific applications such as molecular generation. Glow (OpenAI, 2018) is a well-known example.
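The exact-likelihood property is easiest to see in one dimension. The sketch below assumes the simplest possible flow, a single affine map x = scale * z + shift with fixed, hand-picked parameters; real flows such as Glow stack many learned invertible layers over high-dimensional data. All function names here are hypothetical.

```python
import math

def standard_normal_logpdf(z):
    """Log density of N(0, 1) at z."""
    return -0.5 * (z * z + math.log(2.0 * math.pi))

def flow_forward(z, scale, shift):
    """Map a base sample z to data space."""
    return scale * z + shift

def flow_inverse(x, scale, shift):
    """Map a data point x back to the base space (requires scale != 0)."""
    return (x - shift) / scale

def flow_log_likelihood(x, scale, shift):
    """Change of variables: log p(x) = log p_base(z) - log|dx/dz|."""
    z = flow_inverse(x, scale, shift)
    return standard_normal_logpdf(z) - math.log(abs(scale))

x = flow_forward(0.3, scale=2.0, shift=1.0)
log_px = flow_log_likelihood(x, scale=2.0, shift=1.0)
```

For this affine map the result agrees with the log density of a Gaussian with mean 1 and standard deviation 2, which is a useful sanity check. In a learned flow the log|dx/dz| term becomes the log-determinant of each layer's Jacobian, and architectures are designed so that term stays cheap to evaluate.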