03. How It Works
Diffusion: the core idea
A diffusion model is trained on a dataset of real images. During training, the model learns to reverse a noise-adding process. The forward process takes a real image and adds Gaussian noise in small increments over many steps, until the image becomes pure random noise. The model is trained to predict and undo each noise increment.
Once trained, generation is the reverse: start with pure random noise, run the denoising process step by step, and a coherent image emerges.
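The forward (noise-adding) process described above can be sketched in a few lines of NumPy. This is a minimal illustration, assuming a linear noise schedule with 1000 steps; the function names (make_alpha_bar, add_noise, recover_x0) and the schedule values are choices made for this sketch, not something fixed by the text. A real model would supply the noise estimate; here the true noise stands in for a perfect prediction.

```python
import numpy as np

def make_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Fraction of the original signal variance left after each step (assumed linear schedule)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def add_noise(x0, t, alpha_bar, rng):
    """Jump directly to step t of the forward process: scaled image plus scaled Gaussian noise."""
    noise = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * noise
    return xt, noise

def recover_x0(xt, t, alpha_bar, predicted_noise):
    """Undo the noising at step t, given an estimate of the noise that was added."""
    return (xt - np.sqrt(1.0 - alpha_bar[t]) * predicted_noise) / np.sqrt(alpha_bar[t])

rng = np.random.default_rng(0)
alpha_bar = make_alpha_bar()
x0 = rng.uniform(-1.0, 1.0, size=(3, 8, 8))   # stand-in for a tiny RGB image
xt, noise = add_noise(x0, 500, alpha_bar, rng)

# A trained network would predict `noise` from `xt`; passing the true noise
# shows that an accurate prediction is enough to recover the clean image.
x0_hat = recover_x0(xt, 500, alpha_bar, noise)
```

With the assumed schedule, alpha_bar falls from just under 1 at the first step to nearly 0 at the last, which is the "image becomes pure random noise" end state.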
Latent diffusion (how Stable Diffusion and Flux work)
Running diffusion directly on full-resolution pixel arrays is computationally expensive. Latent diffusion models (LDMs) solve this by compressing images into a smaller latent space first.
The pipeline has three components:
- VAE (Variational Autoencoder): Encodes a 512x512 RGB image (shape: 3, 512, 512) into a compressed latent representation (shape: 4, 64, 64). The decoder reverses this to produce the final image.
- U-Net (or Diffusion Transformer): The denoising network. It receives the noisy latent, a timestep embedding indicating how noisy the current state is, and a conditioning embedding from the text prompt. It predicts the noise to remove. Early LDMs used a U-Net with ResNet blocks and cross-attention. Newer models (Flux, Stable Diffusion 3) use Diffusion Transformers (DiTs), replacing the U-Net with a transformer architecture that scales better.
- Text encoder (CLIP or T5): The text prompt is tokenized and encoded into a dense embedding. Cross-attention in the U-Net/DiT allows the denoising network to "steer" toward image content matching the prompt.
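To make the shapes and the "steering" mechanism concrete, here is a rough NumPy sketch of the size reduction from pixels to latents and of a single cross-attention step. It is an illustration under assumptions: the text embedding shape (77 tokens of width 768), the attention width of 16, and the random projection matrices are placeholders chosen for this sketch, and real denoisers attend over much wider feature maps across many layers and heads.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(latent_tokens, text_tokens, w_q, w_k, w_v):
    """Each latent position queries the text tokens and mixes in their values."""
    q = latent_tokens @ w_q                      # (n_latent, d)
    k = text_tokens @ w_k                        # (n_text, d)
    v = text_tokens @ w_v                        # (n_text, d)
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return weights @ v, weights

rng = np.random.default_rng(0)

# Size reduction from the shapes given in the list above.
pixel_values = 3 * 512 * 512                     # 786432 numbers per image
latent_values = 4 * 64 * 64                      # 16384 numbers per latent
compression = pixel_values / latent_values       # 48.0

latent = rng.standard_normal((4, 64, 64))
latent_tokens = latent.reshape(4, -1).T          # one token per spatial position: (4096, 4)
text_tokens = rng.standard_normal((77, 768))     # assumed text-encoder output shape

d = 16                                           # illustrative attention width
w_q = rng.standard_normal((4, d)) * 0.1          # random stand-ins for learned weights
w_k = rng.standard_normal((768, d)) * 0.1
w_v = rng.standard_normal((768, d)) * 0.1

attended, weights = cross_attention(latent_tokens, text_tokens, w_q, w_k, w_v)
```

Each row of weights sums to 1 and describes how strongly one latent position draws on each prompt token, which is the sense in which the prompt guides the denoiser.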
A fourth piece, the scheduler, controls how many denoising steps are taken and how much noise is removed at each step. Fewer steps run faster but generally produce lower-quality images. Common schedulers include DDPM, DDIM, and DPM++.
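The trade-off between step count and speed can be illustrated with a simplified DDIM-style sampling loop. This is a sketch under assumptions, not how any particular library implements its schedulers: the even timestep spacing, the linear noise schedule, and the helper names (ddim_timesteps, ddim_step, sample) are choices made for illustration, and the "model" is an oracle that already knows the target so the loop can run without a trained network.

```python
import numpy as np

def ddim_timesteps(num_train_steps, num_inference_steps):
    """Pick an evenly spaced subset of training timesteps, noisiest first."""
    steps = np.linspace(0, num_train_steps - 1, num_inference_steps)
    return np.round(steps).astype(int)[::-1]

def ddim_step(xt, eps, ab_t, ab_prev):
    """Deterministic DDIM update: estimate the clean sample, then re-noise to the previous level."""
    x0_pred = (xt - np.sqrt(1.0 - ab_t) * eps) / np.sqrt(ab_t)
    return np.sqrt(ab_prev) * x0_pred + np.sqrt(1.0 - ab_prev) * eps

def sample(predict_noise, shape, alpha_bar, num_inference_steps, rng):
    """Start from pure noise and apply one denoising update per chosen timestep."""
    timesteps = ddim_timesteps(len(alpha_bar), num_inference_steps)
    x = rng.standard_normal(shape)
    for i, t in enumerate(timesteps):
        ab_prev = alpha_bar[timesteps[i + 1]] if i + 1 < len(timesteps) else 1.0
        x = ddim_step(x, predict_noise(x, t), alpha_bar[t], ab_prev)
    return x

# Assumed linear noise schedule over 1000 training steps.
alpha_bar = np.cumprod(1.0 - np.linspace(1e-4, 0.02, 1000))

rng = np.random.default_rng(0)
target = rng.uniform(-1.0, 1.0, size=(4, 8, 8))

def oracle_noise(x, t):
    """Stand-in for a trained denoiser: returns the exact noise separating x from the target."""
    return (x - np.sqrt(alpha_bar[t]) * target) / np.sqrt(1.0 - alpha_bar[t])

schedule = ddim_timesteps(1000, 5)
result = sample(oracle_noise, target.shape, alpha_bar, num_inference_steps=5, rng=rng)
```

With a perfect noise predictor even five steps land on the target. A real network's predictions are imperfect, which is why very low step counts tend to cost quality.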
From U-Net to Diffusion Transformer
The architectural history matters for understanding current models. Early Stable Diffusion versions (1.x, 2.x) used U-Nets. Over 2023-2024, the field shifted toward DiTs, which split the image (or its latent) into patches, treat each patch as a token, and process the tokens with standard transformer self-attention. Stable Diffusion 3, Flux, and most 2025-2026 models use DiT architectures. They scale more predictably with compute and produce better coherence at high resolution.
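The "patches as tokens" step is essentially a reshape. The sketch below assumes a latent of shape (4, 64, 64) and a patch size of 2; both values and the function names patchify and unpatchify are illustrative, and real DiTs additionally apply a learned linear projection and positional information to each token.

```python
import numpy as np

def patchify(latent, patch_size):
    """Split a (C, H, W) array into non-overlapping patches, one flat token per patch."""
    c, h, w = latent.shape
    if h % patch_size or w % patch_size:
        raise ValueError("height and width must be divisible by patch_size")
    gh, gw = h // patch_size, w // patch_size
    x = latent.reshape(c, gh, patch_size, gw, patch_size)
    x = x.transpose(1, 3, 0, 2, 4)               # (gh, gw, C, p, p)
    return x.reshape(gh * gw, c * patch_size * patch_size)

def unpatchify(tokens, channels, height, width, patch_size):
    """Inverse of patchify: reassemble tokens into a (C, H, W) array."""
    gh, gw = height // patch_size, width // patch_size
    x = tokens.reshape(gh, gw, channels, patch_size, patch_size)
    x = x.transpose(2, 0, 3, 1, 4)               # (C, gh, p, gw, p)
    return x.reshape(channels, height, width)

rng = np.random.default_rng(0)
latent = rng.standard_normal((4, 64, 64))

tokens = patchify(latent, patch_size=2)          # 32 x 32 patches, each 4 * 2 * 2 values
restored = unpatchify(tokens, 4, 64, 64, patch_size=2)
```

Here the transformer would see a sequence of 1024 tokens of width 16, and self-attention lets every patch attend to every other patch in a single layer.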
Autoregressive image generation
An alternative approach, gaining strength in 2026: treat image generation like text generation. Encode the image into discrete tokens using a VQ-VAE or similar tokenizer, then train a transformer to predict the next image token autoregressively, conditioned on the text prompt.
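A toy version of the two stages, tokenizing with a codebook and then sampling tokens one at a time, might look like the following. Everything here is an assumption made for illustration: the codebook is random rather than learned, toy_logits is a hypothetical stand-in for a trained transformer, and prompt tokens and image tokens share one small vocabulary for simplicity.

```python
import numpy as np

def quantize(features, codebook):
    """Map each feature vector (n, d) to the index of its nearest codebook entry (k, d)."""
    d2 = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=1)

def generate_tokens(next_token_logits, prompt_tokens, num_image_tokens, rng):
    """Sample image tokens one at a time, each conditioned on everything generated so far."""
    sequence = list(prompt_tokens)
    for _ in range(num_image_tokens):
        logits = next_token_logits(sequence)
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        sequence.append(int(rng.choice(len(probs), p=probs)))
    return sequence[len(prompt_tokens):]

rng = np.random.default_rng(0)
vocab_size, dim = 32, 8
codebook = rng.standard_normal((vocab_size, dim))    # random stand-in for a learned codebook

# Tokenizing: continuous encoder features become discrete token ids.
features = rng.standard_normal((16, dim))
token_ids = quantize(features, codebook)

def toy_logits(sequence):
    """Hypothetical stand-in for a transformer: favors ids near a value derived from the context."""
    center = sum(sequence) % vocab_size
    return -0.5 * np.abs(np.arange(vocab_size) - center)

prompt_tokens = [3, 14, 7]                           # pretend these encode the text prompt
image_tokens = generate_tokens(toy_logits, prompt_tokens, num_image_tokens=16, rng=rng)

# Decoding: look up each generated id in the codebook to get features for an image decoder.
decoded = codebook[np.array(image_tokens)]
```

The sampling loop is the same one used for text generation, which is why this family of models can reuse LLM architectures and training infrastructure.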
Key differences from diffusion:
- Autoregressive models are stronger at following complex, structured prompts and retaining world knowledge during generation, because they share architecture with LLMs.
- Diffusion models have better-established tooling and community support.
- A 2B-parameter autoregressive model with beam search can surpass a 12B-parameter Flux model in prompt fidelity while using less compute.
- Hybrid models use an autoregressive backbone for global layout and a diffusion decoder for fine detail (e.g., GLM-Image).
GPT Image 2 uses an autoregressive approach under the hood, which explains its notably better text rendering and instruction following compared to diffusion-based competitors.
Video generation
The same diffusion and transformer principles extend to video. Video diffusion models denoise across both spatial and temporal dimensions, producing sequences of frames that stay visually consistent over time. Notable tools in 2026 include OpenAI Sora, Runway Gen-3, and Kling. Note that Gemini's native video understanding is a separate capability from video generation.