Diffusion and Image Generation

In Short

Modern AI image generation is dominated by diffusion models, which learn to reverse a noise-adding process to produce images from text. In 2026, the leading tools are Midjourney V8, Flux 2, GPT Image 2, and Stable Diffusion 3.5, each with different trade-offs between quality, control, and cost.

The latent diffusion pipeline starts from a text prompt and pure random noise, then repeatedly denoises a compressed latent until the decoder produces a finished image.

01. What It Is

Text-to-image generation is the ability to produce a photorealistic or stylized image from a natural-language description. You write a prompt; the model synthesizes a new image that matches it. This is not image retrieval: the model generates new pixel content rather than returning an existing image.

The dominant technique is diffusion. Alongside it, a newer class called autoregressive image generation is gaining ground. Both are covered here.

02. Why It Matters

Image generation collapses the barrier between concept and visual artifact. Designers can prototype ideas in seconds. Marketers can generate campaign assets without a photographer. Developers can test UI with realistic placeholder images. The applications extend to video production, game asset creation, medical imaging augmentation, and architectural visualization.

The flip side: the same technology enables deepfakes, synthetic misinformation, and copyright-adjacent generation that reproduces artists' styles without consent. These are live, contested issues in law and policy as of 2026.

03. How It Works

Diffusion: the core idea

A diffusion model is trained on a dataset of real images. During training, the model learns to reverse a noise-adding process. The forward process takes a real image and adds Gaussian noise in small increments over many steps, until the image becomes pure random noise. The model is trained to predict and undo each noise increment.
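
The forward process can be sketched in a few lines. This is a minimal illustration, not any particular model's training code: the linear schedule, its endpoint values, the 1000-step count, and the function names are all assumptions chosen for the example. It uses the standard closed form that jumps directly to step t instead of adding noise one increment at a time.

```python
import math
import random

def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Per-step noise variances rising linearly (an assumed schedule)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]

def cumulative_alphas(betas):
    """alpha_bar[t]: product of (1 - beta) up to step t, the signal fraction left."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def add_noise(x0, t, alpha_bars, rng):
    """x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps."""
    signal = math.sqrt(alpha_bars[t])
    sigma = math.sqrt(1.0 - alpha_bars[t])
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [signal * x + sigma * e for x, e in zip(x0, eps)]
    return xt, eps  # eps is what the network is trained to predict

betas = linear_beta_schedule(1000)
alpha_bars = cumulative_alphas(betas)
pixels = [0.5, -0.25, 1.0, 0.0]  # a toy "image" of four values
noisy, noise = add_noise(pixels, 999, alpha_bars, random.Random(0))
```

At the last step almost none of the original signal remains, which is why generation can start from pure noise.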

Once trained, generation is the reverse: start with pure random noise, run the denoising process step by step, and a coherent image emerges.
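
As a rough sketch of that reverse loop, the example below uses a deterministic DDIM-style update. Everything here is illustrative: the schedule values, the step spacing, and especially `oracle_noise`, a hypothetical stand-in that cheats by knowing the target. A real model replaces it with a trained network that has never seen the output it is about to produce.

```python
import math
import random

def make_alpha_bars(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative signal fraction for an assumed linear noise schedule."""
    alpha_bars, running = [], 1.0
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / (num_steps - 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def sample(predict_noise, alpha_bars, timesteps, size, rng):
    """Deterministic (DDIM-style) sampling from pure noise to a clean sample."""
    x = [rng.gauss(0.0, 1.0) for _ in range(size)]  # start from pure noise
    for i, t in enumerate(timesteps):
        ab_t = alpha_bars[t]
        # Signal fraction at the next, less noisy step; 1.0 means no noise left.
        ab_prev = alpha_bars[timesteps[i + 1]] if i + 1 < len(timesteps) else 1.0
        eps = predict_noise(x, t)
        # Estimate the clean sample implied by the predicted noise.
        x0_est = [(xi - math.sqrt(1.0 - ab_t) * e) / math.sqrt(ab_t)
                  for xi, e in zip(x, eps)]
        # Re-noise that estimate to the lower noise level.
        x = [math.sqrt(ab_prev) * x0i + math.sqrt(1.0 - ab_prev) * e
             for x0i, e in zip(x0_est, eps)]
    return x

alpha_bars = make_alpha_bars(1000)
target = [0.5, -0.25, 1.0, 0.0]  # hypothetical "image" known only to the oracle

def oracle_noise(x, t):
    """Stand-in for a trained network: the exact noise separating x from target."""
    ab = alpha_bars[t]
    return [(xi - math.sqrt(ab) * ti) / math.sqrt(1.0 - ab)
            for xi, ti in zip(x, target)]

timesteps = list(range(999, -1, -50))  # 20 steps: 999, 949, ..., 49
result = sample(oracle_noise, alpha_bars, timesteps, len(target), random.Random(0))
```

Because the oracle's predictions are exact, `result` lands on `target` up to floating-point error. With a learned predictor the same loop produces a new image rather than recovering a known one.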

Latent diffusion (how Stable Diffusion and Flux work)

Running diffusion directly on full-resolution pixel arrays is computationally expensive. Latent diffusion models (LDMs) solve this by compressing images into a smaller latent space first.

The pipeline has three components:

  1. VAE (Variational Autoencoder):
    Encodes a 512x512 RGB image (shape: 3, 512, 512, as channels, height, width) into a compressed latent representation (shape: 4, 64, 64), an 8x reduction along each spatial side. The decoder reverses this to produce the final image.

  2. U-Net (or Diffusion Transformer):
    The denoising network. It receives the noisy latent, a timestep embedding indicating how noisy the current state is, and a conditioning embedding from the text prompt. It predicts the noise to remove. Early LDMs used a U-Net with ResNet blocks and cross-attention. Newer models (Flux, Stable Diffusion 3) use Diffusion Transformers (DiTs), replacing the U-Net with a transformer architecture that scales better.

  3. Text encoder (CLIP or T5):
    The text prompt is tokenized and encoded into a dense embedding. Cross-attention in the U-Net/DiT allows the denoising network to "steer" toward image content matching the prompt.
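
To make the saving from the VAE step concrete, the arithmetic behind the shapes quoted above can be checked directly. The 8x downsample factor and 4 latent channels are inferred from those shapes; other models may use different values, and the helper names are made up for this sketch.

```python
def latent_shape(height, width, downsample=8, latent_channels=4):
    """Latent shape for an RGB image, assuming an 8x spatial downsample."""
    if height % downsample or width % downsample:
        raise ValueError("image sides must be divisible by the downsample factor")
    return (latent_channels, height // downsample, width // downsample)

def num_values(shape):
    """Total number of scalar values in a tensor of the given shape."""
    count = 1
    for dim in shape:
        count *= dim
    return count

pixel_shape = (3, 512, 512)
latent = latent_shape(512, 512)
ratio = num_values(pixel_shape) / num_values(latent)
print(latent, ratio)  # (4, 64, 64) 48.0
```

Under these assumptions the denoising network works on 48 times fewer values per step than it would in pixel space.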

The scheduler controls how many denoising steps are taken and how noise is scaled at each step. Fewer steps run faster but generally produce lower quality. Common schedulers include DDPM, DDIM, and DPM++.
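
One part of a scheduler's job, choosing which timesteps to visit, can be sketched as follows. This assumes a model trained with 1000 noise levels and simple even striding; both are assumptions for illustration, and real schedulers differ in how they space steps and scale noise.

```python
def select_timesteps(num_train_steps, num_inference_steps):
    """Evenly strided subset of the training timesteps, noisiest first."""
    if not 1 <= num_inference_steps <= num_train_steps:
        raise ValueError("num_inference_steps must be between 1 and num_train_steps")
    stride = num_train_steps // num_inference_steps
    return [i * stride for i in range(num_inference_steps)][::-1]

print(select_timesteps(1000, 4))  # [750, 500, 250, 0]
fifty = select_timesteps(1000, 50)
```

Asking for 50 steps instead of 4 visits more intermediate noise levels, so each denoising jump is smaller, at the cost of 50 network evaluations instead of 4.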

From U-Net to Diffusion Transformer

The architectural history matters for understanding current models. Early Stable Diffusion versions (1.x, 2.x) used U-Nets. By 2023-2024, the field shifted toward DiTs (Diffusion Transformers), which treat image patches as tokens and process them with standard transformer self-attention. Stable Diffusion 3, Flux, and most 2025-2026 models use DiT architectures. They scale more predictably with compute and produce better coherence at high resolution.
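
The "patches as tokens" idea can be illustrated with a small sketch. The patch size of 2 and the toy 4x8x8 latent are assumptions chosen for readability; a real DiT also projects each patch through a learned layer and adds positional information, both omitted here.

```python
def patchify(latent, patch_size):
    """Split a (C, H, W) nested-list latent into a list of flat patch tokens."""
    channels = len(latent)
    height, width = len(latent[0]), len(latent[0][0])
    if height % patch_size or width % patch_size:
        raise ValueError("latent sides must be divisible by patch_size")
    tokens = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            # One token gathers the same spatial patch from every channel.
            token = [latent[c][top + dy][left + dx]
                     for c in range(channels)
                     for dy in range(patch_size)
                     for dx in range(patch_size)]
            tokens.append(token)
    return tokens

# Toy latent: 4 channels of 8x8, each entry encoding its own position.
latent = [[[c * 100 + y * 8 + x for x in range(8)] for y in range(8)]
          for c in range(4)]
tokens = patchify(latent, patch_size=2)
print(len(tokens), len(tokens[0]))  # 16 16
```

By the same arithmetic, a 4x64x64 latent with a patch size of 2 yields (64 / 2) squared = 1024 tokens, which is the sequence the transformer's self-attention operates over.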

Autoregressive image generation

An alternative approach, gaining strength in 2026: treat image generation like text generation. Encode the image into discrete tokens using a VQ-VAE or similar tokenizer, then train a transformer to predict the next image token autoregressively, conditioned on the text prompt.
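
Both halves of that recipe can be sketched with toy stand-ins. The four-entry codebook and the `toy_next_token` rule below are hypothetical: a real tokenizer learns its codebook, and a real model is a trained transformer that samples from a probability distribution over tokens.

```python
def quantize(vectors, codebook):
    """Map each continuous vector to the index of its nearest codebook entry."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return [min(range(len(codebook)), key=lambda k: sq_dist(v, codebook[k]))
            for v in vectors]

def generate_tokens(next_token, prompt_tokens, num_image_tokens):
    """Greedy autoregressive loop: each new token sees everything before it."""
    sequence = list(prompt_tokens)
    for _ in range(num_image_tokens):
        sequence.append(next_token(sequence))
    return sequence[len(prompt_tokens):]

codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # hypothetical
patch_features = [[0.1, -0.2], [0.9, 0.8], [0.2, 1.1]]
print(quantize(patch_features, codebook))  # [0, 3, 2]

def toy_next_token(sequence):
    """Hypothetical stand-in for a trained transformer's prediction."""
    return (sequence[-1] + 1) % len(codebook)

image_tokens = generate_tokens(toy_next_token, prompt_tokens=[2], num_image_tokens=5)
print(image_tokens)  # [3, 0, 1, 2, 3]
```

The generated indices would then be looked up in the codebook and passed to the tokenizer's decoder to produce pixels.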

Key differences from diffusion:

  • Autoregressive models are stronger at following complex, structured prompts and retaining world knowledge during generation, because they share architecture with LLMs.
  • Diffusion models have better-established tooling and community support.
  • In reported results, a 2B autoregressive model with beam search can surpass a 12B Flux model in prompt fidelity while using less compute.
  • Hybrid models use an autoregressive backbone for global layout and a diffusion decoder for fine detail (e.g., GLM-Image).

GPT Image 2 is reported to use an autoregressive approach under the hood, which would explain its notably better text rendering and instruction following compared to diffusion-based competitors.

Video generation

The same diffusion and transformer principles extend to video. Video diffusion models denoise across both spatial and temporal dimensions, producing sequences of frames that stay visually consistent over time. Notable tools in 2026 include OpenAI Sora, Runway Gen-3, and Kling. Gemini's native video understanding, which analyzes existing video, is a separate capability from video generation.

04. Key Terms and Players

Denoising: The core operation of diffusion. The model predicts and removes noise from a latent representation at each step.

Latent space: A compressed representation of image data. Working here instead of pixel space is what makes diffusion models practical.

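CFG (Classifier-Free Guidance): A technique that makes the model follow the text prompt more strongly by extrapolating from an unconditional noise prediction toward the prompt-conditioned one. A higher guidance scale gives more prompt-faithful but sometimes over-saturated images.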
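
The usual CFG combination rule is short enough to show in full. The noise values below are made up for illustration; in practice both predictions come from the same network, run once with the prompt and once with an empty prompt.

```python
def apply_cfg(eps_uncond, eps_cond, guidance_scale):
    """eps = eps_uncond + scale * (eps_cond - eps_uncond), element by element."""
    return [u + guidance_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]

eps_uncond = [0.5, -1.0, 0.0]  # prediction without the prompt (toy values)
eps_cond = [1.0, -1.0, 0.5]    # prediction with the prompt (toy values)

print(apply_cfg(eps_uncond, eps_cond, 1.0))  # [1.0, -1.0, 0.5]
print(apply_cfg(eps_uncond, eps_cond, 7.5))  # [4.25, -1.0, 3.75]
```

A scale of 1.0 simply returns the conditioned prediction. Larger scales push further along the direction in which the prompt changed the prediction, which is where both the stronger adherence and the over-saturation come from.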

LoRA (Low-Rank Adaptation): A fine-tuning method that freezes the base model and trains small low-rank adapter matrices alongside its existing weights. The open-source community produces thousands of LoRAs for specific styles, characters, and subjects.
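
A minimal sketch of the low-rank idea, using plain lists in place of tensors: the adapter is two thin matrices, B and A, whose product is scaled and added to a frozen weight. The matrix values, the `alpha` scaling convention, and the 4096-wide layer used for the parameter count are all assumptions for illustration.

```python
def matmul(a, b):
    """Plain-Python matrix product of two nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def lora_merge(weight, lora_b, lora_a, alpha):
    """Return weight + (alpha / rank) * (B @ A); the base weight is not modified."""
    rank = len(lora_a)
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(weight, delta)]

def lora_param_count(d_out, d_in, rank):
    """Trainable values in the adapter: B is d_out x rank, A is rank x d_in."""
    return rank * (d_in + d_out)

weight = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # frozen base layer, 2 x 3
lora_b = [[1.0], [2.0]]                       # 2 x rank, with rank = 1
lora_a = [[0.5, 0.0, -0.5]]                   # rank x 3
merged = lora_merge(weight, lora_b, lora_a, alpha=2.0)
print(merged)  # [[2.0, 0.0, -1.0], [2.0, 1.0, -2.0]]

print(lora_param_count(4096, 4096, 16), 4096 * 4096)  # 131072 16777216
```

For a hypothetical 4096x4096 layer at rank 16, the adapter holds under 1% of the values in the full weight matrix, which is why LoRA files are small enough to share widely.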

Major tools in 2026:

| Tool | Strengths | Licensing | Pricing |
|---|---|---|---|
| Midjourney V8 | Aesthetic quality, atmosphere, native 2K, 5x faster than V7 | Closed, subscription | Tiered subscription; check midjourney.com/account for current pricing |
| GPT Image 2 (OpenAI) | Text rendering, complex instruction following, visual reasoning | Closed, API | Free tier; $20+/month |
| Flux 2 / Flux 1.1 Pro (Black Forest Labs) | Photorealism, prompt adherence, excellent text rendering, 4MP native | Dev: open (Apache 2.0 Klein); Pro: commercial API | Free (Klein); Pro API pricing varies by resolution, check api.bfl.ml |
| Stable Diffusion 3.5 (Stability AI) | Customizable, LoRA ecosystem, runs locally | Open-source | Free (hardware costs) |
| Ideogram 2.0 | Best-in-class text rendering in images | Closed, subscription | Freemium |
| Adobe Firefly Image 3 | Commercially safe (trained on licensed data), Adobe workflow integration | Closed | Subscription |
| Imagen 3 / 4 (Google) | Text rendering, photorealism | Closed, API via Gemini | API pricing |

05. Examples

  • Product photography: A startup shoots no physical photos. They prompt Flux 1.1 Pro for product lifestyle shots with specific lighting and backgrounds.
  • Character design: A game studio generates hundreds of NPC portraits using Midjourney V8, iterating on style with short prompts.
  • Marketing copy imagery: A content team generates custom illustrations for blog posts using GPT Image 2, which follows their detailed brand color specifications.
  • Custom model fine-tuning: A fashion brand trains a LoRA on their clothing catalog. The SD 3.5 base model + LoRA generates accurate on-model shots in any setting.
  • Video generation: A YouTuber uses Sora to create B-roll footage described in natural language, reducing filming costs.

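06. Common Pitfalls and Misconceptions

"Diffusion models retrieve images from a database."
They do not look up or stitch together stored images. They generate statistically coherent pixels from learned weights. Research has shown that models can occasionally memorize and reproduce specific training images, particularly heavily duplicated ones, but this is the exception rather than how generation normally works. Copyright cases turn in part on this distinction.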

"More denoising steps always means better quality."
Beyond a threshold (roughly 20-50 steps depending on the scheduler), additional steps yield diminishing returns and just consume more time.

"Negative prompts are how you control quality."
Negative prompts (words for the model to avoid) are a legacy technique from CFG-based diffusion. Newer models with instruction-following (GPT Image 2, Flux 2) respond better to direct positive instruction than to long negative prompt lists.

"Stable Diffusion and Flux are the same thing."
Flux is a separate model family from Black Forest Labs (founded by former Stability AI researchers). Flux 2 and SD 3.5 both use DiT architectures, but Flux 2 substantially outperforms SD 3.5 on most benchmarks.

"AI-generated images are always detectable."
Detection tools exist but have high false-positive rates and are defeated by minor post-processing. Provenance and watermarking standards (like C2PA) are being adopted but are not yet universal.

Key terms

Diffusion model
A model that learns to reverse a noise-adding process to make images from text.
Text-to-image
Producing a new image from a natural-language prompt, not retrieval.
Autoregressive image generation
A newer class that builds image content sequentially.

Tags

#image-generation #diffusion #generative-ai #text-to-image #stable-diffusion