Generative AI

The Basics 10 min read

In Short

Generative AI is a category of machine learning in which models learn the statistical structure of training data and then produce new content that resembles it. Unlike systems that classify or predict from existing data, generative models create new content: text, images, audio, video, code, and more. The field was reshaped by the transformer architecture in 2017, grew rapidly with diffusion models and large language models, and by 2024 had entered a multimodal, agentic phase that continues through 2026.

01. What It Is

Generative AI refers to deep-learning systems trained to produce new data that is statistically similar to the data they were trained on. The model learns a compressed internal representation of its training corpus, then samples from that representation to create novel output. Concretely: feed a model all of Wikipedia and it learns to write prose. Feed it millions of photographs and it learns to render plausible new scenes.

The output can be text, images, audio, video, 3D geometry, software code, or combinations of these. The model does not retrieve a stored answer. It generates one step by step (token by token for text, or through successive denoising passes for diffusion-based images) from a learned probability distribution over possible outputs.

This differs fundamentally from the traditional picture of "AI." Earlier AI systems were mostly discriminative: they told you which category something belonged to, or predicted a numerical outcome, but they did not create anything new.

02. Why It Matters

Before generative AI reached its current scale, building an AI product meant collecting labeled data, training a narrow model for one specific task, and deploying it for that task only. The model that classified customer sentiment could not write a summary, and the model that detected tumors in X-rays could not explain its reasoning.

Generative AI, especially in its foundation model form, broke that pattern. A single model trained on general data can be steered toward many tasks through prompting or light fine-tuning. This sharply reduced the cost of entering new AI use cases and brought the technology within reach of individual developers rather than only large research labs.

03. How It Works: The Model Families

There are several distinct technical families, each with different strengths.

Autoregressive transformers (LLMs and autoregressive image models):
These models predict the next token in a sequence, conditioned on all previous tokens. For text, a token is roughly a word fragment. The transformer architecture, introduced by Google researchers Vaswani et al. in 2017 in "Attention Is All You Need" (arXiv:1706.03762), replaced earlier recurrent networks by processing all positions in a sequence in parallel, enabling training on vastly more data. GPT-series models, LLaMA, Gemini, and Claude are all autoregressive transformers. The same principle applies to autoregressive image models such as the first version of DALL-E and other systems built on VQ-VAE-style discrete image codes, where pixels or image tokens are predicted sequentially.
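
The next-token loop can be sketched with a toy example. The table below is a hypothetical stand-in for a trained model: it conditions only on the previous token, whereas a real transformer conditions on the whole prefix. The names `NEXT_TOKEN_PROBS` and `generate` are illustrative, not part of any library.

```python
import random

# Hypothetical toy "model": for each token, a distribution over the next token.
# A real LLM conditions on the entire prefix; this table looks only at the
# last token to keep the sketch small.
NEXT_TOKEN_PROBS = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.5, "dog": 0.5},
    "a": {"cat": 0.5, "dog": 0.5},
    "cat": {"sleeps": 0.7, "runs": 0.3},
    "dog": {"sleeps": 0.2, "runs": 0.8},
    "sleeps": {"</s>": 1.0},
    "runs": {"</s>": 1.0},
}


def generate(rng, max_tokens=10):
    """Sample one token at a time until the end marker or the length cap."""
    tokens = ["<s>"]
    for _ in range(max_tokens):
        dist = NEXT_TOKEN_PROBS[tokens[-1]]
        choices, weights = zip(*dist.items())
        next_token = rng.choices(choices, weights=weights, k=1)[0]
        if next_token == "</s>":
            break
        tokens.append(next_token)
    return tokens[1:]  # drop the start marker


if __name__ == "__main__":
    print(" ".join(generate(random.Random(0))))
```

Each call produces a short three-word sentence, and different seeds can yield different sentences. That is the sense in which the output is generated rather than retrieved.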

Diffusion models:
Diffusion models work in two stages. In the forward process, Gaussian noise is added to real data incrementally until the data becomes indistinguishable from random noise. The model is then trained to reverse this process, denoising the data step by step. At inference time, the model starts from pure noise and iteratively denoises it into a coherent image, audio clip, or video segment. Stable Diffusion, DALL-E 2 and 3, Midjourney, and Sora-style video models are diffusion-based. As of 2024, diffusion models dominate image and video generation due to their output quality and controllability. (See also: Diffusion and Image Generation.)
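
The forward (noising) half of this process is simple enough to sketch directly. The example below assumes the DDPM-style formulation, in which a noisy sample at step t can be drawn in one shot as sqrt(a_t) * x_0 + sqrt(1 - a_t) * noise, where a_t is the cumulative product of (1 - beta). The linear schedule values and the function names are assumptions made for illustration. The learned reverse (denoising) network is not shown.

```python
import math
import random


def alpha_bar_schedule(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for a linear beta schedule (assumed values)."""
    alpha_bars = []
    running = 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / (num_steps - 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def add_noise(x0, t, alpha_bars, rng):
    """Draw x_t given x_0: sqrt(a_t) * x_0 + sqrt(1 - a_t) * Gaussian noise."""
    a = alpha_bars[t]
    return [math.sqrt(a) * v + math.sqrt(1.0 - a) * rng.gauss(0.0, 1.0) for v in x0]


if __name__ == "__main__":
    rng = random.Random(0)
    alpha_bars = alpha_bar_schedule()
    x0 = [1.0, -1.0, 0.5, 0.0]  # stand-in for pixel values
    for t in (0, 499, 999):
        # The signal weight shrinks toward zero as t grows.
        print(t, round(math.sqrt(alpha_bars[t]), 4), add_noise(x0, t, alpha_bars, rng))
```

At the first step the sample is almost entirely signal. By the last step the signal weight is close to zero, so the sample is essentially pure Gaussian noise. Generation then starts from that noise.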

Generative Adversarial Networks (GANs):
Introduced by Ian Goodfellow and colleagues in 2014, GANs train two networks against each other: a generator that produces fake samples and a discriminator that tries to distinguish fake from real. The adversarial dynamic drives the generator toward increasingly realistic outputs. GANs produced the first photorealistic AI portraits and deepfakes, and remained dominant in image synthesis through roughly 2021, when diffusion models surpassed them on most benchmarks. GANs remain useful for real-time generation tasks where diffusion inference latency is prohibitive.
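
The two-network game is usually written as a minimax objective. In the standard formulation from the original GAN paper, D(x) is the discriminator's estimated probability that x is real, G(z) is the generator's output for a noise vector z, p_data is the data distribution, and p_z is the noise prior:

```latex
\min_{G} \max_{D} \; V(D, G) =
  \mathbb{E}_{x \sim p_{\text{data}}(x)} \big[ \log D(x) \big]
  + \mathbb{E}_{z \sim p_{z}(z)} \big[ \log \big( 1 - D(G(z)) \big) \big]
```

The discriminator maximizes V by scoring real samples high and generated samples low. The generator minimizes V by producing samples the discriminator scores as real.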

Variational Autoencoders (VAEs):
VAEs, introduced by Kingma and Welling in 2013, were among the first deep generative models to scale to complex data. An encoder compresses input data into a latent probability distribution, and a decoder samples from that distribution to reconstruct (and vary) the output. VAEs enabled early image and speech synthesis and popularized the learned-latent-space encoder-decoder pattern that many later architectures build on. They are often used today as a component inside larger systems rather than as standalone generators; latent diffusion models such as Stable Diffusion, for example, use a VAE to move between pixels and a compressed latent space.
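
The training objective behind this encoder-decoder setup is the evidence lower bound (ELBO). In the usual notation, q_phi(z|x) is the encoder, p_theta(x|z) is the decoder, and p(z) is the latent prior, typically a standard Gaussian:

```latex
\log p_{\theta}(x) \;\ge\;
  \mathbb{E}_{q_{\phi}(z \mid x)} \big[ \log p_{\theta}(x \mid z) \big]
  - D_{\mathrm{KL}} \big( q_{\phi}(z \mid x) \,\|\, p(z) \big)
```

The first term rewards accurate reconstruction. The second keeps the encoder's latent distribution close to the prior, which is what makes sampling new latents from p(z) meaningful at generation time.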

Flow-based models:
Normalizing flows learn an invertible mapping between data and a simple distribution (typically Gaussian). Because the mapping is invertible, exact likelihoods can be computed, which is not possible with GANs. Flow models are less common in end-user products but are used in specialized scientific applications such as molecular generation. Glow (OpenAI, 2018) is a well-known example.
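
The exact-likelihood property follows from the change-of-variables formula. The sketch below uses the simplest possible flow, a one-dimensional affine map x = scale * z + shift with z drawn from a standard Gaussian. Real flows stack many learned invertible layers, and the function names here are illustrative only.

```python
import math


def standard_normal_logpdf(z):
    """Log-density of N(0, 1) at z."""
    return -0.5 * (z * z + math.log(2.0 * math.pi))


def affine_flow_logpdf(x, scale, shift):
    """Exact log-density of x = scale * z + shift with z ~ N(0, 1)."""
    z = (x - shift) / scale                   # invert the mapping
    log_det_inverse = -math.log(abs(scale))   # log |dz/dx|
    return standard_normal_logpdf(z) + log_det_inverse


def normal_logpdf(x, mean, std):
    """Reference: closed-form log-density of N(mean, std^2)."""
    return (-0.5 * ((x - mean) / std) ** 2
            - math.log(std)
            - 0.5 * math.log(2.0 * math.pi))


if __name__ == "__main__":
    # The flow-based density should match the known Gaussian density.
    print(affine_flow_logpdf(1.3, scale=2.0, shift=0.5))
    print(normal_logpdf(1.3, mean=0.5, std=2.0))
```

Both printed values agree, because an affine transform of a standard Gaussian is a Gaussian with mean `shift` and standard deviation `|scale|`. The log-determinant term accounts for how the mapping stretches or compresses probability mass.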

04. Modalities

Generative AI has expanded well beyond text and images:

  • Text:
    Chat, summarization, translation, question answering, document drafting. LLMs are the dominant approach.
  • Image:
    Photorealistic renders, illustrations, concept art, image editing. Diffusion models with text conditioning dominate.
  • Audio and music:
    Speech synthesis (text-to-speech), voice cloning, music generation (Suno, Udio), sound design.
  • Video:
    Short-clip generation, scene extension, video editing. Sora (OpenAI, 2024) and Veo (Google, 2024) demonstrated transformer-diffusion hybrids at cinematic quality.
  • Code:
    Autocomplete, test generation, full program synthesis. GitHub Copilot, Claude, and Gemini are all capable code generators.
  • 3D and spatial content:
    Point cloud generation, mesh synthesis, scene reconstruction. Relevant to games, AR/VR, and robotic simulation.
  • Multimodal:
    Models that accept and produce multiple modalities in a single system. GPT-4o, Gemini 1.5/2.0, and Claude 3 and later versions accept more than text as input, though the supported mix of text, image, audio, and video varies by model. (See also: Multimodal Models.)

05. Generative vs. Discriminative / Predictive AI

This distinction is foundational and worth stating precisely.

A discriminative model learns the conditional probability P(Y|X): given input X, what is the most likely label or output Y? It draws a boundary between categories. A spam filter, a sentiment classifier, a credit-scoring model, and an image recognition system (is this a cat or not?) are all discriminative. They answer questions about existing data.

A generative model learns the joint probability P(X, Y), or equivalently P(X|Y) together with P(Y), which describes how the data itself is distributed. Because it models the full data distribution, it can sample from that distribution to produce new instances of X. This is what makes generation possible.
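
The difference can be made concrete with a small joint distribution. The table below is a hypothetical P(X, Y) over one message feature and one label, with made-up numbers. From the joint, the discriminative quantity P(Y|X) follows by normalizing, and new values of X can be sampled from P(X|Y).

```python
import random

# Hypothetical joint distribution P(X, Y); the probabilities are illustrative.
JOINT = {
    ("has_link", "spam"): 0.30,
    ("no_link", "spam"): 0.10,
    ("has_link", "ham"): 0.12,
    ("no_link", "ham"): 0.48,
}


def p_y_given_x(x):
    """Discriminative question: P(Y | X = x), obtained by normalizing the joint."""
    p_x = sum(p for (xi, _), p in JOINT.items() if xi == x)
    return {y: p / p_x for (xi, y), p in JOINT.items() if xi == x}


def sample_x_given_y(y, rng):
    """Generative question: draw a new X from P(X | Y = y)."""
    items = [(xi, p) for (xi, yi), p in JOINT.items() if yi == y]
    xs, weights = zip(*items)
    # random.choices normalizes the weights, so dividing by P(Y = y) is not needed.
    return rng.choices(xs, weights=weights, k=1)[0]


if __name__ == "__main__":
    print(p_y_given_x("has_link"))
    print(sample_x_given_y("spam", random.Random(0)))
```

A model that stores only P(Y|X) can answer the first question but has nothing to sample X from. A model of the joint can do both.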

Predictive AI is a broader term that encompasses both: any model that produces an output from an input. In practice, "predictive AI" usually signals discriminative or regression tasks in enterprise analytics (forecasting sales, predicting churn, detecting fraud), while "generative AI" signals creative and synthesis tasks. The technical boundary between them is real but not absolute: an autoregressive language model predicts the next token, yet doing so sequentially produces generated text. The prediction mechanism is generative in its effect.

See also: Types of AI and AI, ML, Deep Learning, LLMs, and Algorithms: The Differences for the broader taxonomy.

06. Where It Sits: Generative AI and Its Relationship to LLMs

The relationship is one of category and member.

Generative AI is the broad category: any AI system that generates new content.

Large language models (LLMs) are one specific type of generative AI: autoregressive transformer models trained on text at scale. LLMs generate text. They are not inherently multimodal, though most production LLMs as of 2025 have been extended with vision and audio capabilities.

Other generative AI systems (diffusion image models, music generators, video models) are not LLMs. The term "LLM" should not be used as a synonym for "generative AI," though it often is colloquially.

See What Is a Large Language Model? for a full treatment of what makes a model an LLM.

07. Foundation Models and "Pretrain Once, Adapt Many"

Before foundation models, AI development meant training a new model for each task, requiring labeled data for each one. Foundation models changed this by training a single large model on massive, general unlabeled data, then adapting it to specific tasks through fine-tuning or prompting.

The term "foundation model" was popularized by the Stanford Center for Research on Foundation Models in 2021. As Wikipedia notes, building a foundation model is highly resource-intensive (training costs can reach hundreds of millions of dollars), but adapting one is comparatively cheap. This created an economic structure where a small number of organizations train the base models, and many more build on top of them.

Key properties of foundation models:

  • Trained on broad, general data (text from the internet, image-caption pairs, code repositories)
  • Capable of zero-shot and few-shot generalization to new tasks without retraining
  • Adaptable via fine-tuning, prompt engineering, or parameter-efficient methods such as LoRA
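
Counting trainable weights gives a rough sense of why parameter-efficient methods such as LoRA are cheap. The sketch below assumes the common LoRA formulation, in which a frozen weight matrix of shape d_out x d_in is augmented with two small trainable matrices of rank r. The layer size and rank are hypothetical values chosen for illustration.

```python
def full_finetune_params(d_in, d_out):
    """Trainable weights when the whole d_out x d_in matrix is updated."""
    return d_in * d_out


def lora_params(d_in, d_out, rank):
    """Trainable weights for two low-rank factors: (d_out x rank) and (rank x d_in)."""
    return rank * (d_in + d_out)


if __name__ == "__main__":
    d_in, d_out, rank = 4096, 4096, 8  # hypothetical layer size and rank
    full = full_finetune_params(d_in, d_out)
    lora = lora_params(d_in, d_out, rank)
    print(full, lora, f"{100 * lora / full:.2f}%")
```

For this single layer, the low-rank factors hold 65,536 trainable weights against 16,777,216 for full fine-tuning, or about 0.39 percent.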

GPT-3, GPT-4, Gemini, Claude, Llama, and Stable Diffusion are all foundation models. Not all foundation models are generative (BERT is an encoder-only foundation model used mainly for discriminative tasks such as search and classification), but the most commercially visible ones are.

08. A Short History


  1. 2013: Variational Autoencoders (VAEs) introduced by Kingma and Welling. Among the first scalable deep generative models for images and speech.

  2. 2014: Generative Adversarial Networks (GANs) introduced by Goodfellow et al. Begin the era of photorealistic image synthesis.

  3. 2015: Diffusion models introduced conceptually, drawing from non-equilibrium thermodynamics.

  4. 2017: Google researchers publish "Attention Is All You Need," introducing the transformer architecture. Enables training at dramatically larger scale than recurrent networks.

  5. 2018-2019: BERT and GPT-1/2 demonstrate the power of large pretrained language models. The "pretrain then fine-tune" paradigm becomes standard.

  6. 2020: GPT-3 (175 billion parameters) released by OpenAI. Zero-shot and few-shot capabilities attract wide attention. The DDPM (Denoising Diffusion Probabilistic Models) paper establishes the modern diffusion model framework.

  7. 2021: DALL-E and CLIP (both OpenAI) demonstrate text-to-image generation and text-image alignment. "Foundation model" terminology introduced by Stanford.

  8. 2022: Stable Diffusion released as open weights. DALL-E 2 and Midjourney launch publicly. ChatGPT released in November. Public inflection point: generative AI enters mainstream use.

  9. 2023: GPT-4 with vision, Llama 1/2 (Meta), Claude 1/2 (Anthropic), Gemini (Google). The ecosystem splits into two competing paradigms: open-weights models and closed frontier models.

  10. 2024: Multimodal models become standard. Sora demonstrates high-quality video generation. GPT-4o launches, and Gemini 1.5 Pro offers a 1M-token context window. Agentic AI, where models take sequences of actions through tool use, becomes a primary product pattern.

  11. Most recently: Claude 3/4, Gemini 2.0/3.0, GPT-4.5/5 series. Reasoning models (chain-of-thought with extended compute at inference time) emerge as a distinct capability tier. Real-time voice and video interaction arrives, along with multimodal agents operating across desktop and browser environments.

09. Use Cases

Enterprise and knowledge work:
Document drafting, summarization, contract review, email drafting, meeting notes, internal search, code generation, customer support automation.

Creative production:
Image and illustration generation, marketing copy, music and audio, video production, game asset creation, screenwriting assistance.

Software development:
Code autocomplete, test generation, documentation, debugging assistance. GitHub Copilot and integrated AI coding assistants have become standard developer tools.

Science and research:
Drug and molecule discovery, protein structure prediction support, literature synthesis, simulation data generation.

Education:
Tutoring, personalized explanation, language learning, content generation for curricula.

Consumer:
Conversational assistants, AI companions, personalized content, image editing.

10. Common Pitfalls and Limitations

Hallucination:
Generative models produce statistically plausible text, not factually verified text. An LLM may confidently state a false citation, a wrong date, or a fabricated statistic. This is a structural property of how these models work, not a bug that will be patched. Mitigation strategies (retrieval-augmented generation, grounding, citations) reduce but do not eliminate the problem.

Copyright and training data:
Models trained on internet-scale data have ingested copyrighted material. Whether generating content similar to training data constitutes infringement is unresolved in most jurisdictions as of 2026. Several lawsuits are ongoing.

Compute cost:
Training a frontier model can cost hundreds of millions of dollars. Inference costs are declining but remain nontrivial at scale, and real-time multimodal inference remains expensive.

Non-determinism:
The same prompt can produce different outputs on different runs. This is by design (outputs are sampled from a probability distribution), but it makes generative AI unsuitable for applications requiring exact reproducibility unless additional constraints, such as a fixed random seed or greedy decoding, are applied.
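
The source of this variability is the sampling step. The sketch below assumes the usual temperature-scaled softmax over a model's output scores (logits). The logit values and function names are hypothetical. A temperature of zero is treated here as greedy decoding, which always picks the highest-scoring option.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities; lower temperature sharpens the distribution."""
    scaled = [value / temperature for value in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_index(logits, temperature, rng):
    """Pick an index: greedy when temperature is 0, otherwise sample."""
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


if __name__ == "__main__":
    logits = [2.0, 1.0, 0.1]  # hypothetical scores for three candidate tokens
    rng = random.Random(42)   # fixing the seed makes this run repeatable
    print([sample_index(logits, 1.0, rng) for _ in range(10)])
    print([sample_index(logits, 0, rng) for _ in range(10)])
```

With a fixed seed, the sampled sequence repeats from run to run in this toy setting, and greedy decoding always returns the same index. Production systems may still show small differences even under these constraints, for example from hardware-level floating-point nondeterminism.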

Bias and fairness:
Models inherit statistical patterns from training data, including demographic biases, cultural assumptions, and toxic content. Alignment techniques (RLHF and variants) reduce harmful outputs but do not eliminate them.

Context window limits:
Even models with very long context windows (1M tokens as of 2025) can lose recall and coherence as inputs approach the limit. Long-document tasks require careful engineering.

Evaluation difficulty:
It is hard to objectively measure the quality of generated content. Unlike classification, there is no single ground-truth answer, which makes benchmarking and regression testing more complex.
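
A small example shows why surface-level metrics struggle here. The sketch below implements a simple token-overlap F1 score, one of several common overlap-style metrics. The sentences are made up for illustration.

```python
from collections import Counter


def token_f1(candidate, reference):
    """Harmonic mean of token precision and recall, using whitespace tokens."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    overlap = sum((Counter(cand) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    reference = "the meeting was moved to friday"
    paraphrase = "they rescheduled it for friday"
    print(token_f1(reference, reference))   # identical wording
    print(token_f1(paraphrase, reference))  # same meaning, different wording
```

The paraphrase conveys the same fact as the reference but shares only one token with it, so it scores 2/11 (about 0.18) against 1.0 for the verbatim copy. This is why evaluations of generated text often combine automatic metrics with human or model-based judgment.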