01. What It Is
A multimodal model accepts or generates content in more than one modality. The most common pairing is text plus images, and models built for it are called vision-language models (VLMs), but frontier models increasingly handle audio, video, and structured document formats as well. Such a model does not just "see" an image and describe it. It reasons across modalities together, connecting visual evidence in the image to the logic of the accompanying text.
The term "multimodal" covers both input and output. A model that accepts images but only outputs text is still multimodal. Models that generate images or audio from text prompts (like GPT Image 2 or Gemini's audio synthesis) are also multimodal, though they are architecturally different from the conversational VLMs discussed here.
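The definition above can be sketched as a small data structure. This is only an illustration of the input/output distinction, not any vendor's API: `Modality`, `ModelProfile`, `is_multimodal`, and the example model names are all hypothetical, and the rule used (more than one modality across inputs and outputs combined) is one reasonable reading of the definition given here.

```python
from dataclasses import dataclass
from enum import Enum


class Modality(Enum):
    TEXT = "text"
    IMAGE = "image"
    AUDIO = "audio"
    VIDEO = "video"


@dataclass(frozen=True)
class ModelProfile:
    """Hypothetical record of what a model accepts and what it produces."""
    name: str
    inputs: frozenset
    outputs: frozenset

    def is_multimodal(self) -> bool:
        # Assumed rule: inputs and outputs together span more than one modality.
        return len(self.inputs | self.outputs) > 1


# A conversational VLM: image and text in, text out.
vlm = ModelProfile(
    name="example-vlm",
    inputs=frozenset({Modality.TEXT, Modality.IMAGE}),
    outputs=frozenset({Modality.TEXT}),
)

# A text-to-image generator: text in, image out.
text_to_image = ModelProfile(
    name="example-image-generator",
    inputs=frozenset({Modality.TEXT}),
    outputs=frozenset({Modality.IMAGE}),
)

# A text-only model, for contrast.
text_only = ModelProfile(
    name="example-text-model",
    inputs=frozenset({Modality.TEXT}),
    outputs=frozenset({Modality.TEXT}),
)

for profile in (vlm, text_to_image, text_only):
    print(profile.name, profile.is_multimodal())
```

Under this rule both the VLM and the generator count as multimodal, even though one is multimodal on the input side and the other on the output side. The sketch says nothing about their architectures, which is where the two families differ.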