03. How It Works
Automatic speech recognition (ASR)
ASR converts an audio waveform into a text transcript. Classical approaches used Hidden Markov Models (HMMs) combined with Gaussian mixture models to model phonemes, requiring separate acoustic models, pronunciation dictionaries, and language models. This worked reasonably well for narrow domains but was brittle across accents and noise levels.
Modern ASR is dominated by end-to-end neural approaches. The key steps are:
- Feature extraction. The raw waveform is converted to a log-Mel spectrogram, a 2D representation of how much energy is present at each frequency over time. This is far more compact than raw audio samples while preserving the acoustic information relevant to speech.
- Encoder. A neural network (typically a Transformer or conformer) encodes the spectrogram into a sequence of dense representations.
- Decoder. A second network decodes those representations into text tokens, either frame-by-frame (connectionist temporal classification, CTC) or with full attention (encoder-decoder Transformer).
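To make the CTC decoding step concrete, here is a minimal sketch of greedy CTC decoding: pick the best label per frame, collapse consecutive repeats, then drop the blank symbol. The function name, the toy vocabulary, and the choice of index 0 as the blank are assumptions for illustration. Real systems usually work on log-probabilities and often use beam search with a language model instead.

```python
from itertools import groupby

BLANK = 0  # assumption: index 0 is the CTC blank symbol


def ctc_greedy_decode(frame_scores, vocab):
    """Greedy CTC decoding.

    frame_scores: one list of per-label scores for each audio frame.
    vocab: maps a label index to its character (index BLANK is the blank).
    """
    # 1. Pick the highest-scoring label for every frame.
    best = [max(range(len(scores)), key=scores.__getitem__)
            for scores in frame_scores]
    # 2. Collapse runs of the same label into a single label.
    collapsed = [label for label, _ in groupby(best)]
    # 3. Remove blanks; they only separate genuine repeats.
    return "".join(vocab[i] for i in collapsed if i != BLANK)


# Toy example: frames spell c, c, blank, a, t, t.
vocab = ["_", "c", "a", "t"]
frames = [
    [0.1, 0.7, 0.1, 0.1],
    [0.1, 0.6, 0.2, 0.1],
    [0.8, 0.1, 0.05, 0.05],
    [0.1, 0.1, 0.7, 0.1],
    [0.1, 0.1, 0.1, 0.7],
    [0.1, 0.1, 0.1, 0.7],
]
print(ctc_greedy_decode(frames, vocab))  # prints "cat"
```

The blank symbol is what lets CTC emit doubled letters: the frame sequence a, blank, a decodes to "aa", while a, a decodes to "a".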
OpenAI Whisper (2022) is one of the most widely used open-weight ASR models as of 2026. It was trained on 680,000 hours of multilingual and multitask supervised audio data collected from the web. Whisper uses an encoder-decoder Transformer. Input audio is split into 30-second chunks, converted to a log-Mel spectrogram, and passed through the encoder. The decoder then produces text, with special tokens directing it to perform language identification, transcription, or translation into English. Because it was trained on such varied data, Whisper's zero-shot performance is substantially more robust than that of models trained on narrow benchmarks: OpenAI reports roughly 50% fewer errors than those models when evaluated across many diverse datasets.
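The 30-second chunking can be sketched as follows. The 16 kHz sample rate comes from the Whisper paper rather than from the text above, and the function name is hypothetical. This is a simplification: the reference implementation pads or trims each window in a similar way but advances through long audio using predicted timestamps, not a fixed stride.

```python
SAMPLE_RATE = 16_000   # Whisper resamples all input audio to 16 kHz
CHUNK_SECONDS = 30     # fixed window length the encoder expects
CHUNK_SAMPLES = SAMPLE_RATE * CHUNK_SECONDS  # 480,000 samples per chunk


def split_into_chunks(samples, chunk_samples=CHUNK_SAMPLES):
    """Split a mono waveform into fixed-length chunks.

    The final chunk is zero-padded so every chunk has the same length.
    """
    chunks = []
    for start in range(0, len(samples), chunk_samples):
        chunk = list(samples[start:start + chunk_samples])
        # Pad the last, shorter chunk with silence.
        chunk.extend([0.0] * (chunk_samples - len(chunk)))
        chunks.append(chunk)
    return chunks


# 70 seconds of placeholder audio yields three 30-second chunks;
# the last one holds 10 s of signal followed by 20 s of padding.
audio = [0.5] * (70 * SAMPLE_RATE)
chunks = split_into_chunks(audio)
print(len(chunks), len(chunks[-1]))  # prints "3 480000"
```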
Google's Cloud Speech-to-Text API supports synchronous recognition (audio under 1 minute), asynchronous recognition (up to 480 minutes), and streaming recognition for real-time transcription. It also provides confidence scores for each recognized word and supports automatic punctuation.
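As a rough illustration of how those limits shape client code, the helper below picks a recognition mode from the audio duration. It is a hypothetical function, not part of Google's client library, and the mode names are plain labels chosen for this sketch. The thresholds are the ones quoted above.

```python
SYNC_LIMIT_SECONDS = 60          # synchronous: audio under 1 minute
ASYNC_LIMIT_SECONDS = 480 * 60   # asynchronous: up to 480 minutes


def choose_recognition_mode(duration_seconds, live=False):
    """Return which recognition mode fits a request (hypothetical helper)."""
    if live:
        # Real-time audio has no known duration up front.
        return "streaming"
    if duration_seconds < SYNC_LIMIT_SECONDS:
        return "synchronous"
    if duration_seconds <= ASYNC_LIMIT_SECONDS:
        return "asynchronous"
    raise ValueError("audio exceeds the 480-minute asynchronous limit")


print(choose_recognition_mode(45))        # prints "synchronous"
print(choose_recognition_mode(45 * 60))   # prints "asynchronous"
print(choose_recognition_mode(0, live=True))  # prints "streaming"
```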
Text-to-speech (TTS)
TTS converts written text into a spoken audio waveform. Classical TTS systems concatenated pre-recorded phoneme segments, which sounded robotic. Statistical parametric synthesis used models to generate smooth acoustic parameters, which was more natural but still artificial-sounding.
Modern neural TTS achieves near-human naturalness. The standard pipeline has two stages:
- Acoustic model. Takes phonemes or characters as input and produces a mel spectrogram. Tacotron 2, for example, uses a recurrent (LSTM-based) sequence-to-sequence network with attention; later models such as FastSpeech use Transformers.
- Vocoder. Converts the mel spectrogram to a raw waveform. WaveNet (Google DeepMind, 2016) was the first neural vocoder to produce convincingly natural speech. Modern vocoders like HiFi-GAN are fast enough for real-time synthesis.
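The interface between the two stages is worth a quick calculation: each mel frame stands for one hop of audio, so the vocoder has to expand every frame into hop-length samples. The sketch below assumes a hop length of 256 samples at 22,050 Hz, values commonly seen in Tacotron 2 and HiFi-GAN configurations but not stated in the text above. The function names are hypothetical.

```python
HOP_LENGTH = 256       # assumed: samples of audio per mel frame
SAMPLE_RATE = 22_050   # assumed: output sample rate in Hz


def waveform_length(num_frames, hop_length=HOP_LENGTH):
    """Number of audio samples a vocoder produces for num_frames mel frames."""
    return num_frames * hop_length


def duration_seconds(num_frames, hop_length=HOP_LENGTH, sample_rate=SAMPLE_RATE):
    """Length of the synthesized audio in seconds."""
    return waveform_length(num_frames, hop_length) / sample_rate


# An 860-frame spectrogram becomes 220,160 samples, just under 10 seconds.
print(waveform_length(860))             # prints 220160
print(round(duration_seconds(860), 2))  # prints 9.98
```

This 256-fold upsampling is why vocoder speed matters so much: the acoustic model emits on the order of a hundred frames per second, while the vocoder must emit tens of thousands of samples per second.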
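More recent systems move beyond this two-stage design. VALL-E (Microsoft, 2023) treats TTS as language modeling over discrete audio codec tokens, while Voicebox (Meta, 2023) uses a non-autoregressive flow-matching generative model. Both can synthesize speech in a target voice from a reference audio clip only a few seconds long.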
Voice cloning
Voice cloning extends TTS to match a specific person's voice. Given a short audio sample (sometimes as short as three seconds for modern systems), a voice cloning model learns the speaker's acoustic characteristics, including prosody, timbre, and accent, and applies them to arbitrary new text.
Speaker embedding models (producing d-vectors or x-vectors, for example) extract a compact, fixed-length representation of a speaker's voice identity. This embedding conditions the TTS decoder so that it produces audio that sounds like the target speaker. ElevenLabs, Resemble AI, and OpenAI all offer commercial voice cloning APIs.
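Because a speaker embedding is just a vector, voice identity can be compared with ordinary vector similarity. The sketch below uses cosine similarity on made-up three-dimensional vectors; real d-vectors and x-vectors have hundreds of dimensions, and the 0.7 threshold is an arbitrary assumption that would need tuning on real data.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def same_speaker(a, b, threshold=0.7):
    """Treat two embeddings as the same voice if they point the same way."""
    return cosine_similarity(a, b) >= threshold


# Toy embeddings (illustrative values only).
enrolled = [0.9, 0.1, 0.3]
new_clip = [0.8, 0.2, 0.35]
stranger = [-0.2, 0.9, 0.1]

print(same_speaker(enrolled, new_clip))  # prints True
print(same_speaker(enrolled, stranger))  # prints False
```

The same comparison underlies speaker verification, and it is one way to check how closely cloned audio matches the target voice.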
The ethical dimension is significant. Voice cloning enables deepfakes and fraud. Countermeasures include watermarking synthesized audio and training detection classifiers that distinguish real from cloned speech.
Speaker diarization
Diarization answers the question "who spoke when?" in a multi-speaker recording. A diarization pipeline typically:
- Uses voice activity detection to find speech segments.
- Extracts speaker embeddings for each segment.
- Clusters embeddings by speaker identity using techniques like agglomerative hierarchical clustering.
- Assigns a speaker label to each time interval.
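The clustering and labeling steps can be sketched in plain Python. This example assumes voice activity detection and embedding extraction have already run, so it starts from a list of segments and one embedding per segment. The two-dimensional embeddings, the average-linkage choice, and the 0.5 distance threshold are all assumptions for illustration; the function names are hypothetical.

```python
import math


def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def average_linkage(c1, c2, embeddings):
    """Mean pairwise distance between the members of two clusters."""
    dists = [cosine_distance(embeddings[i], embeddings[j]) for i in c1 for j in c2]
    return sum(dists) / len(dists)


def cluster_speakers(embeddings, threshold=0.5):
    """Agglomerative clustering: merge the closest pair until none is close enough."""
    clusters = [[i] for i in range(len(embeddings))]
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = average_linkage(clusters[a], clusters[b], embeddings)
                if best is None or d < best[0]:
                    best = (d, a, b)
        if best[0] > threshold:
            break  # remaining clusters are treated as different speakers
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    # Number speakers in order of first appearance.
    labels = [0] * len(embeddings)
    for speaker_id, members in enumerate(sorted(clusters, key=min)):
        for i in members:
            labels[i] = speaker_id
    return labels


def diarize(segments, embeddings, threshold=0.5):
    """Attach a speaker label to each (start, end) segment."""
    labels = cluster_speakers(embeddings, threshold)
    return [(start, end, f"SPEAKER_{label}")
            for (start, end), label in zip(segments, labels)]


segments = [(0.0, 2.5), (2.5, 4.0), (4.0, 7.0), (7.0, 9.0)]
embeddings = [[1.0, 0.1], [0.1, 1.0], [0.9, 0.2], [0.2, 0.9]]
for row in diarize(segments, embeddings):
    print(row)
```

With these toy values the first and third segments end up as SPEAKER_0 and the second and fourth as SPEAKER_1. The threshold stands in for the harder real-world problem of deciding how many speakers are present.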
Diarization is critical for meeting transcription, courtroom recordings, and medical dictation where attributing speech to the correct speaker matters.
Audio and music generation
AudioLM (Google, 2022) is a language model trained on audio tokens. It models audio as a sequence of discrete tokens (acoustic tokens from a neural audio codec such as SoundStream, combined with higher-level semantic tokens) and generates continuations that preserve acoustic structure, including background noise, room acoustics, and speaker identity, without explicit signal processing.
MusicLM (Google, 2023) extends this approach to music. Given a text prompt ("a calming piano piece with a jazz feel"), it generates multi-minute musical compositions that match the description. It was trained on a large dataset of music with text descriptions and produces coherent structure across multiple minutes, which earlier models could not maintain.
Suno (2024) is a commercial music generation service that generates full songs with vocals and instrumentation from a short prompt. It uses a combination of language model-style sequence generation and neural audio synthesis. As of 2025-2026, Suno and Udio are the leading consumer-facing music generation tools.
The underlying pattern across these systems is similar: encode audio into a discrete token sequence, train a language model (or diffusion model) on those sequences, and decode back to audio. This is the same approach used for images (with image tokens) and text, which is why multimodal models can in principle handle all three modalities with a unified architecture.
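The encode and decode ends of that pattern can be illustrated with plain vector quantization: each audio frame is replaced by the index of its nearest codebook vector, and decoding looks the vectors back up. This is a toy sketch with a hand-written two-dimensional codebook. Real neural codecs learn the encoder, codebooks, and decoder jointly and typically stack several residual quantizers; the function names here are hypothetical.

```python
def encode(frames, codebook):
    """Map each frame to the index of its nearest codebook vector."""
    def nearest(frame):
        return min(
            range(len(codebook)),
            key=lambda k: sum((f - c) ** 2 for f, c in zip(frame, codebook[k])),
        )
    return [nearest(frame) for frame in frames]


def decode(tokens, codebook):
    """Turn token indices back into (approximate) frames."""
    return [codebook[t] for t in tokens]


codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
frames = [[0.1, -0.1], [0.9, 0.2], [0.2, 0.8]]

tokens = encode(frames, codebook)
print(tokens)                    # prints [0, 1, 2]
print(decode(tokens, codebook))  # prints [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
```

A generative model sits between the two calls: it is trained to predict the next token in sequences like `tokens`, and whatever it generates is passed to the decoder. Decoding is lossy, since each frame comes back as its codebook entry rather than its original value.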
Real-time voice agents
A real-time voice agent chains ASR, an LLM, and TTS into a pipeline whose total latency is low enough for natural conversation (below 500 ms end-to-end). The main challenges are:
- ASR must produce a partial transcript before the user finishes speaking (streaming transcription).
- The LLM must respond quickly, which typically means using a smaller or distilled model.
- TTS must begin synthesizing before the full text response is available (streaming synthesis).
- Turn detection (knowing when the user has finished speaking) must be reliable.
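Turn detection is the least obvious of these, so here is a minimal sketch of the simplest approach: declare the turn finished once the user has spoken and then stayed quiet for a fixed time. The energy threshold, 20 ms frame size, and 300 ms silence window are assumptions chosen for illustration, and the function name is hypothetical. Production systems often replace this rule with a learned model.

```python
def detect_end_of_turn(frame_energies, energy_threshold=0.01,
                       frame_ms=20, min_silence_ms=300):
    """Return the index of the frame where the turn is judged over, or None.

    frame_energies: one energy value per fixed-length audio frame.
    The turn ends after speech has been heard and is then followed by
    min_silence_ms of consecutive low-energy frames.
    """
    frames_needed = min_silence_ms // frame_ms
    heard_speech = False
    silent_run = 0
    for i, energy in enumerate(frame_energies):
        if energy >= energy_threshold:
            heard_speech = True
            silent_run = 0          # any speech resets the silence counter
        elif heard_speech:
            silent_run += 1
            if silent_run >= frames_needed:
                return i
    return None


# 100 ms of leading silence, 200 ms of speech, then 400 ms of silence.
energies = [0.0] * 5 + [0.2] * 10 + [0.0] * 20
print(detect_end_of_turn(energies))  # prints 29

# A 200 ms pause in the middle of speech does not end the turn.
paused = [0.2] * 10 + [0.0] * 10 + [0.2] * 10
print(detect_end_of_turn(paused))    # prints None
```

The silence window is a direct trade-off: every millisecond spent waiting to confirm the user has stopped is added to the response latency, while a window that is too short cuts people off mid-sentence.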
End-to-end audio models like GPT-4o's audio mode skip the ASR step entirely. The model processes audio tokens directly, which reduces latency and allows it to respond to acoustic cues like tone or hesitation that are lost in transcription.