Whisper is OpenAI's general-purpose speech recognition model. Trained on 680,000 hours of weakly supervised audio data from the web, it approaches human-level robustness and accuracy and handles transcription, translation, and language identification in a single model.

Architecture Overview

Whisper uses an encoder-decoder Transformer architecture. The audio encoder processes 30-second chunks of audio: raw audio is converted to an 80-channel log-Mel spectrogram, then two 1D convolution layers (with GELU activation) downsample and project the features. The resulting sequence is combined with sinusoidal positional embeddings and processed through Transformer encoder blocks.
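The shape bookkeeping above can be checked with a few lines of plain Python. The constants (16 kHz audio, fixed 30-second chunks, 10 ms spectrogram stride, and a net 2× downsampling from the convolutions, 3000 → 1500) come from the Technical Details section below; this is a sketch of the arithmetic, not the reference implementation:

```python
# Sketch of Whisper's encoder input/output lengths, using the constants
# cited in this document. Illustrative only.

SAMPLE_RATE = 16_000   # Hz
CHUNK_SECONDS = 30     # fixed chunk length
HOP_MS = 10            # spectrogram stride
CONV_DOWNSAMPLE = 2    # net downsampling across the two conv layers

def encoder_lengths(chunk_seconds: int = CHUNK_SECONDS) -> tuple[int, int, int]:
    """Return (samples, spectrogram frames, encoder time steps) for one chunk."""
    samples = SAMPLE_RATE * chunk_seconds
    hop_samples = SAMPLE_RATE * HOP_MS // 1000   # 160 samples per frame
    frames = samples // hop_samples              # 3000 frames per 30 s
    steps = frames // CONV_DOWNSAMPLE            # 1500 encoder positions
    return samples, frames, steps

print(encoder_lengths())  # (480000, 3000, 1500)
```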

The decoder is an autoregressive Transformer that generates text tokens conditioned on the encoder output via cross-attention. Special tokens control the task: language identification, transcription vs. translation, and whether to predict timestamps. The decoder uses learned positional embeddings and standard causal self-attention.
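The control-token prefix can be illustrated with the special-token names used in the open-source Whisper release (`<|startoftranscript|>`, `<|en|>`, `<|transcribe|>`, `<|translate|>`, `<|notimestamps|>`). The helper below is illustrative, not part of the library's API:

```python
def build_prompt(language: str, task: str, timestamps: bool) -> list[str]:
    """Assemble the decoder's control-token prefix for one segment.

    Token names follow the open-source Whisper release; this helper itself
    is a hypothetical sketch, not the library's tokenizer API.
    """
    if task not in ("transcribe", "translate"):
        raise ValueError(f"unknown task: {task}")
    prompt = ["<|startoftranscript|>", f"<|{language}|>", f"<|{task}|>"]
    if not timestamps:
        prompt.append("<|notimestamps|>")  # suppress timestamp tokens
    return prompt

print(build_prompt("en", "transcribe", timestamps=False))
# ['<|startoftranscript|>', '<|en|>', '<|transcribe|>', '<|notimestamps|>']
```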

At inference, the model processes audio in 30-second segments with a sliding window, using beam search or greedy decoding to generate the transcript.
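At a high level, the sliding-window loop amounts to cutting the waveform into fixed 30-second windows, zero-padding the last one, and decoding each in turn. The real implementation advances the window using predicted timestamps rather than a fixed stride, and `transcribe_chunk` below is a hypothetical stand-in for a model forward pass:

```python
SAMPLE_RATE = 16_000
CHUNK_SAMPLES = 30 * SAMPLE_RATE  # one 30-second window

def transcribe_chunk(chunk: list[float]) -> str:
    """Hypothetical stand-in for decoding one padded 30 s chunk."""
    return f"[{len(chunk)} samples]"

def sliding_transcribe(audio: list[float]) -> list[str]:
    """Split audio into fixed 30 s windows, pad the last, decode each.

    Sketch only: Whisper's actual loop shifts the window based on the
    timestamps predicted for the previous segment.
    """
    texts = []
    for start in range(0, len(audio), CHUNK_SAMPLES):
        chunk = audio[start:start + CHUNK_SAMPLES]
        chunk = chunk + [0.0] * (CHUNK_SAMPLES - len(chunk))  # zero-pad to 30 s
        texts.append(transcribe_chunk(chunk))
    return texts

print(sliding_transcribe([0.0] * (70 * SAMPLE_RATE)))  # three 30 s windows
```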

Key Innovations

  • Weak supervision at scale: Training on 680K hours of internet audio with existing transcripts (rather than human-annotated data) enabled massive scale without expensive labeling
  • Multitask training format: A single model handles multiple tasks (transcription, translation, language ID, timestamp prediction) through special token prompting
  • Robustness: Diverse training data makes Whisper robust to accents, background noise, technical jargon, and domain-specific vocabulary without fine-tuning
  • Multilingual: Supports 99 languages (quality varies with how well each is represented in the training data) and can translate speech from any supported language into English; translation into other target languages is not supported

Common Use Cases

Automatic speech recognition (ASR), real-time transcription, podcast/video transcription, meeting notes, subtitle generation, voice assistants, multilingual translation, and as the audio component in multimodal AI pipelines.

Notable Variants & Sizes

Whisper Tiny (39M, 4 layers), Base (74M, 6 layers), Small (244M, 12 layers), Medium (769M, 24 layers), Large (1.5B, 32 layers), plus the retrained Large-v2 and Large-v3 checkpoints. Community projects include Faster-Whisper (CTranslate2-based, up to ~4× faster), Whisper.cpp (CPU-friendly C/C++ inference), and Distil-Whisper (distilled for speed).
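The size ladder above fits in a small lookup table (parameter counts in millions and encoder/decoder depths are taken from the list above; the full Large count is ~1550M). The selection helper is a hypothetical convenience for picking a checkpoint under a memory budget, not part of any Whisper API:

```python
# (parameters in millions, encoder/decoder layers), per the sizes listed above.
WHISPER_SIZES = {
    "tiny":   (39, 4),
    "base":   (74, 6),
    "small":  (244, 12),
    "medium": (769, 24),
    "large":  (1550, 32),  # listed as 1.5B above
}

def largest_model_under(budget_m: int) -> str:
    """Pick the largest checkpoint whose parameter count fits the budget (in M)."""
    fitting = [(params, name) for name, (params, _) in WHISPER_SIZES.items()
               if params <= budget_m]
    if not fitting:
        raise ValueError("no checkpoint fits the budget")
    return max(fitting)[1]

print(largest_model_under(300))  # small
```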

Technical Details

Whisper Large: 32 encoder + 32 decoder layers, 20 attention heads, model dim 1280, FFN dim 5120. Audio input: 16 kHz sample rate, 80-channel log-Mel spectrogram with 25 ms windows and 10 ms stride, producing 3000 frames per 30-second chunk. After the two conv layers (kernel size 3; the first with stride 1, the second with stride 2), this becomes 1500 time steps. Decoder vocabulary: 51,865 tokens (multilingual), the GPT-2 BPE vocabulary extended with special tokens. Training: AdamW with a linear warmup over the first 2048 steps, on 680K hours of audio whose non-English portion spans 96 languages. The released checkpoints run in FP16 and process 30 seconds of audio per forward pass.
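As a sanity check, these dimensions roughly reproduce the 1.5B parameter count. Counting only the dominant weight matrices (attention projections, FFN layers, and the token embedding, ignoring biases, layer norms, convolutions, and positional embeddings), a back-of-the-envelope estimate under those simplifying assumptions:

```python
# Rough parameter count for Whisper Large from the dimensions above.
D, FFN, VOCAB = 1280, 5120, 51865
ENC_LAYERS = DEC_LAYERS = 32

attn = 4 * D * D   # Q, K, V, and output projections
ffn = 2 * D * FFN  # up- and down-projection

enc_layer = attn + ffn      # self-attention + FFN
dec_layer = 2 * attn + ffn  # self-attention + cross-attention + FFN
embed = VOCAB * D           # token embedding matrix

total = ENC_LAYERS * enc_layer + DEC_LAYERS * dec_layer + embed
print(f"{total / 1e9:.2f}B parameters")  # 1.53B parameters
```

The estimate lands within a few percent of the quoted 1.5B, which is about what one expects given the terms ignored.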