VQ-VAE (Vector Quantized Variational Autoencoder) and VQGAN (Vector Quantized GAN) learn discrete codebook representations of images, enabling powerful image generation by converting the continuous pixel space into a finite vocabulary of visual tokens that can be modeled with autoregressive or other sequence models.

Architecture Overview

VQ-VAE consists of an encoder, a discrete codebook, and a decoder. The encoder maps an input image to a grid of continuous latent vectors. Each latent vector is then quantized by finding the nearest vector in a learned codebook (dictionary) of K embedding vectors. The decoder reconstructs the image from these quantized latent codes.

The quantization process: given encoder output z_e(x), the discrete code is q(z_e) = e_k where k = argmin_j ||z_e - e_j||_2 (nearest-neighbor lookup in the codebook). Since argmin is non-differentiable, the straight-through estimator copies gradients from the decoder input directly to the encoder output, treating quantization as the identity in the backward pass. Because this gives the codebook itself no gradient, training additionally uses a codebook loss (or EMA updates) to pull codes toward the encoder outputs assigned to them, and a commitment loss to pull encoder outputs toward their chosen codes.
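The nearest-neighbor lookup above can be sketched in a few lines of NumPy. This is a minimal illustration, not the taming-transformers implementation; the straight-through trick itself needs an autodiff framework, so it appears only as a comment.

```python
import numpy as np

def quantize(z_e, codebook):
    """Nearest-neighbor lookup: map each encoder vector to its closest code.

    z_e:      (N, D) encoder outputs (N latent positions, D channels)
    codebook: (K, D) learned embedding vectors
    Returns the quantized vectors (N, D) and their codebook indices (N,).
    """
    # Squared L2 distance from every latent vector to every codebook entry.
    dists = ((z_e[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)  # (N, K)
    indices = dists.argmin(axis=1)   # k = argmin_j ||z_e - e_j||_2
    z_q = codebook[indices]          # q(z_e) = e_k
    return z_q, indices

# In an autodiff framework the straight-through estimator is usually written
#   z_q = z_e + stop_gradient(z_q - z_e)
# so the forward pass uses z_q while gradients flow to z_e unchanged.

codebook = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]])
z_e = np.array([[0.9, 1.1], [0.1, -0.1]])
z_q, idx = quantize(z_e, codebook)
print(idx.tolist())  # [1, 0]
```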

VQGAN improves upon VQ-VAE by using a more powerful encoder-decoder (with residual blocks and attention layers), adding a PatchGAN discriminator for adversarial training, and using a perceptual loss. This produces much sharper, more detailed reconstructions and a more expressive codebook.

Key Innovations

  • Discrete latent space: Quantizing to a finite codebook avoids posterior collapse (a problem in VAEs) and enables modeling with discrete sequence models like GPT
  • Straight-through estimator: Bypasses the non-differentiable quantization by copying gradients, enabling end-to-end training
  • Codebook learning: EMA (exponential moving average) updates for codebook vectors provide more stable training than gradient-based updates
  • VQGAN's adversarial training: Adding a discriminator and perceptual loss dramatically improves reconstruction quality, enabling high-fidelity image tokenization
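The EMA codebook update mentioned above can be sketched as follows. This is a minimal NumPy version of the update rule from the VQ-VAE appendix; the decay and eps values are common defaults, not prescribed by this document.

```python
import numpy as np

def ema_codebook_update(codebook, cluster_size, ema_sum, z_e, indices,
                        decay=0.99, eps=1e-5):
    """One EMA update step for the codebook (in-place).

    cluster_size: (K,)   running count of vectors assigned to each code
    ema_sum:      (K, D) running sum of vectors assigned to each code
    """
    K, D = codebook.shape
    one_hot = np.eye(K)[indices]      # (N, K) hard assignments
    counts = one_hot.sum(axis=0)      # vectors per code in this batch
    sums = one_hot.T @ z_e            # per-code sum of assigned vectors

    # Exponential moving averages of counts and sums.
    cluster_size[:] = decay * cluster_size + (1 - decay) * counts
    ema_sum[:] = decay * ema_sum + (1 - decay) * sums

    # Laplace smoothing avoids division by zero for unused codes.
    n = cluster_size.sum()
    smoothed = (cluster_size + eps) / (n + K * eps) * n
    codebook[:] = ema_sum / smoothed[:, None]
    return codebook

# Toy usage: with decay=0 a code jumps straight to the mean of its vectors.
codebook = np.array([[0.0], [10.0]])
cluster_size = np.zeros(2)
ema_sum = np.zeros((2, 1))
z_e = np.array([[1.0], [3.0]])
ema_codebook_update(codebook, cluster_size, ema_sum, z_e,
                    np.array([0, 0]), decay=0.0)
```

In practice this replaces the gradient-based codebook loss: codes track a moving average of the encoder outputs assigned to them, which tends to train more stably.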

Common Use Cases

Image generation (when combined with autoregressive models), image tokenization for multimodal LLMs, image compression, audio generation (SoundStream, EnCodec), video tokenization, texture synthesis, and as the tokenizer in models like DALL-E (original), Parti, and LlamaGen.

Notable Variants & Sizes

VQ-VAE (original), VQ-VAE-2 (hierarchical with multiple scales), VQGAN (adversarial training), RQ-VAE (residual quantization for better codebook utilization), FSQ (Finite Scalar Quantization, no codebook), LFQ (Lookup-Free Quantization). Codebook sizes range from 256 to 16384 entries. DALL-E 1 used a dVAE (discrete VAE) with 8192 tokens.
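To make the FSQ variant concrete, here is a minimal sketch of its core idea: each latent channel is bounded and rounded to a small number of levels, so the "codebook" is just the product of per-channel grids and nothing is learned. This is an illustration only; the actual FSQ formulation also handles even level counts with a half-step offset, which is omitted here by using odd levels.

```python
import numpy as np

def fsq_quantize(z, levels=(5, 5, 5)):
    """Finite Scalar Quantization sketch (odd level counts only).

    Each channel i is squashed with tanh to [-half, half] and rounded,
    giving levels[i] possible values. The implicit codebook size is the
    product of the levels (here 5*5*5 = 125), with no learned codebook.
    """
    z = np.asarray(z, dtype=float)
    out = np.empty_like(z)
    for i, L in enumerate(levels):
        half = (L - 1) / 2
        bounded = np.tanh(z[..., i]) * half   # map channel to [-half, half]
        out[..., i] = np.round(bounded)       # straight-through round in training
    return out

# Large activations saturate to the outermost levels, zero stays at zero.
q = fsq_quantize(np.array([[10.0, 0.0, -10.0]]))
print(q.tolist())  # [[2.0, 0.0, -2.0]]
```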

Technical Details

VQGAN (taming-transformers):

  • Encoder/Decoder: ResNet blocks with self-attention at 16×16 resolution
  • Codebook: K=1024 vectors of dimension 256
  • Compression: 256×256 input image → 16×16 grid of codes (16× downsampling per side) = 256 tokens per image
  • Losses: L1 reconstruction + perceptual (LPIPS) + codebook commitment (β=0.25) + adversarial (PatchGAN discriminator)
  • Training: Adam, base lr 4.5e-6 (scaled by batch size), batch size 6-12, trained until convergence on ImageNet/OpenImages
  • Codebook update: exponential moving average (EMA) of the encoder outputs assigned to each code
  • Usage for generation: train an autoregressive Transformer (GPT-like) on the flattened 16×16 code sequences, then decode generated codes with the VQGAN decoder
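The second-stage generation loop can be sketched as below. Note the arithmetic: a 16×16 grid means 256 autoregressive sampling steps per image. Here `sample_next_token` is a hypothetical stand-in for the trained GPT-like prior (uniform sampling for illustration), and the final VQGAN decoder call is omitted.

```python
import numpy as np

GRID = 16   # 256x256 input -> 16x16 latent grid
K = 1024    # codebook size

rng = np.random.default_rng(0)

def sample_next_token(prefix):
    """Stand-in for the prior p(s_i | s_<i); a real model would condition
    on the prefix of previously generated code indices."""
    return int(rng.integers(0, K))

tokens = []
for _ in range(GRID * GRID):          # 256 autoregressive steps
    tokens.append(sample_next_token(tokens))

code_grid = np.array(tokens).reshape(GRID, GRID)
# code_grid would now be embedded via the codebook and passed through the
# VQGAN decoder to produce a 256x256 image.
print(code_grid.shape)  # (16, 16)
```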