The Variational Autoencoder (VAE) is a generative model that learns a continuous latent representation of data by combining an autoencoder architecture with variational Bayesian inference, enabling both meaningful data compression and generation of new samples.
Architecture Overview
A VAE consists of an encoder (inference network) q_φ(z|x) that maps input data to a distribution over latent codes, and a decoder (generative network) p_θ(x|z) that reconstructs data from latent codes. Unlike a standard autoencoder, the VAE encoder outputs distribution parameters (mean μ and log-variance log σ²) rather than point estimates.
During training, the encoder produces μ and log σ² from input x. A latent code is then sampled via the reparameterization trick, z = μ + σ ⊙ ε with ε ~ N(0,I), which makes sampling differentiable and allows backpropagation through the stochastic node. The decoder reconstructs x from z, and the loss combines reconstruction quality with latent regularity.
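The reparameterization step described above can be sketched in a few lines of numpy. The function and variable names here are illustrative, not from any particular library; the point is that the noise ε is an input to a deterministic function of μ and log σ², so gradients can flow through both.

```python
import numpy as np

def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, I).

    Stochasticity lives entirely in eps, so z is a deterministic,
    differentiable function of the encoder outputs mu and log_var.
    """
    sigma = np.exp(0.5 * log_var)        # log-variance -> standard deviation
    eps = rng.standard_normal(mu.shape)  # noise is an *input*, not a parameter
    return mu + sigma * eps

rng = np.random.default_rng(0)
mu = np.zeros((4, 2))        # batch of 4, latent dim 2
log_var = np.zeros((4, 2))   # sigma = 1
z = reparameterize(mu, log_var, rng)
```

As the variance shrinks toward zero, z collapses onto μ, which is why the encoder can trade off stochasticity against reconstruction accuracy.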
The training objective is the Evidence Lower Bound (ELBO): L = E_{q(z|x)}[log p(x|z)] - KL(q(z|x) || p(z)). The first term is the expected reconstruction log-likelihood (maximized in practice by minimizing an MSE or binary cross-entropy reconstruction loss); the second is the KL divergence between the learned posterior and the prior p(z) = N(0,I), which regularizes the latent space. Training minimizes -L.
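A minimal numerical sketch of the negative ELBO, assuming a diagonal-Gaussian posterior and Bernoulli decoder outputs (the function names and toy inputs below are illustrative):

```python
import numpy as np

def gaussian_kl(mu, log_var):
    """Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over latent dims."""
    return -0.5 * np.sum(1 + log_var - mu**2 - np.exp(log_var), axis=-1)

def bce(x, x_hat, eps=1e-7):
    """Binary cross-entropy reconstruction loss, summed over features."""
    x_hat = np.clip(x_hat, eps, 1 - eps)
    return -np.sum(x * np.log(x_hat) + (1 - x) * np.log(1 - x_hat), axis=-1)

# Negative ELBO for one toy (x, x_hat, mu, log_var) tuple:
x = np.array([[1.0, 0.0, 1.0]])
x_hat = np.array([[0.9, 0.1, 0.8]])
mu = np.array([[0.0, 0.0]])
log_var = np.array([[0.0, 0.0]])
loss = bce(x, x_hat) + gaussian_kl(mu, log_var)   # minimize -ELBO
```

Note that when the posterior exactly matches the prior (μ = 0, log σ² = 0), the KL term is zero, so all remaining loss is reconstruction error.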
Key Innovations
- Reparameterization trick: Expressing z = μ + σ·ε moves stochasticity to an input (ε), enabling gradient flow through the sampling operation
- Principled latent regularization: The KL divergence term encourages a smooth, continuous latent space where nearby points decode to similar outputs, enabling interpolation and generation
- Variational inference framework: Casts generation as approximate Bayesian inference, providing a theoretically grounded objective (ELBO)
- Continuous latent space: Unlike the unregularized latent space of a standard autoencoder, the smooth Gaussian latent space supports meaningful interpolation and latent-space arithmetic
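The interpolation property from the list above can be illustrated directly in latent space. This sketch assumes two latent codes from a trained encoder; linear interpolation is the simplest scheme, though spherical interpolation (slerp) is often preferred for Gaussian latents.

```python
import numpy as np

def interpolate(z_a, z_b, n_steps=8):
    """Linear interpolation between two latent codes.

    Each row of the result would be fed to the decoder to produce a
    sample; a smooth latent space yields a smooth visual transition.
    """
    t = np.linspace(0.0, 1.0, n_steps)[:, None]
    return (1 - t) * z_a + t * z_b

# Hypothetical latent codes for two inputs (e.g. two digits on MNIST):
z_a = np.array([-1.0, 0.5])
z_b = np.array([1.0, -0.5])
path = interpolate(z_a, z_b)   # shape (8, 2); decode each row
```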
Common Use Cases
Image generation, data augmentation, anomaly detection (high reconstruction error = anomaly), drug discovery (molecular generation), latent space interpolation, representation learning, missing data imputation, and as the compression component in latent diffusion models (Stable Diffusion's VAE).
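The anomaly-detection use case reduces to thresholding per-sample reconstruction error. In the sketch below, `x_hat` stands in for the output of a trained VAE's decoder so the scoring logic can be shown in isolation; the threshold would normally be chosen from validation-set score percentiles.

```python
import numpy as np

def anomaly_scores(x, x_hat):
    """Per-sample mean squared reconstruction error; high scores flag anomalies."""
    return np.mean((x - x_hat) ** 2, axis=-1)

x = np.array([[0.0, 0.0], [0.0, 0.0], [5.0, 5.0]])       # third row is out-of-distribution
x_hat = np.array([[0.1, -0.1], [0.0, 0.2], [0.5, 0.4]])  # VAE reconstructs it poorly
scores = anomaly_scores(x, x_hat)
threshold = 1.0               # illustrative; pick from validation scores in practice
flags = scores > threshold    # only the third sample is flagged
```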
Notable Variants & Sizes
Standard VAE (Kingma & Welling, 2014), β-VAE (disentangled representations via β > 1 on the KL term), VQ-VAE (discrete latent space), NVAE (deep hierarchical VAE with strong image-generation results at publication), VAE-GAN (adversarial decoder), CVAE (conditional generation), Ladder VAE (hierarchical with skip connections). Stable Diffusion's KL-f8 VAE: encoder/decoder with ~83M params, 4 latent channels, 8× spatial downsampling.
Technical Details
Standard image VAE:
- Encoder: Conv2d layers (3→32→64→128→256 channels) with stride 2, BatchNorm, ReLU → flatten → FC producing 2×d_latent outputs (μ and log σ²)
- Decoder: FC → reshape → ConvTranspose2d layers mirroring the encoder → sigmoid/tanh output
- Latent dim: 2-512 depending on data complexity
- KL divergence (closed form for diagonal Gaussians): KL = -0.5 Σ(1 + log σ² - μ² - σ²)
- Common issue: posterior collapse (KL→0, decoder ignores z), addressed by KL annealing (warming β from 0→1), free bits, or cyclical annealing
- Training: Adam (lr=1e-3 to 1e-4), batch size 64-256
- Example scale: on MNIST with d_latent=20, ~1M params; 2D projections of the latent space show smooth transitions between digits
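The KL annealing mitigation mentioned above can be implemented as a simple schedule for the coefficient β multiplying the KL term. This is a minimal sketch; the function name, default warm-up length, and cyclical variant are illustrative, not from any particular codebase.

```python
def kl_weight(step, warmup_steps=10_000, cycle=None):
    """KL annealing coefficient beta in [0, 1].

    A linear warm-up lets the decoder learn to use z before the KL
    penalty bites, reducing the risk of posterior collapse. Passing
    `cycle` restarts the warm-up periodically (cyclical annealing).
    """
    if cycle is not None:
        step = step % cycle
    return min(1.0, step / warmup_steps)

# The training loss at each step is then: reconstruction + beta * KL
betas = [kl_weight(s) for s in (0, 5_000, 10_000, 50_000)]
```

With cyclical annealing, β repeatedly ramps from 0 back up to 1, giving the decoder periodic "breathing room" to re-engage with the latent code.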