Latent Diffusion Models (LDMs), best known through Stable Diffusion, generate high-quality images by running the diffusion process in a compressed latent space rather than in pixel space, dramatically reducing computational cost while preserving image quality.
Architecture Overview
LDMs consist of three main components: a Variational Autoencoder (VAE) that compresses images to and from a latent space, a U-Net (or DiT in newer versions) that performs the denoising diffusion process in latent space, and conditioning modules (text encoder, typically CLIP) that guide generation.
The process works as follows. During training, the VAE encoder maps each image to a latent (typically with 8× spatial compression), noise is added to the latent according to a schedule, and the U-Net learns to predict that noise. During generation, a latent initialized with Gaussian noise is iteratively denoised by the U-Net (typically over 20-50 sampler steps), conditioned on text embeddings via cross-attention layers. The final denoised latent is decoded back to pixel space by the VAE decoder.
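The training-time forward (noising) process can be sketched in a few lines of pure Python on a single scalar "latent" value. This is a minimal illustration, not library code: the names (`q_sample`, `T`, `betas`) are made up here, a plain linear β schedule is used for simplicity, and real LDMs apply the same closed-form step elementwise to 4-channel latent tensors, training the U-Net to predict `noise` from the noised latent and the timestep.

```python
import math
import random

# Minimal sketch of the forward diffusion (noising) process on one scalar.
# A plain linear beta schedule is used here purely for illustration.
T = 1000
betas = [0.00085 + (0.012 - 0.00085) * t / (T - 1) for t in range(T)]

alpha_bar = []  # cumulative product of (1 - beta_t)
prod = 1.0
for b in betas:
    prod *= 1.0 - b
    alpha_bar.append(prod)

def q_sample(x0, t, noise):
    """Noise a clean latent x0 to timestep t in one closed-form step."""
    return math.sqrt(alpha_bar[t]) * x0 + math.sqrt(1.0 - alpha_bar[t]) * noise

x0 = 0.5
noise = random.gauss(0.0, 1.0)
x_early = q_sample(x0, 0, noise)      # near t=0, the latent is almost clean
x_late = q_sample(x0, T - 1, noise)   # near t=T, the signal is almost gone
```

Because `alpha_bar` decays toward zero by the final timestep, a fully noised latent is statistically close to pure Gaussian noise, which is what allows generation to start from random noise.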
The U-Net contains ResNet blocks for spatial processing, self-attention blocks for global context, and cross-attention blocks that attend to the text encoder's output, enabling text-conditional generation.
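The cross-attention step can be sketched as plain scaled dot-product attention, softmax(QKᵀ/√d)·V, where queries come from latent spatial positions and keys/values from text tokens. This is a toy pure-Python version with tiny dimensions; in SD 1.5 the queries are projections of U-Net feature maps and K/V are projections of the 77 CLIP token embeddings. All names and values here are illustrative.

```python
import math

# Toy cross-attention: 2 latent positions (queries) attend over 3 text tokens.
def transpose(M):
    return [list(col) for col in zip(*M)]

def matmul(A, B):
    # result[i][j] = dot(row i of A, column j of B)
    return [[sum(a * b for a, b in zip(row, col)) for col in transpose(B)]
            for row in A]

def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    s = sum(exps)
    return [e / s for e in exps]

def cross_attention(Q, K, V):
    d = len(Q[0])
    scores = matmul(Q, transpose(K))  # (queries x tokens) similarity
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, V)         # convex mix of token values per query

Q = [[1.0, 0.0], [0.0, 1.0]]                  # latent-position queries
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]      # text-token keys
V = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]      # text-token values
out = cross_attention(Q, K, V)  # each output row is a weighted mix of V rows
```

Because each attention row is a softmax distribution over text tokens, every spatial position pulls in a different blend of prompt information, which is what makes spatially detailed text control possible.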
Key Innovations
- Latent space diffusion: Operating in a compressed latent space (e.g., 512×512×3 pixels → a 64×64×4 latent, a 48× reduction in element count) cuts the cost of each denoising step dramatically compared to pixel-space diffusion while preserving perceptual quality
- Cross-attention conditioning: Injecting text embeddings via cross-attention at multiple U-Net layers enables flexible, detailed text-to-image control
- Classifier-free guidance (CFG): Training with random dropping of conditioning and interpolating between conditional and unconditional predictions at inference improves adherence to text prompts
- Pretrained components: Using a separately trained VAE and text encoder (CLIP) allows each component to be optimized independently
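The CFG step from the list above reduces to a one-line extrapolation between two U-Net predictions. In this sketch, `eps_uncond` and `eps_cond` stand in for two forward passes (empty prompt vs. the text prompt); the values and the guidance scale of 7.5 are illustrative, with 7.5 being a common default.

```python
# Classifier-free guidance: combine unconditional and conditional
# noise predictions at a single denoising step.
def cfg_combine(eps_uncond, eps_cond, guidance_scale):
    """Extrapolate from the unconditional toward the conditional prediction."""
    return [u + guidance_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]

eps_uncond = [0.10, -0.20]  # predicted noise, unconditional (empty-prompt) pass
eps_cond = [0.30, -0.10]    # predicted noise, text-conditioned pass
eps = cfg_combine(eps_uncond, eps_cond, 7.5)
# guidance_scale = 1.0 recovers the conditional prediction, 0.0 the
# unconditional one; values > 1 exaggerate the prompt's influence.
```

This is why CFG doubles inference cost per step (two U-Net passes) and why very high scales can over-saturate or distort images: the extrapolation pushes the prediction beyond what the model saw during training.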
Common Use Cases
Text-to-image generation, image-to-image translation, inpainting, outpainting, super-resolution, style transfer, concept customization (DreamBooth, LoRA), ControlNet-guided generation, and video generation (by extending the architecture along the temporal dimension).
Notable Variants & Sizes
Stable Diffusion 1.5 (860M U-Net), SD 2.0/2.1 (OpenCLIP text encoder), SDXL (2.6B U-Net + refiner), SD 3.0 (MMDiT replacing U-Net, 2B), Stable Diffusion 3.5 (8B). DALL-E 2 (diffusion + CLIP prior), Imagen (T5 text encoder + pixel-space cascaded diffusion), Playground, and various community fine-tunes and LoRAs.
Technical Details
SD 1.5: VAE (encoder 34M + decoder 49M params, latent dim 4, 8× compression), U-Net (860M, 4 down/up stages, channels [320, 640, 1280, 1280], ResBlocks + SelfAttn + CrossAttn at 64/32/16/8 spatial resolutions), CLIP ViT-L/14 text encoder (123M, 77-token max, 768-dim). Training: 256 A100s, batch size 2048, 600K steps at 256×256, then fine-tuning at 512×512. Noise schedule: scaled-linear β (linear in √β) from 0.00085 to 0.012 over 1000 diffusion steps. Inference: DDPM, DDIM, or DPM-Solver samplers in 20-50 steps; CFG scale typically 7-11. SDXL uses channels [320, 640, 1280] with more transformer blocks per stage and two text encoders (CLIP ViT-L + OpenCLIP ViT-bigG) concatenated for 2048-dim conditioning.
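The stated schedule endpoints can be turned into concrete per-step values directly. This sketch assumes the "scaled-linear" variant used in Stable Diffusion's reference configuration (β is linear in √β space between the endpoints, rather than linear in β itself); variable names are illustrative.

```python
import math

# SD 1.5-style noise schedule with the stated endpoints:
# beta_start=0.00085, beta_end=0.012, T=1000, scaled-linear variant.
T = 1000
s0, s1 = math.sqrt(0.00085), math.sqrt(0.012)
betas = [(s0 + (s1 - s0) * t / (T - 1)) ** 2 for t in range(T)]

alpha_bar, prod = [], 1.0
for b in betas:
    prod *= 1.0 - b       # cumulative signal retention up to step t
    alpha_bar.append(prod)

# alpha_bar[0] is ~0.999 (almost pure signal); alpha_bar[-1] is under 0.01
# (almost pure noise), so sampling can begin from a Gaussian latent.
```

Samplers like DDIM and DPM-Solver then select a 20-50 step subsequence of these 1000 timesteps at inference, which is why generation needs far fewer steps than the training schedule length.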