Consistency Models are a new family of generative models that enable high-quality single-step image generation by learning to map any point along a diffusion trajectory directly to the trajectory's origin (the clean data point), without requiring iterative denoising.
Architecture Overview
A consistency model f_θ(x_t, t) learns a function with the consistency property: for any two points x_t and x_{t'} on the same probability flow ODE trajectory, f_θ(x_t, t) = f_θ(x_{t'}, t'). In other words, all points on a trajectory map to the same output: the clean data point x_0 at the trajectory's origin.
The model architecture is based on a standard diffusion model backbone (U-Net or DiT) with one key modification: a skip-connection parameterization that enforces the boundary condition f_θ(x, ε) = x at the smallest timestep ε. Concretely, f_θ(x, t) = c_skip(t)·x + c_out(t)·F_θ(x, t), where c_skip and c_out are differentiable functions with c_skip(ε) = 1 and c_out(ε) = 0.
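The parameterization above can be sketched in a few lines. This is a minimal scalar sketch, not the actual implementation: the coefficient formulas are the EDM-style choice given later under Technical Details, the constants σ_data = 0.5 and ε = 0.002 are assumed defaults, and the backbone F_θ is a placeholder callable.

```python
import math

SIGMA_DATA = 0.5   # sigma_data from the EDM parameterization (assumed value)
EPS = 0.002        # smallest timestep epsilon (assumed value)

def c_skip(t):
    # equals 1 exactly at t = EPS, so the input passes through unchanged
    return SIGMA_DATA**2 / ((t - EPS)**2 + SIGMA_DATA**2)

def c_out(t):
    # equals 0 exactly at t = EPS, silencing the network output
    return SIGMA_DATA * (t - EPS) / math.sqrt(SIGMA_DATA**2 + t**2)

def consistency_forward(F_theta, x, t):
    # f_theta(x, t) = c_skip(t) * x + c_out(t) * F_theta(x, t)
    return c_skip(t) * x + c_out(t) * F_theta(x, t)

# Boundary-condition check with an arbitrary dummy backbone:
F = lambda x, t: 123.0
assert abs(consistency_forward(F, 0.7, EPS) - 0.7) < 1e-12  # f(x, eps) = x
```

Because c_skip and c_out are differentiable in t, the boundary condition is built into the architecture rather than enforced by an extra loss term.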
Two training approaches exist: Consistency Distillation (CD) distills from a pretrained diffusion model by enforcing consistency between adjacent ODE trajectory points, and Consistency Training (CT) learns directly from data without a teacher model using a modified objective.
Key Innovations
- Single-step generation: After training, images can be generated in a single forward pass (t=T → x_0), unlike diffusion models requiring 20-1000 steps
- Flexible quality-compute tradeoff: While single-step generation works, additional steps progressively improve quality; the model supports 1, 2, 4, or more refinement steps
- Self-consistency property: The mathematical constraint that all trajectory points map to the same output provides a principled training signal
- Zero-shot capabilities: The consistency property enables image editing tasks (inpainting, colorization, super-resolution) without task-specific training
Common Use Cases
Fast image generation (single-step or few-step), real-time image synthesis, image editing (inpainting, interpolation), super-resolution, and as a fast alternative to diffusion models in latency-sensitive applications.
Notable Variants & Sizes
Consistency Models (original), Latent Consistency Models (LCM, operating in latent space for Stable Diffusion), Improved Consistency Training (iCT, better training recipe), Consistency Trajectory Models (CTM, generalize both consistency and diffusion). LCM-LoRA enables 2-4 step generation from any SD model. Sizes follow the backbone: 400M-2B+ parameters.
Technical Details
Architecture: same as the EDM diffusion model (U-Net with attention), with skip-connection parameterization c_skip(t) = σ_data²/((t − ε)² + σ_data²), c_out(t) = σ_data·(t − ε)/√(σ_data² + t²), so that c_skip(ε) = 1 and c_out(ε) = 0.
CD training: sample x_0, add noise at adjacent timesteps t_{n+1} and t_n, run one ODE-solver step t_{n+1} → t_n using the teacher's score, and minimize ||f_θ(x_{t_{n+1}}, t_{n+1}) − f_{θ⁻}(x̂_{t_n}, t_n)||², where θ⁻ is an EMA of θ.
CT training: the same objective without the ODE step, using x_0 + t_n·z (same noise sample z) as the target input.
Schedule: the number of discretization steps N grows from 2 to 150 over training; the EMA rate μ anneals from 0.999 to 0.9999.
Results: on CIFAR-10, single-step FID 3.55 (CD) and 2-step FID 2.93.
LCM: adds a CFG-scale embedding and operates in Stable Diffusion's latent space for 2-4 step generation.
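The CD recipe can be sketched for a single scalar sample. This is a hedged illustration, not the reference implementation: `teacher_D` stands in for the pretrained teacher's denoiser, the ODE step is an Euler step of the EDM probability-flow ODE dx/dt = (x − D(x, t))/t, and `f_theta`/`f_ema` are placeholder callables for the student and its EMA copy θ⁻.

```python
import random

def euler_pf_ode_step(teacher_D, x, t_from, t_to):
    # One Euler step of dx/dt = (x - D(x, t)) / t from t_from to t_to,
    # using the (hypothetical) pretrained teacher denoiser D.
    return x + (t_to - t_from) * (x - teacher_D(x, t_from)) / t_from

def cd_loss(f_theta, f_ema, teacher_D, x0, t_n, t_np1):
    """Consistency Distillation loss for one sample (squared-error sketch).

    f_ema is f evaluated with the EMA weights theta^- and acts as the
    stop-gradient target, as in the training objective above.
    """
    z = random.gauss(0.0, 1.0)
    x_np1 = x0 + t_np1 * z                              # noise data to t_{n+1}
    x_n_hat = euler_pf_ode_step(teacher_D, x_np1, t_np1, t_n)
    return (f_theta(x_np1, t_np1) - f_ema(x_n_hat, t_n)) ** 2

def ema_update(theta, theta_minus, mu=0.999):
    # theta^- <- mu * theta^- + (1 - mu) * theta after each optimizer step
    return mu * theta_minus + (1.0 - mu) * theta

# Sanity check: a teacher that predicts the true origin plus a student that
# already outputs it yields zero loss, regardless of the sampled noise.
zero_fn = lambda x, t: 0.0
assert cd_loss(zero_fn, zero_fn, zero_fn, 0.0, 1.0, 2.0) == 0.0
```

CT drops `euler_pf_ode_step` and feeds x_0 + t_n·z (same z) directly to the target network, which is what removes the dependence on a pretrained teacher.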