The Diffusion Transformer (DiT) replaces the U-Net backbone traditionally used in diffusion models with a Transformer architecture, achieving superior image generation quality while scaling more predictably with compute.

Architecture Overview

DiT operates in the latent space of a pretrained variational autoencoder (VAE). Input images are first encoded to latent representations, then noise is added according to the diffusion schedule. The noisy latents are patchified (similar to ViT) and processed by a sequence of DiT blocks that predict the noise to be removed.
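The patchify step can be sketched in NumPy as follows (the function name and memory layout are illustrative; the official DiT implementation uses PyTorch):

```python
import numpy as np

def patchify(latent, patch_size=2):
    """Split a (C, H, W) latent into a sequence of flattened patches.

    Returns an (N, patch_size*patch_size*C) token array, where
    N = (H // patch_size) * (W // patch_size).
    """
    c, h, w = latent.shape
    p = patch_size
    # (C, H/p, p, W/p, p) -> (H/p, W/p, p, p, C) -> (N, p*p*C)
    x = latent.reshape(c, h // p, p, w // p, p)
    x = x.transpose(1, 3, 2, 4, 0)
    return x.reshape((h // p) * (w // p), p * p * c)

# A 32x32x4 latent with patch size 2 yields 16*16 = 256 tokens,
# each of dimension 2*2*4 = 16, before linear embedding.
tokens = patchify(np.zeros((4, 32, 32)), patch_size=2)
```

Each token is then linearly projected to the Transformer's hidden dimension and combined with positional embeddings, exactly as in ViT.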

Each DiT block contains multi-head self-attention and an MLP, with adaptive layer normalization (adaLN-Zero) conditioning on both the diffusion timestep and class label. The timestep and class embeddings are summed and processed through an MLP to produce per-block scale, shift, and gate parameters: the scale and shift modulate the layer normalization, while the zero-initialized gates scale each residual branch so that every block starts out as the identity function.
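The adaLN-Zero modulation can be illustrated with a minimal NumPy sketch (function names are ours; the real conditioning MLP and attention are learned modules):

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # LayerNorm without learned affine parameters, as used in adaLN:
    # the affine transform is supplied by the conditioning MLP instead.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def modulate(x, shift, scale):
    # shift/scale are regressed by an MLP from the timestep + class
    # embedding, conditioning the block without cross-attention.
    return layer_norm(x) * (1 + scale) + shift

# adaLN-Zero: the conditioning MLP's output layer is zero-initialized,
# so at initialization shift = scale = gate = 0 and each residual
# branch contributes nothing (the block is the identity function).
x = np.random.randn(256, 1152)          # one image's tokens
shift = np.zeros(1152)
scale = np.zeros(1152)
gate = np.zeros(1152)
branch = np.random.randn(256, 1152)     # stand-in for attn(modulate(x, ...))
out = x + gate * branch                 # gated residual; equals x at init
```

With zero-initialized gates, training begins from a stack of identity blocks, which the DiT authors found important for stability.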

The final layer applies a linear decoder to map from the Transformer's hidden dimension back to the patch space, producing the predicted noise (or velocity) for each patch. These patches are then unpatchified to reconstruct the full latent, which the VAE decoder converts back to pixel space.
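The unpatchify step is the inverse of patchification; a NumPy sketch under the same illustrative layout assumptions:

```python
import numpy as np

def unpatchify(tokens, patch_size=2, channels=4):
    """Inverse of patchify: (N, p*p*C) tokens -> (C, H, W) latent.

    Assumes a square token grid, i.e. N is a perfect square.
    """
    p, c = patch_size, channels
    n = tokens.shape[0]
    g = int(np.sqrt(n))                 # side length of the patch grid
    x = tokens.reshape(g, g, p, p, c)   # (H/p, W/p, p, p, C)
    x = x.transpose(4, 0, 2, 1, 3)      # (C, H/p, p, W/p, p)
    return x.reshape(c, g * p, g * p)

# 256 predicted-noise tokens of dim 16 reassemble into a 4x32x32 latent.
latent = unpatchify(np.zeros((256, 16)), patch_size=2, channels=4)
```

In the actual model, the final linear decoder first maps each hidden-dimension token down to `patch_size * patch_size * C` values before this reassembly.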

Key Innovations

  • adaLN-Zero conditioning: Instead of cross-attention for conditioning, DiT uses adaptive layer norm with zero-initialized parameters, providing effective conditioning while maintaining training stability
  • Scalable architecture: Unlike U-Nets, DiT scales smoothly with compute—larger models consistently produce better results following predictable scaling laws
  • Latent space operation: Working in the VAE's compressed latent space (typically 32×32 or 64×64) makes the Transformer's quadratic attention cost manageable
  • Gflop-performance correlation: DiT showed a strong correlation between model compute (Gflops) and generation quality (FID), similar to scaling laws in language models
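The latent-space advantage in the list above is easy to quantify: self-attention cost grows with the square of the token count, so compressing the input before patchification pays off quadratically. A back-of-the-envelope comparison:

```python
# Self-attention cost scales with sequence length squared, so the token
# count dominates. Compare operating on raw 256x256 pixels vs. the VAE's
# 32x32 latent, both with patch size 2.
def num_tokens(side, patch):
    return (side // patch) ** 2

latent_tokens = num_tokens(32, 2)     # 16 * 16 = 256 tokens
pixel_tokens = num_tokens(256, 2)     # 128 * 128 = 16384 tokens
ratio = (pixel_tokens / latent_tokens) ** 2  # relative attention cost
```

The pixel-space model would pay roughly 4096x the attention cost of the latent-space model, which is why DiT (like latent diffusion generally) operates on VAE latents.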

Common Use Cases

High-quality image generation, class-conditional image synthesis, and as the backbone for text-to-image systems. DiT's architecture directly inspired Stable Diffusion 3, DALL-E 3, and Sora (video generation).

Notable Variants & Sizes

DiT-S (33M parameters), DiT-B (130M), DiT-L (458M), DiT-XL (675M). The XL/2 variant (patch size 2) at 256×256 resolution achieved state-of-the-art FID of 2.27 on ImageNet. Subsequent models like SD3's MMDiT and Sora scale to billions of parameters.

Technical Details

DiT-XL/2 uses 28 layers, 16 attention heads, hidden dimension 1152, and MLP dimension 4608. It operates on 32×32×4 latent representations from a pretrained VAE; with patch size 2 this yields 256 tokens. Training uses AdamW with a constant learning rate of 1e-4 and no weight decay. Models are trained for 400K-7M steps, and classifier-free guidance (scale 1.5-4.0) is applied at inference time to improve sample quality.
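Classifier-free guidance combines the model's conditional and unconditional noise predictions at each sampling step; a minimal sketch (the function name is ours):

```python
import numpy as np

def cfg_combine(eps_cond, eps_uncond, scale):
    """Classifier-free guidance: extrapolate from the unconditional
    noise prediction toward the conditional one by the guidance scale."""
    return eps_uncond + scale * (eps_cond - eps_uncond)

# scale = 1.0 recovers the plain conditional prediction; larger scales
# (e.g. 1.5-4.0 for DiT) trade diversity for fidelity to the condition.
```

In practice this doubles inference cost, since both a conditional and an unconditional (null-label) forward pass are needed per step.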