StyleGAN is a style-based generative adversarial network that produces photorealistic images by borrowing ideas from neural style transfer: a mapping network and adaptive instance normalization control image generation at multiple scales of detail.
Architecture Overview
StyleGAN replaces the standard GAN generator with a style-based design. Instead of feeding the noise vector z directly to the generator, it first passes through a mapping network (8-layer MLP) to produce an intermediate latent code w in W-space. This w vector is then transformed into per-layer style vectors that modulate the generator's features via Adaptive Instance Normalization (AdaIN).
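The z→w mapping is just a stack of fully connected layers. Below is a minimal NumPy sketch; the layer count and width follow the paper, but the initialization and the `pixel norm` epsilon are illustrative, not the trained model:

```python
import numpy as np

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

def mapping_network(z, weights, biases):
    """Map noise z to an intermediate latent w through a stack of FC layers."""
    # StyleGAN normalizes z ("pixel norm") before the MLP
    h = z / np.sqrt(np.mean(z ** 2) + 1e-8)
    for W, b in zip(weights, biases):
        h = leaky_relu(W @ h + b)
    return h

rng = np.random.default_rng(0)
dim = 512
weights = [rng.standard_normal((dim, dim)) * 0.01 for _ in range(8)]  # 8 FC layers
biases = [np.zeros(dim) for _ in range(8)]
z = rng.standard_normal(dim)
w = mapping_network(z, weights, biases)  # w lives in W-space
```

Because the MLP has no spatial structure, it is free to "unwarp" the Gaussian z-space into a W-space whose directions align better with semantic factors.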
The synthesis network starts from a learned constant 4×4 feature map (not from noise). At each resolution level (4×4 → 8×8 → ... → 1024×1024), the style vector w is linearly transformed to produce per-channel scale (y_s) and bias (y_b) parameters that are applied after instance normalization of the feature maps: AdaIN(x, y) = y_s · (x − μ(x))/σ(x) + y_b. Stochastic variation is added via per-pixel noise injection at each layer.
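The AdaIN operation above is short enough to state directly in code. A minimal NumPy sketch, assuming a channel-first (C, H, W) layout and an illustrative epsilon:

```python
import numpy as np

def adain(x, y_s, y_b, eps=1e-8):
    """Adaptive Instance Normalization: normalize each channel of x over its
    spatial dimensions, then scale by y_s and shift by y_b (derived from w)."""
    mu = x.mean(axis=(1, 2), keepdims=True)     # per-channel mean
    sigma = x.std(axis=(1, 2), keepdims=True)   # per-channel std
    return y_s[:, None, None] * (x - mu) / (sigma + eps) + y_b[:, None, None]

rng = np.random.default_rng(1)
x = rng.standard_normal((512, 4, 4))   # one feature map at the 4x4 level
y_s = rng.standard_normal(512)
y_b = rng.standard_normal(512)
out = adain(x, y_s, y_b)
```

After AdaIN, each channel's mean equals y_b and its standard deviation equals |y_s|, so the style fully dictates the per-channel feature statistics at that layer.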
StyleGAN2 replaced AdaIN with weight demodulation (modulating convolution weights directly), removed progressive growing in favor of a fixed architecture with skip connections, and introduced path length regularization for a smoother latent space.
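Weight demodulation folds the style into the convolution weights instead of normalizing activations. A NumPy sketch of the scale-then-normalize step (the (out, in, kh, kw) layout and epsilon are assumptions for illustration):

```python
import numpy as np

def modulate_demodulate(w, s, eps=1e-8):
    """StyleGAN2 weight demodulation: scale conv weights per input channel by
    the style s, then rescale each output filter to (approximately) unit norm.
    w: (out_ch, in_ch, kh, kw), s: (in_ch,)."""
    w_mod = w * s[None, :, None, None]                          # modulate
    demod = 1.0 / np.sqrt((w_mod ** 2).sum(axis=(1, 2, 3)) + eps)
    return w_mod * demod[:, None, None, None]                   # demodulate

rng = np.random.default_rng(2)
w = rng.standard_normal((64, 32, 3, 3))
s = rng.standard_normal(32) + 2.0
w_prime = modulate_demodulate(w, s)
# each output filter now has (approximately) unit L2 norm
norms = np.sqrt((w_prime ** 2).sum(axis=(1, 2, 3)))
```

Because the statistics are baked into the weights rather than measured from each activation map, the operation removes the droplet artifacts that per-sample instance normalization caused in StyleGAN.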
Key Innovations
- W-space (mapping network): Transforming z→w through an MLP creates a more disentangled latent space where linear interpolations produce meaningful semantic changes
- Style injection at multiple scales: Coarse styles (4-8px: pose, face shape) and fine styles (64-1024px: colors, textures) can be controlled independently by injecting different w vectors at different layers
- Stochastic noise injection: Per-pixel noise at each layer controls stochastic details (hair strands, freckles, pores) independently of global structure
- Style mixing: Using different w vectors for different layers enables mixing attributes from different generated images (style mixing regularization also improves training)
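Style mixing from the list above reduces to choosing, per layer, which latent to feed in. A toy sketch, assuming 18 style inputs (two per resolution from 4×4 to 1024×1024, as in the 1024×1024 generator):

```python
import numpy as np

NUM_LAYERS = 18  # two style inputs per resolution, 4x4 .. 1024x1024

def style_mix(w1, w2, crossover):
    """Per-layer style list: layers before `crossover` take w1 (coarse
    attributes such as pose), the rest take w2 (fine attributes such as
    color and texture)."""
    return [w1 if i < crossover else w2 for i in range(NUM_LAYERS)]

rng = np.random.default_rng(3)
w1, w2 = rng.standard_normal((2, 512))
styles = style_mix(w1, w2, crossover=8)  # coarse+middle from w1, fine from w2
```

During training the crossover point is randomized (style mixing regularization), which discourages the network from assuming adjacent layers share the same w.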
Common Use Cases
Photorealistic face generation, face editing (age, expression, pose, accessories), art and creative generation, data augmentation for face recognition, deepfake generation (and detection research), domain-specific generation (cars, churches, animals), and GAN inversion for real image editing.
Notable Variants & Sizes
StyleGAN (2019, 26.2M parameters), StyleGAN2 (2020, improved quality, removes droplet artifacts), StyleGAN2-ADA (adaptive discriminator augmentation for limited data), StyleGAN3 (2021, alias-free, equivariant generation), StyleGAN-XL (scales to ImageNet, 1000 classes), StyleGAN-T (text-guided). EG3D extends StyleGAN to 3D-aware face generation. GAN inversion methods: e4e, ReStyle, PTI.
Technical Details
StyleGAN2 (config F):
- Mapping network: 8 FC layers, 512-dim each, LReLU
- Synthesis network: constant 4×4×512 input, progressive upsampling to 1024×1024 with [512, 512, 512, 512, 256, 128, 64, 32] channels per resolution
- Each block: modulated conv (3×3) → noise → LReLU → modulated conv → noise → LReLU, with skip connections (output skips)
- Weight demodulation: w'_ijk = w_ijk · s_i / √(Σ_{i,k} (w_ijk · s_i)² + ε), where s is the style scaling
- Training: Adam (lr=0.002, β1=0, β2=0.99), R1 gradient penalty (γ=10), non-saturating logistic loss; 8 V100 GPUs, ~1 week at 1024×1024 on FFHQ (70K faces)
- FID: 2.84 on FFHQ at 1024×1024
- Path length regularization: penalizes deviation of ||J^T_w · y||_2 from its running average, encouraging the mapping to preserve distances
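The path length penalty can be illustrated with a toy linear generator, where the Jacobian is simply the weight matrix; everything below is an illustrative sketch, not the actual training code (which obtains J^T y by backpropagation through the real generator):

```python
import numpy as np

def path_length_penalty(jacobian, y, a):
    """Penalty = (||J^T y|| - a)^2, where a tracks the running mean of the
    norm over training. y is a random image-space direction."""
    norm = np.linalg.norm(jacobian.T @ y)
    return (norm - a) ** 2

rng = np.random.default_rng(4)
A = rng.standard_normal((1024, 512))          # toy linear generator: x = A w, so J = A
y = rng.standard_normal(1024) / np.sqrt(1024)  # normalized random direction
a = np.linalg.norm(A.T @ y)                    # pretend running mean matches exactly

penalty_zero = path_length_penalty(A, y, a)        # target equals current norm
penalty_off = path_length_penalty(A, y, a + 1.0)   # norm deviates by 1
```

When the gradient norm matches its running average the penalty vanishes; deviations are penalized quadratically, pushing the generator toward a mapping whose local stretching is uniform across W-space.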