Generative Adversarial Networks (GANs) learn to generate realistic data through an adversarial game between two neural networks—a generator that creates samples and a discriminator that tries to distinguish real from generated samples—driving each other to improve.
Architecture Overview
A GAN consists of two networks trained simultaneously. The generator G takes a random noise vector z ~ N(0,I) and transforms it into a data sample G(z). The discriminator D takes an input (either real data x or generated data G(z)) and outputs a probability that the input is real.
Training alternates between two objectives. The discriminator maximizes log D(x) + log(1-D(G(z))), learning to classify real vs. fake correctly. The generator minimizes log(1-D(G(z))) (or, equivalently and with better early gradients, maximizes log D(G(z))), learning to fool the discriminator. The resulting min-max game is: min_G max_D E_x[log D(x)] + E_z[log(1-D(G(z)))].
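The alternating updates above can be sketched as one training iteration in PyTorch, using the non-saturating generator loss. The tiny MLP networks, layer sizes, and the random stand-in for real data are illustrative assumptions, not part of any specific GAN.

```python
# Minimal sketch of one GAN training iteration (non-saturating generator loss).
# Network sizes and the toy "real" batch are assumptions for illustration.
import torch
import torch.nn as nn

latent_dim, data_dim, batch = 16, 8, 32
G = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(), nn.Linear(64, data_dim))
D = nn.Sequential(nn.Linear(data_dim, 64), nn.LeakyReLU(0.2), nn.Linear(64, 1))
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4, betas=(0.5, 0.999))
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4, betas=(0.5, 0.999))
bce = nn.BCEWithLogitsLoss()  # sigmoid + log loss in one numerically stable op

x_real = torch.randn(batch, data_dim)  # stand-in for a batch of real data
z = torch.randn(batch, latent_dim)

# Discriminator step: maximize log D(x) + log(1 - D(G(z)))
d_real = D(x_real)
d_fake = D(G(z).detach())  # detach: no gradient flows into the generator here
loss_d = bce(d_real, torch.ones_like(d_real)) + bce(d_fake, torch.zeros_like(d_fake))
opt_d.zero_grad(); loss_d.backward(); opt_d.step()

# Generator step: maximize log D(G(z)) (the non-saturating form)
d_fake = D(G(z))
loss_g = bce(d_fake, torch.ones_like(d_fake))
opt_g.zero_grad(); loss_g.backward(); opt_g.step()
```

Labeling fakes as "real" (ones) in the generator step is exactly the maximize-log D(G(z)) trick: it gives strong gradients early in training, when the discriminator easily rejects generated samples.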
In a DCGAN (Deep Convolutional GAN), the generator uses transposed convolutions (fractionally-strided convolutions) to progressively upsample from a noise vector to a full image: z (100-dim) → 4×4×1024 → 8×8×512 → 16×16×256 → 32×32×128 → 64×64×3. The discriminator mirrors this with strided convolutions for downsampling.
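The upsampling pipeline above maps directly onto a stack of `ConvTranspose2d` layers. Channel widths follow the text; kernel size, padding, and bias settings are standard DCGAN-style assumptions.

```python
# Sketch of the DCGAN generator pipeline (100-dim z -> 64x64x3 image).
# Layer widths follow the text; kernel/padding/bias choices are assumptions.
import torch
import torch.nn as nn

def up_block(c_in, c_out):
    # Each transposed-conv block doubles spatial resolution (stride 2).
    return nn.Sequential(
        nn.ConvTranspose2d(c_in, c_out, kernel_size=4, stride=2, padding=1, bias=False),
        nn.BatchNorm2d(c_out),
        nn.ReLU(inplace=True),
    )

generator = nn.Sequential(
    # z (100x1x1) -> 4x4x1024 via a first transposed conv (stride 1, no padding)
    nn.ConvTranspose2d(100, 1024, kernel_size=4, stride=1, padding=0, bias=False),
    nn.BatchNorm2d(1024), nn.ReLU(inplace=True),
    up_block(1024, 512),  # 4x4   -> 8x8
    up_block(512, 256),   # 8x8   -> 16x16
    up_block(256, 128),   # 16x16 -> 32x32
    nn.ConvTranspose2d(128, 3, 4, 2, 1, bias=False),  # 32x32 -> 64x64x3
    nn.Tanh(),            # pixel values in [-1, 1]
)

z = torch.randn(2, 100, 1, 1)
img = generator(z)
print(img.shape)  # torch.Size([2, 3, 64, 64])
```

With kernel 4, stride 2, padding 1, each transposed conv exactly doubles height and width, which is why the resolutions in the pipeline progress 4 → 8 → 16 → 32 → 64.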
Key Innovations
- Adversarial training: Two networks competing creates an implicit density model without requiring explicit likelihood computation
- DCGAN architectural guidelines: Batch normalization, ReLU in generator, LeakyReLU in discriminator, no fully connected layers, and strided convolutions replaced pooling—providing stable training recipes
- Wasserstein GAN (WGAN): Replaced JS divergence with Earth Mover's distance, providing meaningful loss that correlates with sample quality and more stable training
- Conditional GANs: Adding class labels or other conditioning information to both G and D enables controlled generation
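The conditional-GAN idea in the last bullet can be sketched by feeding a learned label embedding to both networks; concatenation is one common conditioning mechanism, and all sizes here are illustrative assumptions.

```python
# Sketch of conditional GANs: both G and D receive a class label y, here via a
# learned embedding concatenated to their inputs. Sizes are assumptions.
import torch
import torch.nn as nn

n_classes, latent_dim, data_dim, embed_dim = 10, 16, 8, 4

class CondGenerator(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(n_classes, embed_dim)
        self.net = nn.Sequential(nn.Linear(latent_dim + embed_dim, 64),
                                 nn.ReLU(), nn.Linear(64, data_dim))
    def forward(self, z, y):
        # Condition on the label by concatenating its embedding to the noise.
        return self.net(torch.cat([z, self.embed(y)], dim=1))

class CondDiscriminator(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(n_classes, embed_dim)
        self.net = nn.Sequential(nn.Linear(data_dim + embed_dim, 64),
                                 nn.LeakyReLU(0.2), nn.Linear(64, 1))
    def forward(self, x, y):
        # D judges "real for this class", not just "real".
        return self.net(torch.cat([x, self.embed(y)], dim=1))

G, D = CondGenerator(), CondDiscriminator()
z = torch.randn(4, latent_dim)
y = torch.randint(0, n_classes, (4,))
score = D(G(z, y), y)  # shape (4, 1): one realness score per conditioned sample
```

At sampling time, choosing `y` selects which class the generator should produce, which is what "controlled generation" means in the bullet above.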
Common Use Cases
Image generation, image-to-image translation (pix2pix, CycleGAN), super-resolution (SRGAN, ESRGAN), data augmentation, style transfer, face generation and editing, text-to-image synthesis (pre-diffusion era), video generation, and adversarial training for robustness.
Notable Variants & Sizes
DCGAN (2015), WGAN/WGAN-GP (2017, gradient penalty), Progressive GAN (2018, grows output resolution during training), BigGAN (2019, large-scale class-conditional), StyleGAN (2019-2024, style-based generation). Pix2pix and CycleGAN for paired/unpaired image translation. GAN discriminators also appear in VQGAN, diffusion model training (SDXL-Turbo), and modern vocoders (HiFi-GAN).
Technical Details
- DCGAN generator: z (100-dim) → FC to 4×4×1024 → 4 transposed conv layers (stride 2, BN, ReLU) → 64×64×3 (tanh). Discriminator: mirrored strided convolutions (stride 2, BN, LeakyReLU 0.2) → sigmoid.
- WGAN-GP: critic (no sigmoid) with gradient penalty λ=10, n_critic=5 (train the critic 5× per generator step).
- Training: Adam (lr=0.0002, β1=0.5, β2=0.999), batch size 64-128. Training is notoriously unstable: mode collapse (generator produces limited variety), training oscillation, and vanishing gradients are common challenges.
- Evaluation metrics: FID (Fréchet Inception Distance, lower is better), IS (Inception Score), and qualitative assessment.
- BigGAN: class-conditional with 256×256 output, ~160M params, achieves FID 6.9 on ImageNet.
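The WGAN-GP penalty mentioned above (λ=10) can be sketched directly: the critic's gradient norm at random interpolates between real and fake samples is penalized toward 1. The tiny critic and toy batches are illustrative assumptions.

```python
# Sketch of the WGAN-GP gradient penalty (lambda = 10): penalize the critic's
# gradient norm at interpolated points toward 1. Critic size is an assumption.
import torch
import torch.nn as nn

critic = nn.Sequential(nn.Linear(8, 64), nn.LeakyReLU(0.2), nn.Linear(64, 1))

def gradient_penalty(critic, real, fake, lam=10.0):
    eps = torch.rand(real.size(0), 1)                    # per-sample mix weight
    interp = (eps * real + (1 - eps) * fake).requires_grad_(True)
    score = critic(interp)
    # create_graph=True so the penalty itself is differentiable for backprop
    grads, = torch.autograd.grad(score.sum(), interp, create_graph=True)
    return lam * ((grads.norm(2, dim=1) - 1) ** 2).mean()

real = torch.randn(16, 8)
fake = torch.randn(16, 8)
gp = gradient_penalty(critic, real, fake)
# Full critic loss: E[critic(fake)] - E[critic(real)] + penalty (no sigmoid)
loss_c = critic(fake).mean() - critic(real).mean() + gp
```

In a full training loop this critic loss would be minimized n_critic=5 times per generator update, per the schedule in the bullets above.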