The Swin Transformer introduces a hierarchical vision Transformer that computes attention within shifted windows, achieving linear computational complexity with respect to image size while building multi-scale feature maps.

Architecture Overview

Swin Transformer processes images through four hierarchical stages, progressively reducing spatial resolution while increasing channel dimension—similar to how CNNs like ResNet work. The input image is split into 4×4 patches, linearly embedded, and then processed through stages of Swin Transformer blocks.
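The patch split and linear embedding can be expressed as a single strided convolution. A minimal PyTorch sketch (channel dimension 128 as in Swin-Base; other variants use 96 or 192):

```python
import torch
import torch.nn as nn

# Patch embedding as a 4x4 strided convolution: each 4x4 pixel patch
# becomes one token. out_channels=128 matches Swin-Base.
patch_embed = nn.Conv2d(in_channels=3, out_channels=128,
                        kernel_size=4, stride=4)

x = torch.randn(1, 3, 224, 224)             # dummy 224x224 RGB image
tokens = patch_embed(x)                     # -> (1, 128, 56, 56)
tokens = tokens.flatten(2).transpose(1, 2)  # -> (1, 3136, 128): 56*56 tokens
```

The 56×56 token grid then flows through the first stage of Swin blocks.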

Each stage stacks Swin blocks in consecutive pairs. The first block of a pair computes self-attention within non-overlapping local windows (typically 7×7 patches). The second shifts the window partition by half the window size (⌊7/2⌋ = 3 patches), so tokens separated by a window boundary in one block can attend to each other in the next. Between stages, a patch merging layer concatenates the features of each 2×2 group of neighboring patches (4C channels) and linearly projects them to 2C, halving spatial resolution while doubling channel dimension.
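Partitioning a feature map into non-overlapping windows (and undoing it after attention) is pure tensor reshaping. A sketch modeled on the reference implementation's helper functions, here with hypothetical names and stage-1 Swin-Base dimensions:

```python
import torch

def window_partition(x, window_size=7):
    # x: (B, H, W, C) feature map -> (num_windows*B, ws, ws, C).
    B, H, W, C = x.shape
    x = x.view(B, H // window_size, window_size,
               W // window_size, window_size, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, window_size, window_size, C)

def window_reverse(windows, window_size, H, W):
    # Inverse of window_partition: back to (B, H, W, C).
    B = windows.shape[0] // ((H // window_size) * (W // window_size))
    x = windows.view(B, H // window_size, W // window_size,
                     window_size, window_size, -1)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, -1)

x = torch.randn(1, 56, 56, 128)    # stage-1 feature map (Swin-B)
windows = window_partition(x)      # (64, 7, 7, 128): 8x8 = 64 windows
restored = window_reverse(windows, 7, 56, 56)
```

Self-attention then runs independently inside each of the 64 windows of 49 tokens.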

This creates a feature pyramid with resolutions of H/4, H/8, H/16, and H/32, making Swin directly compatible with FPN-based detection and segmentation frameworks.
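The downsampling step between stages is the patch merging layer described above. A minimal sketch of how it produces the next pyramid level (layer structure follows the paper's description; a production version would also apply LayerNorm before the projection):

```python
import torch
import torch.nn as nn

class PatchMerging(nn.Module):
    # Concatenate each 2x2 neighborhood of tokens (4C channels), then
    # project to 2C: resolution halves, channel dimension doubles.
    def __init__(self, dim):
        super().__init__()
        self.reduction = nn.Linear(4 * dim, 2 * dim, bias=False)

    def forward(self, x):          # x: (B, H, W, C)
        x0 = x[:, 0::2, 0::2, :]   # top-left of each 2x2 group
        x1 = x[:, 1::2, 0::2, :]   # bottom-left
        x2 = x[:, 0::2, 1::2, :]   # top-right
        x3 = x[:, 1::2, 1::2, :]   # bottom-right
        x = torch.cat([x0, x1, x2, x3], dim=-1)  # (B, H/2, W/2, 4C)
        return self.reduction(x)                 # (B, H/2, W/2, 2C)

x = torch.randn(1, 56, 56, 128)    # H/4 features from stage 1 (Swin-B)
y = PatchMerging(128)(x)           # (1, 28, 28, 256): H/8 features
```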

Key Innovations

  • Window attention: Computing attention within local windows makes complexity linear in the number of patches instead of quadratic (the quadratic cost applies only within the fixed window size), enabling processing of high-resolution images
  • Shifted windows: Alternating between regular and shifted window partitions creates cross-window connections without additional parameters, enabling global information flow
  • Hierarchical features: Multi-scale feature maps make Swin a drop-in backbone replacement for CNNs in detection (Faster R-CNN, Mask R-CNN) and segmentation (UPerNet) frameworks
  • Efficient masking: Cyclic shifting and attention masking implement shifted windows without padding overhead
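The cyclic-shift trick amounts to a `torch.roll` before the regular window attention and an inverse roll after it. A sketch showing just the shift (the attention mask that blocks wrapped-around token pairs is omitted):

```python
import torch

# Cyclic shift moves the feature map so a *shifted* window partition
# can reuse the same regular window-attention code. Tokens that wrap
# around from the opposite edge are not spatially adjacent, which is
# why an attention mask must block those pairs in practice.
shift = 3                          # floor(7 / 2) for a 7x7 window
x = torch.randn(1, 56, 56, 128)
shifted = torch.roll(x, shifts=(-shift, -shift), dims=(1, 2))
# ... regular window attention runs on `shifted`, with masking ...
restored = torch.roll(shifted, shifts=(shift, shift), dims=(1, 2))
```

Because rolling is a permutation, no padding is needed and the reverse roll restores the original layout exactly.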

Common Use Cases

Image classification, object detection (COCO), instance segmentation, semantic segmentation (ADE20K), video recognition, and as a general-purpose vision backbone replacing CNNs in virtually any computer vision pipeline.

Notable Variants & Sizes

Swin-Tiny (28M, 96-dim), Swin-Small (50M, 96-dim), Swin-Base (88M, 128-dim), Swin-Large (197M, 192-dim). Swin-V2 scales to 3B parameters with res-post-norm, scaled cosine attention, and log-spaced continuous position bias. SwinIR adapts the architecture for image restoration.

Technical Details

Swin-Base: 4 stages with [2, 2, 18, 2] blocks, channel dims [128, 256, 512, 1024], window size 7×7, 4/8/16/32 attention heads per stage. Input 224×224 with 4×4 patches produces 56×56 initial tokens. Training uses AdamW, cosine schedule, 300 epochs on ImageNet-1K with RandAugment, Mixup, CutMix, random erasing, and stochastic depth (drop path rate 0.1-0.5).
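The per-stage shape bookkeeping implied by these numbers can be checked with a few lines of arithmetic (values taken directly from the Swin-B configuration above):

```python
# Token and window counts for Swin-B at 224x224 input.
img, patch, window = 224, 4, 7
depths = [2, 2, 18, 2]
dims = [128, 256, 512, 1024]

side = img // patch                # 56 tokens per side after embedding
stages = []
for d, c in zip(depths, dims):
    n_windows = (side // window) ** 2
    stages.append((side, c, d, n_windows))
    side //= 2                     # patch merging between stages

for i, (s, c, d, w) in enumerate(stages, start=1):
    print(f"stage {i}: {s}x{s} tokens, dim {c}, "
          f"{d} blocks, {w} windows of {window * window} tokens")
```

This reproduces the H/4 to H/32 pyramid: 56×56, 28×28, 14×14, and 7×7 token grids, the last stage being a single 7×7 window.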