The Vision Transformer (ViT) applies the Transformer architecture directly to image recognition by treating an image as a sequence of patches, achieving state-of-the-art results on image classification benchmarks.
Architecture Overview
ViT divides an input image into fixed-size patches (typically 16×16 or 14×14 pixels), linearly embeds each patch into a vector, and processes the resulting sequence with a standard Transformer encoder. A special [CLS] token is prepended to the sequence, and its final representation is used for classification.
Each patch is flattened and projected through a linear embedding layer to produce patch embeddings. Learnable 1D position embeddings are added to retain spatial information. The sequence then passes through L layers of multi-head self-attention and MLP blocks with LayerNorm and residual connections.
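The patchify-and-embed step described above can be sketched in NumPy. This is a minimal illustration, not the reference implementation: the function name `patchify`, the random initialization, and the 0.02 init scale are assumptions for the sketch.

```python
import numpy as np

def patchify(img, patch=16):
    """Split an (H, W, C) image into flattened (N, patch*patch*C) patches."""
    H, W, C = img.shape
    assert H % patch == 0 and W % patch == 0
    gh, gw = H // patch, W // patch
    x = img.reshape(gh, patch, gw, patch, C)
    # reorder so each patch's pixels are contiguous, then flatten per patch
    return x.transpose(0, 2, 1, 3, 4).reshape(gh * gw, patch * patch * C)

rng = np.random.default_rng(0)
img = rng.standard_normal((224, 224, 3))
patches = patchify(img)                            # (196, 768): 14x14 patches of 16*16*3 values
W_embed = rng.standard_normal((768, 768)) * 0.02   # linear patch-embedding projection
cls = np.zeros((1, 768))                           # learnable [CLS] token (zeros here for the sketch)
pos = rng.standard_normal((197, 768)) * 0.02       # learnable 1D position embeddings
tokens = np.concatenate([cls, patches @ W_embed]) + pos  # (197, 768) input sequence
```

For a 224×224 RGB image with 16×16 patches, each flattened patch has 16·16·3 = 768 values, which the embedding layer maps to the model's hidden dimension.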
The MLP blocks use GELU activation and expand the hidden dimension by a factor of 4. After the final Transformer layer, the [CLS] token representation is passed through a classification head (an MLP with one hidden layer during pretraining, reduced to a single linear layer for fine-tuning) to produce predictions.
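A single encoder block can be sketched as follows, using the pre-norm arrangement (LayerNorm applied before attention and before the MLP) from the ViT paper. This NumPy version omits biases, dropout, and everything training-related, and the weight dictionary layout is an assumption of the sketch.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def attention(x, Wq, Wk, Wv, Wo, heads=12):
    """Multi-head self-attention over a (N, D) token sequence."""
    N, D = x.shape
    hd = D // heads
    q = (x @ Wq).reshape(N, heads, hd).transpose(1, 0, 2)  # (heads, N, hd)
    k = (x @ Wk).reshape(N, heads, hd).transpose(1, 0, 2)
    v = (x @ Wv).reshape(N, heads, hd).transpose(1, 0, 2)
    a = q @ k.transpose(0, 2, 1) / np.sqrt(hd)             # scaled dot-product scores
    a = np.exp(a - a.max(-1, keepdims=True))
    a = a / a.sum(-1, keepdims=True)                       # softmax over keys
    out = (a @ v).transpose(1, 0, 2).reshape(N, D)         # concat heads
    return out @ Wo

def encoder_block(x, p):
    # pre-norm residual blocks: x + Attn(LN(x)), then x + MLP(LN(x))
    x = x + attention(layer_norm(x), p['Wq'], p['Wk'], p['Wv'], p['Wo'])
    h = gelu(layer_norm(x) @ p['W1'])   # expand D -> 4D
    return x + h @ p['W2']              # project 4D -> D

rng = np.random.default_rng(0)
D = 768
p = {k: rng.standard_normal((D, D)) * 0.02 for k in ('Wq', 'Wk', 'Wv', 'Wo')}
p['W1'] = rng.standard_normal((D, 4 * D)) * 0.02
p['W2'] = rng.standard_normal((4 * D, D)) * 0.02
out = encoder_block(rng.standard_normal((197, D)), p)  # shape preserved: (197, 768)
```

Stacking L copies of `encoder_block` (with separate weights) and reading off the final [CLS] row reproduces the overall forward pass described above.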
Key Innovations
- Patch tokenization: Treating image patches as tokens eliminates the need for convolutions entirely, showing that pure attention can match or exceed CNNs given sufficient pretraining data
- Scale-dependent performance: ViT demonstrated that Transformers need large-scale pretraining (ImageNet-21k or JFT-300M) to outperform CNNs, but excel when data is sufficient
- Transfer learning: Fine-tuning pretrained ViT on smaller datasets proved highly effective, with positional embedding interpolation enabling different resolutions
- Simplicity: The architecture requires minimal image-specific inductive biases, relying on data and scale instead
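The positional embedding interpolation mentioned above can be sketched as a bilinear resize of the 2D grid of patch position embeddings (the [CLS] position embedding is carried over unchanged in practice). This is an illustrative NumPy version, not a library implementation:

```python
import numpy as np

def interpolate_pos_embed(pos, old_grid, new_grid):
    """Bilinearly resize (old_grid**2, D) patch position embeddings to new_grid**2."""
    D = pos.shape[1]
    grid = pos.reshape(old_grid, old_grid, D)
    coords = np.linspace(0, old_grid - 1, new_grid)  # where each new cell samples the old grid
    out = np.empty((new_grid, new_grid, D))
    for i, y in enumerate(coords):
        y0 = int(np.floor(y)); y1 = min(y0 + 1, old_grid - 1); wy = y - y0
        for j, x in enumerate(coords):
            x0 = int(np.floor(x)); x1 = min(x0 + 1, old_grid - 1); wx = x - x0
            top = (1 - wx) * grid[y0, x0] + wx * grid[y0, x1]
            bot = (1 - wx) * grid[y1, x0] + wx * grid[y1, x1]
            out[i, j] = (1 - wy) * top + wy * bot
    return out.reshape(new_grid * new_grid, D)

rng = np.random.default_rng(0)
pos = rng.standard_normal((14 * 14, 32))        # 224px / 16px patches -> 14x14 grid
pos_384 = interpolate_pos_embed(pos, 14, 24)    # 384px / 16px patches -> 24x24 grid, (576, 32)
```

This is what lets a model pretrained at 224×224 be fine-tuned at 384×384: the patch grid grows from 14×14 to 24×24, and the learned positions are resampled onto the larger grid.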
Common Use Cases
Image classification, feature extraction for downstream vision tasks, backbone for object detection (ViTDet), semantic segmentation, and as the vision encoder in multimodal models such as CLIP and LLaVA.
Notable Variants & Sizes
ViT-Tiny (6M params), ViT-Small (22M), ViT-Base (86M, 12 layers, 768-dim), ViT-Large (307M, 24 layers, 1024-dim), ViT-Huge (632M, 32 layers, 1280-dim), and ViT-Giant (1.8B+). Patch sizes of 32, 16, and 14 pixels trade off between sequence length and fine-grained detail.
Technical Details
ViT-Base/16 uses 12 layers, 12 attention heads, hidden dimension 768, and MLP dimension 3072. Input images are typically 224×224 or 384×384 pixels, producing sequences of 196 or 576 patches (plus [CLS]). Training uses AdamW with weight decay, learning rate warmup, and strong data augmentation (RandAugment, Mixup, CutMix). Pretraining on ImageNet-21k (14M images) or JFT-300M is standard before fine-tuning on target tasks.
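The sequence lengths above follow directly from the image and patch sizes; the helper name below is illustrative.

```python
def num_patch_tokens(image_size, patch_size):
    """Patch tokens for a square image; the [CLS] token adds one more."""
    return (image_size // patch_size) ** 2

# ViT-Base/16 at the two common resolutions
print(num_patch_tokens(224, 16))  # 196  -> sequence length 197 with [CLS]
print(num_patch_tokens(384, 16))  # 576  -> sequence length 577 with [CLS]
```

The same arithmetic explains the patch-size trade-off: at 224×224, /32 yields 49 tokens, /16 yields 196, and /14 yields 256, so smaller patches mean longer sequences (and quadratically more attention compute) in exchange for finer spatial detail.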