CLIP (Contrastive Language-Image Pre-training) learns to connect images and text by training on 400 million image-text pairs from the internet, enabling zero-shot visual classification and serving as a foundational component in modern multimodal AI systems.

Architecture Overview

CLIP consists of two parallel encoders: a vision encoder (either a ViT or modified ResNet) and a text encoder (a Transformer). Both encoders independently process their respective inputs and project them into a shared embedding space of the same dimensionality.
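The dual-encoder layout can be sketched in a few lines of NumPy. The dimensions and random weights below are illustrative stand-ins, not the real trained parameters; the point is the shape flow from each backbone into one shared, unit-normalized space:

```python
import numpy as np

# Illustrative widths: vision backbone 1024, text backbone 768,
# shared embedding dimension 512 (hypothetical, for shape clarity only).
rng = np.random.default_rng(0)
W_vision = rng.standard_normal((1024, 512)) / np.sqrt(1024)  # visual projection
W_text = rng.standard_normal((768, 512)) / np.sqrt(768)      # text projection

def project(features, W):
    # Project backbone features into the shared space and L2-normalize
    emb = features @ W
    return emb / np.linalg.norm(emb, axis=-1, keepdims=True)

image_features = rng.standard_normal((4, 1024))  # e.g. ViT [CLS] outputs
text_features = rng.standard_normal((4, 768))    # e.g. [EOS] token states

image_emb = project(image_features, W_vision)    # shape (4, 512), unit norm
text_emb = project(text_features, W_text)        # shape (4, 512), unit norm
```

Because both embeddings are unit-normalized, a plain dot product between an image row and a text row is their cosine similarity.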

The vision encoder maps images to visual embeddings. For ViT variants, the image is split into patches and processed through standard Transformer layers, with the class ([CLS]) token projected to the shared space. The text encoder is a 12-layer Transformer with masked (causal) self-attention; text is tokenized with byte-pair encoding (49,152-token vocabulary) up to a context length of 77 tokens, and the [EOS] token representation is projected to the shared space.

During training, CLIP processes batches of N image-text pairs and computes an N×N matrix of cosine similarities. The training objective (InfoNCE loss) maximizes similarity for the N correct pairs and minimizes it for the N²-N incorrect pairs, effectively performing N-way classification in both directions.
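The symmetric objective above can be sketched as a toy NumPy function. This is a simplified version for illustration: real implementations use a learnable temperature and the framework's fused cross-entropy ops, but the N×N logits and the two diagonal-target classification losses are the same idea:

```python
import numpy as np

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    # L2-normalize so dot products are cosine similarities
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    # N x N cosine-similarity matrix, scaled by the temperature
    logits = image_emb @ text_emb.T / temperature
    n = logits.shape[0]

    def cross_entropy(l):
        # Softmax cross-entropy with the correct pairs on the diagonal
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(n), np.arange(n)].mean()

    # Average of image->text and text->image N-way classification losses
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```

When the paired embeddings match perfectly (diagonal similarity 1, off-diagonal 0), the loss approaches zero; for random embeddings it sits near log N.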

Key Innovations

  • Zero-shot transfer: By learning from natural language supervision, CLIP can classify images into any categories described in text without task-specific training
  • Contrastive learning at scale: Training on 400M noisy image-text pairs from the web proved more effective than smaller curated datasets
  • Natural language as supervision: Using free-form text descriptions instead of fixed label sets creates more flexible and transferable representations
  • Prompt engineering: Classification accuracy can be improved by engineering text prompts (e.g., "a photo of a {class}, a type of pet" instead of just "{class}")
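The zero-shot and prompt-engineering points above can be illustrated with a toy classifier. The `toy_encode` function here is a hypothetical stand-in for a real CLIP encoder (it hashes tokens to fixed random vectors), but the classification logic — embed one engineered prompt per class, take the argmax cosine similarity against the image embedding — mirrors how CLIP is used in practice:

```python
import numpy as np

# Hypothetical stand-in for a real CLIP text/image encoder: each token maps
# to a fixed random vector, and a string embeds to the normalized mean.
rng = np.random.default_rng(0)
_vocab_vectors = {}

def toy_encode(text, dim=64):
    vecs = []
    for tok in text.lower().split():
        if tok not in _vocab_vectors:
            _vocab_vectors[tok] = rng.standard_normal(dim)
        vecs.append(_vocab_vectors[tok])
    v = np.mean(vecs, axis=0)
    return v / np.linalg.norm(v)

def zero_shot_classify(image_emb, class_names, template="a photo of a {}"):
    # Build one text embedding per class from the engineered prompt
    text_embs = np.stack([toy_encode(template.format(c)) for c in class_names])
    sims = text_embs @ image_emb  # cosine similarities (unit vectors)
    return class_names[int(np.argmax(sims))]
```

Swapping in a richer template (e.g. "a photo of a {}, a type of pet") changes only the text side, which is exactly why prompt engineering improves accuracy without retraining.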

Common Use Cases

Zero-shot image classification, image-text retrieval, image search, serving as the text encoder for Stable Diffusion and as the vision encoder for multimodal models such as LLaVA, content moderation, and visual question answering. CLIP embeddings are widely used for similarity search and dataset filtering.
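Similarity search over precomputed embeddings reduces to a nearest-neighbor lookup under cosine similarity. A minimal sketch (the gallery of "CLIP embeddings" here is hypothetical random data; production systems would use an ANN index such as FAISS):

```python
import numpy as np

def top_k(query_emb, gallery_embs, k=3):
    # Unit-normalize so the dot product equals cosine similarity
    q = query_emb / np.linalg.norm(query_emb)
    g = gallery_embs / np.linalg.norm(gallery_embs, axis=1, keepdims=True)
    sims = g @ q
    idx = np.argsort(-sims)[:k]  # indices of the k most similar items
    return idx, sims[idx]
```

The same routine serves both image-to-image search (image embedding as the query) and text-to-image search (text embedding as the query), since both live in the shared space.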

Notable Variants & Sizes

CLIP ViT-B/32 (151M), CLIP ViT-B/16 (150M), CLIP ViT-L/14 (428M), CLIP ViT-L/14@336px. OpenCLIP provides open-source reproductions trained on LAION-2B. SigLIP replaces InfoNCE with sigmoid loss for better efficiency. EVA-CLIP scales to ViT-E (4.4B).

Technical Details

CLIP ViT-L/14: 24-layer ViT (vision) + 12-layer Transformer (text), shared embedding dim 768, trained on WIT-400M (WebImageText). Vision: 14×14 patches, 16 heads, 1024-dim. Text: 12 layers, 12 heads, 768-dim, 49,152-token BPE vocab, 77-token context. Training uses AdamW with cosine learning-rate decay, batch size 32,768, mixed precision, and a learnable temperature parameter (initialized to 0.07). ViT-L/14 trained for 32 epochs on 256 V100 GPUs over 12 days.
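One implementation detail behind the learnable temperature: rather than optimizing the temperature directly, CLIP trains a log-scale scalar (so the multiplier stays positive) and clips the resulting logit scale at 100 to keep training stable. A sketch of that parameterization:

```python
import numpy as np

# Trainable scalar in the real model; initialized so that the
# logit multiplier starts at 1 / 0.07 (about 14.3).
logit_scale = np.log(1 / 0.07)

def scaled_logits(image_emb, text_emb, logit_scale):
    # Exponentiate the log-scale and clip at 100, as in the CLIP paper,
    # then scale the cosine-similarity matrix
    scale = min(np.exp(logit_scale), 100.0)
    return scale * image_emb @ text_emb.T
```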