The Segment Anything Model (SAM) is Meta's foundation model for image segmentation. Given a prompt (a point, box, or rough mask; free-text prompting was explored in the paper but not released), it can segment any object in any image. It was trained on SA-1B, the largest segmentation dataset released to date, with over 1 billion masks.

Architecture Overview

SAM consists of three components: an image encoder, a prompt encoder, and a mask decoder. The image encoder is an MAE-pretrained Vision Transformer (ViT-Huge by default) that processes the input image once to produce image embeddings. The prompt encoder handles sparse prompts (points, boxes) through positional encodings and dense prompts (masks) through convolutions.

The mask decoder is a lightweight two-layer Transformer decoder. It takes learned output tokens (representing mask predictions), the prompt embeddings, and the image embeddings. The decoder uses bidirectional cross-attention—tokens attend to the image, and the image attends to the tokens. After decoding, the image embedding is upsampled with transposed convolutions, each output token is mapped through a small MLP, and the final mask logits are produced by a spatially point-wise product between the two.
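The final prediction step can be sketched in NumPy. The shapes below are illustrative stand-ins (the real model upsamples the 64×64×256 embedding to 256×256×32); the point is that each mask token acts as a per-pixel linear classifier over the upsampled image embedding:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative shapes, not the model's exact internals.
C, H, W = 32, 256, 256       # channels and spatial size of the upsampled embedding
num_masks = 3                # SAM returns multiple candidate masks

upsampled = rng.standard_normal((C, H, W))         # upsampled image embedding
mask_tokens = rng.standard_normal((num_masks, C))  # output tokens after their MLPs

# Spatially point-wise product: each token scores every pixel location.
mask_logits = np.einsum("nc,chw->nhw", mask_tokens, upsampled)

print(mask_logits.shape)  # (3, 256, 256): one logit map per candidate mask
```

Thresholding each logit map (e.g., at 0) then yields the binary masks.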

SAM predicts multiple valid masks (typically 3) for ambiguous prompts, along with confidence scores, handling the inherent ambiguity of segmentation (e.g., a point on a person's shirt could mean "shirt," "person," or "group of people").

Key Innovations

  • Promptable segmentation: A single model handles points, boxes, masks, and text as prompts, unifying interactive and automatic segmentation
  • SA-1B dataset: 11M images with 1.1B masks, created through a data engine that iteratively used the model to assist human annotators, then moved to fully automatic mask generation
  • Ambiguity-aware: Predicting multiple masks with scores handles inherently ambiguous prompts gracefully
  • Efficient design: The heavy image encoder runs once per image, while the lightweight prompt encoder and mask decoder run in ~50ms, enabling real-time interactive segmentation
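The encode-once design above is the key to interactivity: the expensive encoder output is cached, and only the cheap decoder runs per prompt. A minimal sketch of that pattern, with dummy stand-ins for the real encoder and decoder:

```python
import numpy as np

rng = np.random.default_rng(0)

def heavy_image_encoder(image):
    """Stand-in for the ViT image encoder (expensive, run once per image)."""
    return rng.standard_normal((256, 64, 64))  # 256-dim embedding on a 64x64 grid

def light_decode(image_embedding, point):
    """Stand-in for the prompt encoder + mask decoder (cheap, run per prompt)."""
    y, x = point
    return image_embedding[:, y % 64, x % 64].sum()  # dummy "mask score"

image = np.zeros((1024, 1024, 3))
embedding = heavy_image_encoder(image)  # computed once per image

# Many interactive click prompts reuse the cached embedding.
scores = [light_decode(embedding, p) for p in [(10, 20), (300, 40), (512, 512)]]
print(len(scores))  # 3
```

The official `SamPredictor` API follows the same split: `set_image(...)` runs the encoder once, and each `predict(...)` call only runs the prompt encoder and mask decoder.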

Common Use Cases

Interactive image segmentation, automatic instance segmentation, video object segmentation (SAM 2), medical image segmentation, satellite/aerial imagery analysis, AR/VR object selection, data annotation tools, and as a component in larger vision pipelines.

Notable Variants & Sizes

SAM ViT-Base (91M), SAM ViT-Large (308M), SAM ViT-Huge (636M). SAM 2 extends to video with memory-based tracking. MobileSAM and EfficientSAM provide faster variants for edge devices. SAM-HQ improves mask quality for fine details. Grounding-DINO + SAM enables text-prompted segmentation.

Technical Details

  • Image encoder (SAM ViT-H): MAE-pretrained ViT-Huge (32 layers, 16 heads, 1280-dim) processes 1024×1024 images as 16×16 patches (64×64 = 4096 tokens), outputting 256-dim embeddings on a 64×64 spatial grid.
  • Mask decoder: 2 Transformer layers, 8 heads, 256-dim. Learned output tokens comprise 4 mask tokens (3 for multi-mask output plus 1 used for single-mask output) and 1 IoU token that predicts each mask's quality.
  • Prompt encoder: points and boxes encoded with learned type embeddings plus a Fourier (random-feature) positional encoding of their coordinates; dense mask prompts embedded with convolutions.
  • Training: SA-1B dataset; focal loss + dice loss on masks, MSE on the IoU prediction; AdamW with linear warmup; 90K iterations.
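The Fourier positional encoding for point prompts can be sketched as random Fourier features: project normalized coordinates through a fixed Gaussian matrix, then take sines and cosines. The dimensions below match the 256-dim embedding, but the code is an illustrative sketch, not the model's exact implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Fixed random Gaussian projection: 128 frequencies -> 256-dim encoding.
num_freqs = 128
B = rng.standard_normal((2, num_freqs))

def encode_points(coords, image_size=1024):
    """Map (x, y) pixel coordinates to 256-dim Fourier positional encodings."""
    normed = coords / image_size                  # scale to [0, 1]
    normed = 2.0 * normed - 1.0                   # then to [-1, 1]
    proj = 2 * np.pi * normed @ B                 # (N, num_freqs)
    return np.concatenate([np.sin(proj), np.cos(proj)], axis=-1)  # (N, 256)

points = np.array([[512.0, 512.0], [100.0, 900.0]])
print(encode_points(points).shape)  # (2, 256)
```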