DINO (Self-Distillation with No Labels) and DINOv2 are self-supervised learning methods that train Vision Transformers to learn powerful visual features without any labeled data, producing representations with remarkable emergent properties such as unsupervised object segmentation.

Architecture Overview

DINO uses a student-teacher framework where both networks share the same architecture (typically ViT) but with different parameters. The student is trained with gradient descent, while the teacher's weights are updated as an exponential moving average (EMA) of the student's weights.
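The EMA update above can be sketched in a few lines. This is a minimal illustration with weights as plain Python lists rather than real ViT parameters; the function name `ema_update` and the toy values are assumptions for the example, while the momentum value 0.996 matches the schedule described later in this document.

```python
def ema_update(teacher_weights, student_weights, momentum=0.996):
    """Teacher <- momentum * teacher + (1 - momentum) * student.

    No gradients flow into the teacher; it only tracks the student slowly.
    """
    return [momentum * t + (1.0 - momentum) * s
            for t, s in zip(teacher_weights, student_weights)]

# Toy example: the teacher drifts only slightly toward the student each step.
teacher = [0.0, 1.0]
student = [1.0, 0.0]
teacher = ema_update(teacher, student)  # -> [0.004, 0.996]
```

In the actual training loop this update runs once per optimization step, and the momentum itself follows a cosine schedule toward 1, so the teacher changes ever more slowly as training progresses.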

Both networks receive different augmented views of the same image. The teacher sees only two global views (large crops covering roughly 40-100% of the image area), while the student sees these same global views plus several local views (small crops covering roughly 5-40%). The student is trained to match the teacher's output distribution for every teacher-student view pair except identical views, encouraging local-to-global correspondence learning.
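The pairing scheme can be made concrete with a small enumeration. This is a hypothetical sketch, assuming the standard DINO multi-crop setup (2 global crops, 6 local crops by default) and that same-view pairs are skipped; the function name `multicrop_pairs` is invented for illustration.

```python
def multicrop_pairs(n_global=2, n_local=6):
    """Enumerate (teacher_view, student_view) index pairs used in the loss.

    The teacher only processes global crops; the student processes every
    crop; pairs where both see the identical view are skipped.
    """
    pairs = []
    for t in range(n_global):                 # teacher: global crops only
        for s in range(n_global + n_local):   # student: all crops
            if s != t:                        # skip same-view pairs
                pairs.append((t, s))
    return pairs

print(len(multicrop_pairs()))  # 2 * (8 - 1) = 14 pairs per image
```

The final loss is the cross-entropy averaged over these pairs, which is what pushes local crops (small object parts) to predict the same distribution as global crops (whole scenes).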

DINOv2 scales this approach significantly: it uses a curated dataset of 142M images (LVD-142M), adds iBOT-style masked image modeling alongside the DINO objective, and trains ViT-Giant models with KoLeo regularization for uniform feature distribution.

Key Innovations

  • Self-distillation: Using the model's own EMA as a teacher eliminates the need for a separately trained teacher or labels
  • Centering and sharpening: two opposing forces that jointly prevent collapse: centering (subtracting a running mean from the teacher's outputs) stops any single dimension from dominating, while sharpening (a low temperature in the teacher's softmax) stops collapse to the uniform distribution
  • Emergent properties: DINO's self-attention maps naturally segment objects without any segmentation training, revealing learned semantic understanding
  • DINOv2's data curation: Automated pipeline to curate high-quality, diverse training data from uncurated web images, proving data quality matters as much as method
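The centering-and-sharpening mechanism from the list above can be sketched as follows. This is a minimal pure-Python illustration; the function names and toy inputs are assumptions, while the teacher temperature 0.04 and the EMA-style center update follow the details given later in this document.

```python
import math

def teacher_distribution(logits, center, tau_t=0.04):
    """Center then sharpen: softmax((logits - center) / tau_t)."""
    shifted = [(l - c) / tau_t for l, c in zip(logits, center)]
    m = max(shifted)                       # subtract max for numerical stability
    exps = [math.exp(x - m) for x in shifted]
    z = sum(exps)
    return [e / z for e in exps]

def update_center(center, batch_mean, momentum=0.9):
    """The center is an EMA of the batch-mean teacher output."""
    return [momentum * c + (1.0 - momentum) * b
            for c, b in zip(center, batch_mean)]

# Sharpening: with tau_t = 0.04 even mildly separated logits become
# a near one-hot target distribution.
probs = teacher_distribution([1.0, 0.5, 0.0], [0.0, 0.0, 0.0])
```

Centering alone would drive the outputs toward uniform, and sharpening alone would drive them toward a single dimension; applying both keeps the teacher's targets informative without collapsing.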

Common Use Cases

Feature extraction for classification, object detection, segmentation, depth estimation, image retrieval, and visual correspondence. DINOv2 features serve as a universal visual backbone, competitive with supervised pretraining on dozens of benchmarks without fine-tuning.

Notable Variants & Sizes

DINO: ViT-S/16, ViT-S/8, ViT-B/16, ViT-B/8. DINOv2: ViT-S/14 (21M), ViT-B/14 (86M), ViT-L/14 (300M), ViT-g/14 (1.1B). DINOv2 distilled models compress ViT-g knowledge into smaller architectures. DINOv2 with registers adds register tokens to address attention artifact issues.

Technical Details

DINO ViT-B/16: 12 layers, 12 heads, 768-dim embeddings, trained for 300-800 epochs on ImageNet. Teacher EMA momentum follows a cosine schedule from 0.996 to 1. The projection head is a 3-layer MLP with an L2-normalized bottleneck, projecting to 65,536 output dimensions. Temperatures: student τ_s = 0.1; teacher τ_t warmed up from 0.04 to 0.07.

DINOv2 ViT-g/14: 40 layers, 24 heads, 1536-dim embeddings, trained on LVD-142M for 625K iterations at batch size 3072, combining the DINO loss with the iBOT masked-prediction loss. Training uses AdamW with weight decay 0.04-0.4 and BF16 mixed precision.
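Putting the pieces together, the per-pair DINO loss is the cross-entropy between the sharpened, centered teacher distribution and the student distribution. Below is a minimal scalar sketch under stated assumptions: logits are plain lists, the center is passed in explicitly, and the temperatures τ_t = 0.04 and τ_s = 0.1 match the values above; a real implementation averages this over all view pairs and the batch, and backpropagates only through the student.

```python
import math

def softmax(xs, tau):
    """Temperature-scaled softmax with max-subtraction for stability."""
    m = max(xs)
    exps = [math.exp((x - m) / tau) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def dino_loss(teacher_logits, student_logits, center, tau_t=0.04, tau_s=0.1):
    """Cross-entropy H(p_teacher, p_student) = -sum p_t * log p_s."""
    p_t = softmax([t - c for t, c in zip(teacher_logits, center)], tau_t)
    p_s = softmax(student_logits, tau_s)
    return -sum(pt * math.log(ps + 1e-12) for pt, ps in zip(p_t, p_s))

# A student that agrees with the teacher incurs a much smaller loss
# than one that puts its mass on a different dimension.
aligned    = dino_loss([2.0, 0.0, 0.0], [2.0, 0.0, 0.0], [0.0, 0.0, 0.0])
misaligned = dino_loss([2.0, 0.0, 0.0], [0.0, 0.0, 2.0], [0.0, 0.0, 0.0])
```

Because the teacher distribution is treated as a fixed target (no gradient flows through it), this objective is what "distills" the slowly moving teacher into the student at every step.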