DeiT (Data-efficient Image Transformers) and MAE (Masked Autoencoders) are two breakthrough approaches to training Vision Transformers effectively—DeiT through advanced training strategies and distillation, MAE through self-supervised masked image modeling.
Architecture Overview
DeiT uses the standard ViT architecture but adds a distillation token alongside the [CLS] token. During training, the [CLS] token learns from the ground-truth label, while the distillation token learns from the predictions of a teacher model, typically a convolutional network such as RegNet. At inference, the outputs of both tokens are averaged for the final prediction.
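The dual-token scheme can be sketched as a classification head with two linear classifiers whose logits are averaged at inference. This is a minimal illustration, not DeiT's actual implementation; the class and argument names are hypothetical, and the backbone is assumed to already produce the two token embeddings.

```python
import torch
import torch.nn as nn

class DeiTHead(nn.Module):
    """Hypothetical sketch of DeiT's dual-token classification head.

    Assumes a ViT backbone that returns two token embeddings of
    dimension `dim`: the [CLS] token and the distillation token.
    """
    def __init__(self, dim: int, num_classes: int):
        super().__init__()
        self.cls_head = nn.Linear(dim, num_classes)   # supervised by ground-truth labels
        self.dist_head = nn.Linear(dim, num_classes)  # supervised by the CNN teacher

    def forward(self, cls_tok: torch.Tensor, dist_tok: torch.Tensor) -> torch.Tensor:
        logits_cls = self.cls_head(cls_tok)
        logits_dist = self.dist_head(dist_tok)
        # At inference, DeiT averages the two predictions.
        return (logits_cls + logits_dist) / 2
```

During training the two heads receive different supervision signals (see the loss sketch below the Key Innovations list); only at test time are they fused by averaging.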
MAE uses an asymmetric encoder-decoder design. During pretraining, 75% of image patches are randomly masked. Only the visible 25% of patches are processed by the full ViT encoder (saving significant computation). A lightweight decoder then takes the encoded visible patches plus mask tokens (with positional embeddings) and reconstructs the original pixel values of the masked patches.
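The per-sample random masking step described above can be sketched as follows: shuffle patch indices with per-patch random noise, keep the first 25%, and record the indices needed to restore the original order for the decoder. Function and variable names here are illustrative, not the paper's code.

```python
import torch

def random_masking(patches: torch.Tensor, mask_ratio: float = 0.75):
    """Sketch of MAE-style per-sample random masking.

    patches: (B, N, D) patch embeddings. Returns the visible patches,
    a binary mask (1 = masked) in original patch order, and the
    indices needed to unshuffle tokens inside the decoder.
    """
    B, N, D = patches.shape
    len_keep = int(N * (1 - mask_ratio))

    noise = torch.rand(B, N)                       # per-patch random scores
    ids_shuffle = torch.argsort(noise, dim=1)      # ascending: lowest scores are kept
    ids_restore = torch.argsort(ids_shuffle, dim=1)

    ids_keep = ids_shuffle[:, :len_keep]
    visible = torch.gather(patches, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))

    mask = torch.ones(B, N)
    mask[:, :len_keep] = 0                         # 0 = keep, 1 = masked (shuffled order)
    mask = torch.gather(mask, 1, ids_restore)      # back to original patch order
    return visible, mask, ids_restore
```

Only `visible` enters the encoder; the decoder later inserts learnable mask tokens at the masked positions using `ids_restore`.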
Key Innovations
- DeiT training recipe: Proved ViT could be trained effectively on ImageNet-1K alone (no JFT-300M) using extensive augmentation, regularization, and careful hyperparameter selection
- Knowledge distillation token: DeiT's dedicated distillation token learns complementary features from a CNN teacher, combining the strengths of both architectures
- MAE's high masking ratio: Masking 75% of patches forces the model to learn rich semantic representations rather than relying on local interpolation
- MAE's asymmetric design: Processing only 25% of patches with the heavy encoder makes pretraining 3-4x faster than processing all patches
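The distillation objective behind DeiT's dedicated token can be sketched as a two-term loss: cross-entropy of the [CLS] head against ground-truth labels, plus cross-entropy of the distillation head against the teacher's hard predictions (its argmax). This follows the hard-distillation variant the DeiT authors found most effective; the function name and equal weighting shown here are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def deit_hard_distillation_loss(logits_cls, logits_dist, teacher_logits, labels):
    """Sketch of DeiT-style hard-label distillation (assumed form).

    The [CLS] head is trained against ground-truth labels; the
    distillation head is trained against the teacher's hard targets.
    """
    loss_cls = F.cross_entropy(logits_cls, labels)
    teacher_labels = teacher_logits.argmax(dim=-1)            # hard teacher targets
    loss_dist = F.cross_entropy(logits_dist, teacher_labels)
    return 0.5 * loss_cls + 0.5 * loss_dist
```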
Common Use Cases
DeiT: efficient image classification, knowledge distillation research, and as pretrained backbones for detection/segmentation. MAE: self-supervised pretraining for vision tasks, learning visual representations from unlabeled data, and transfer learning to downstream tasks with limited labels.
Notable Variants & Sizes
DeiT-Tiny (5M), DeiT-Small (22M), DeiT-Base (86M), DeiT-III (improved training recipe). MAE uses ViT-Base (86M), ViT-Large (307M), and ViT-Huge (632M) as encoders with a lightweight decoder (8 blocks, 512-dim). MAE with a ViT-Huge encoder reached 86.9% top-1 accuracy on ImageNet-1K (87.8% when fine-tuned at 448×448 resolution), at the time the best result among methods using only ImageNet-1K data.
Technical Details
DeiT-Base: same architecture as ViT-Base (12 layers, 12 heads, 768-dim) plus the distillation token. Trained for 300 epochs on ImageNet-1K with AdamW, a cosine schedule, RandAugment, Mixup, CutMix, random erasing, repeated augmentation, and label smoothing. MAE encoder: a standard ViT that processes only the ~25% of visible patches. MAE decoder: 8 Transformer blocks, 512-dim, 16 heads. Pretrained for 1,600 epochs on ImageNet-1K. The reconstruction target is the normalized pixel values of each 16×16 patch, with MSE loss applied only on masked patches.
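The reconstruction objective just described can be sketched directly: normalize each target patch by its own mean and variance, compute per-patch MSE against the decoder output, and average only over masked positions. Tensor shapes and the function name are illustrative assumptions.

```python
import torch

def mae_reconstruction_loss(pred, target_patches, mask, eps: float = 1e-6):
    """Sketch of MAE's loss: MSE on normalized pixels of masked patches only.

    pred:           (B, N, P) decoder output, P = pixels per patch (e.g. 16*16*3)
    target_patches: (B, N, P) original pixel patches
    mask:           (B, N) with 1 = masked, 0 = visible
    """
    # Normalize each target patch by its own mean and variance.
    mean = target_patches.mean(dim=-1, keepdim=True)
    var = target_patches.var(dim=-1, keepdim=True)
    target = (target_patches - mean) / (var + eps) ** 0.5

    loss = ((pred - target) ** 2).mean(dim=-1)   # per-patch MSE
    return (loss * mask).sum() / mask.sum()      # average over masked patches only
```

Restricting the loss to masked patches means the model gets no credit for trivially copying visible pixels, which is part of why the high masking ratio yields useful representations.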