Flamingo, LLaVA, and InternVL represent the evolution of vision-language models that combine pretrained vision encoders with large language models, enabling conversational AI that can see and reason about images.
Architecture Overview
All three architectures follow a similar pattern: a frozen or trainable vision encoder processes images into visual features, a projection module maps these features into the language model's embedding space, and a large language model generates text responses conditioned on both visual and text inputs.
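This shared pattern can be sketched with dummy shapes. Everything here is illustrative (the dimensions roughly echo LLaVA-1.5 13B, and the function names are placeholders, not any model's real API):

```python
import numpy as np

# Illustrative dimensions only: 576 patches, 1024-d vision features,
# 5120-d LLM embeddings (roughly LLaVA-1.5 13B scale).
NUM_PATCHES, VIS_DIM, LLM_DIM = 576, 1024, 5120

def vision_encoder(image):
    """Stand-in for a ViT: returns (num_patches, vis_dim) features."""
    rng = np.random.default_rng(0)
    return rng.standard_normal((NUM_PATCHES, VIS_DIM))

def project(features, W):
    """Projection module: maps vision features into LLM embedding space."""
    return features @ W

W = np.zeros((VIS_DIM, LLM_DIM))           # learned during training
visual_tokens = project(vision_encoder(None), W)
text_tokens = np.zeros((32, LLM_DIM))      # embedded prompt tokens
llm_input = np.concatenate([visual_tokens, text_tokens], axis=0)
print(llm_input.shape)  # (608, 5120): the LLM attends over both modalities
```

The three models differ mainly in what sits at each stage and in which stages are trained.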
Flamingo uses a frozen CLIP-like vision encoder and a frozen Chinchilla language model, connected by Perceiver Resampler modules (reducing variable-length visual features to a fixed number of tokens) and gated cross-attention layers inserted into the LLM. Only these connecting modules are trained.
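The gated cross-attention idea can be sketched as follows: the gate is tanh(alpha) with alpha initialized to zero, so the frozen LM's behavior is untouched at the start of training. The attention here is a bare-bones stand-in, not DeepMind's implementation:

```python
import numpy as np

def gated_cross_attention(text_h, visual_tokens, alpha):
    """Sketch of a gated cross-attention block: text queries attend to
    visual tokens, and a tanh(alpha) gate scales the residual update."""
    scores = text_h @ visual_tokens.T / np.sqrt(text_h.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    attn_out = weights @ visual_tokens
    return text_h + np.tanh(alpha) * attn_out

rng = np.random.default_rng(0)
h = rng.standard_normal((4, 8))    # text hidden states
v = rng.standard_normal((64, 8))   # 64 resampled visual tokens
# With alpha = 0 the gate closes: the layer is an identity at init.
assert np.allclose(gated_cross_attention(h, v, alpha=0.0), h)
```

This zero-init gating is what lets new visual pathways be spliced into a frozen LLM without destabilizing it.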
LLaVA takes a simpler approach: a CLIP ViT-L/14 vision encoder produces patch features that are projected through a linear layer (original LLaVA) or a 2-layer MLP (LLaVA-1.5) directly into the LLM's input space, interleaved with text tokens. The vision encoder is kept frozen; training updates the projector alone at first, then the projector and LLM together.
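A minimal sketch of the MLP projector, with toy dimensions (LLaVA-1.5 actually maps 1024-d CLIP features to the LLM's hidden size):

```python
import numpy as np

def gelu(x):
    """tanh approximation of GELU."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def mlp_projector(vis_feats, W1, b1, W2, b2):
    """2-layer MLP projector (LLaVA-1.5 style); original LLaVA used a single
    linear layer. Weights here are random, purely to show the shapes."""
    return gelu(vis_feats @ W1 + b1) @ W2 + b2

VIS, LLM = 8, 16   # toy sizes; the real model is roughly 1024 -> 5120
rng = np.random.default_rng(0)
feats = rng.standard_normal((576, VIS))          # one feature per image patch
W1, b1 = rng.standard_normal((VIS, LLM)), np.zeros(LLM)
W2, b2 = rng.standard_normal((LLM, LLM)), np.zeros(LLM)
tokens = mlp_projector(feats, W1, b1, W2, b2)
print(tokens.shape)  # (576, 16): ready to interleave with text embeddings
```

Unlike Flamingo's resampler, this keeps one token per patch, so token count grows with image resolution.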
InternVL scales up with a large custom vision encoder (InternViT, up to 6B parameters) and dynamic resolution handling, processing images at their native aspect ratios by splitting them into tiles.
Key Innovations
- Flamingo's Perceiver Resampler: Compresses arbitrary-length visual features to 64 fixed tokens using learned queries and cross-attention, enabling efficient processing of multiple images
- Flamingo's few-shot learning: Designed for in-context learning with interleaved image-text sequences, achieving strong results with just a few examples
- LLaVA's simplicity: Showed that a simple linear projection from vision to language space is sufficient, with visual instruction tuning on GPT-4-generated data
- InternVL's dynamic resolution: Processes images at varying resolutions by splitting into 448×448 tiles, preserving fine details in high-resolution images
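The tiling idea above can be approximated in a few lines. This is a hedged sketch, not InternVL's actual preprocessing (which selects the closest supported aspect ratio from a predefined set and appends a downscaled thumbnail tile); this version simply rounds the image up to tile multiples and caps the tile budget:

```python
import math

def num_tiles(width, height, tile=448, max_tiles=12):
    """Approximate dynamic tiling: round each image dimension up to a
    multiple of the tile size, then shrink the grid to fit the budget."""
    cols = math.ceil(width / tile)
    rows = math.ceil(height / tile)
    while cols * rows > max_tiles:
        # trim the longer side first to stay within the token budget
        if cols >= rows:
            cols -= 1
        else:
            rows -= 1
    return cols, rows

print(num_tiles(1344, 896))  # → (3, 2): six 448x448 tiles
```

Each tile is encoded independently, so a high-resolution document page gets many more visual tokens than a thumbnail, preserving small text and fine structure.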
Common Use Cases
Visual question answering, image captioning, visual reasoning, document understanding, chart/diagram interpretation, OCR, medical image analysis, GUI navigation, and general multimodal chat assistants.
Notable Variants & Sizes
Flamingo: 3B, 9B, 80B. LLaVA: 7B, 13B (LLaVA-1.5 with Vicuna), LLaVA-NeXT (improved resolution handling). InternVL: roughly 1B to 78B across the InternVL2/2.5 series. Other notable models: Qwen-VL, CogVLM, Idefics, and Phi-3 Vision.
Technical Details
LLaVA-1.5 13B: CLIP ViT-L/14@336px vision tower (~0.3B parameters) → 2-layer MLP projector → Vicuna-13B. Produces 576 visual tokens (24×24 patch grid). Two-stage training: (1) pretrain the projector on 558K image-text pairs, (2) fine-tune the projector and LLM on 665K visual instruction samples, with the vision encoder frozen throughout. InternVL 2.5: InternViT (0.3B or 6B) → MLP projector → InternLM2.5 or Qwen2.5 LLMs. Dynamic resolution from 448 to 4032 pixels. Training uses DeepSpeed ZeRO-3, BF16, and a cosine learning rate schedule.
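The 576-token figure follows directly from the patch geometry:

```python
# Visual token count for CLIP ViT-L/14 at 336px input (LLaVA-1.5):
image_size, patch_size = 336, 14
grid = image_size // patch_size     # 24 patches per side
num_visual_tokens = grid * grid     # one token per patch
print(num_visual_tokens)  # 576
```

Those 576 tokens occupy context window slots just like text tokens, which is why resolution increases (more patches) directly raise the LLM's sequence length and compute cost.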