ImageGPT: Architecture & How It Works
ImageGPT (iGPT) applies the autoregressive GPT architecture directly to image generation by treating images as sequences of pixels or color clusters, demonstrating that language-model approaches transfer to vision: pretraining on next-pixel prediction learns image representations whose linear probes are competitive with contrastive methods on classification benchmarks.
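The preprocessing step can be sketched as follows. This is a minimal illustration, not iGPT's actual code: the paper quantizes RGB pixels against 512 k-means color centroids, while this toy uses a hypothetical 8-color palette and a tiny 2×2 image. The idea is the same: map each pixel to its nearest palette index, then flatten in raster order into a 1-D token sequence a GPT can model autoregressively.

```python
import numpy as np

def quantize_to_palette(image, palette):
    """Map each RGB pixel to the index of its nearest palette color."""
    # image: (H, W, 3), palette: (K, 3)
    dists = np.linalg.norm(
        image[:, :, None, :] - palette[None, None, :, :], axis=-1
    )
    return dists.argmin(axis=-1)  # (H, W) array of palette indices

def image_to_sequence(image, palette):
    """Flatten a quantized image into a raster-order token sequence."""
    return quantize_to_palette(image, palette).reshape(-1)

# Toy 8-color palette standing in for the paper's 512 k-means centroids.
palette = np.array([
    [0, 0, 0], [255, 255, 255], [255, 0, 0], [0, 255, 0],
    [0, 0, 255], [255, 255, 0], [0, 255, 255], [255, 0, 255],
], dtype=float)

img = np.zeros((2, 2, 3), dtype=float)  # black image
img[0, 1] = [250, 5, 5]                 # near-red pixel -> index 2
img[1, 0] = [255, 255, 255]             # white pixel    -> index 1

tokens = image_to_sequence(img, palette)
print(tokens.tolist())  # [0, 2, 1, 0]
```

During training, a GPT then maximizes the likelihood of each token given the preceding ones; at generation time, tokens are sampled one at a time and the completed sequence is un-flattened back into an image.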