Research & Writing

Blog

Dispatches from the edge of chaos — on nonlinear dynamics, AI, emergence, and the mathematics of complex systems.

Transformer

ImageGPT: Architecture & How It Works

ImageGPT (iGPT) applies the autoregressive GPT architecture directly to image generation by treating images as sequences of pixels or color clusters, demonstrating that language-model pretraining transfers to the visual domain.

2 min read
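The core trick — quantize pixels into a small palette and flatten the image in raster order — fits in a few lines. A toy numpy sketch (real iGPT clusters RGB values with k-means into 512 colors; the function name and palette size here are illustrative):

```python
import numpy as np

def image_to_sequence(img, n_clusters=8):
    """Quantize pixel intensities into a small palette and flatten the
    image in raster order, the way iGPT turns an image into a token
    sequence that a GPT can model autoregressively, one pixel at a time."""
    # Map intensities 0..255 onto palette indices 0..n_clusters-1.
    tokens = (img.astype(float) / 256 * n_clusters).astype(int)
    # Raster-scan flatten: the 2D grid becomes one long token sequence.
    return tokens.flatten()

img = np.arange(16).reshape(4, 4) * 16   # toy 4x4 grayscale image
seq = image_to_sequence(img)
print(seq.shape)  # (16,)
```

From here, training is exactly language modeling: predict token *t* from tokens *0..t-1*.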
Transformer

Mixture of Experts (MoE): Architecture & How It Works

Mixture of Experts (MoE) is an architecture paradigm that scales model capacity dramatically while keeping computational cost manageable by routing each input to only a subset of specialized expert sub-networks.

2 min read
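The routing idea in a minimal numpy sketch — a learned gate scores all experts, but only the top-k actually run (all names here are illustrative, not from any particular MoE library):

```python
import numpy as np

def moe_forward(x, gate_w, experts, k=2):
    """Route one token through the top-k of n experts.

    x: (d,) token embedding; gate_w: (d, n) router weights;
    experts: list of n callables mapping (d,) -> (d,).
    """
    logits = x @ gate_w                    # (n,) router scores
    top = np.argsort(logits)[-k:]          # indices of the k best experts
    weights = np.exp(logits[top])
    weights /= weights.sum()               # softmax over selected experts only
    # Only k experts execute; the rest are skipped entirely, which is
    # how MoE keeps compute sub-linear in total parameter count.
    return sum(w * experts[i](x) for w, i in zip(weights, top))

rng = np.random.default_rng(0)
d, n = 8, 4
experts = [(lambda W: (lambda t: t @ W))(rng.normal(size=(d, d)))
           for _ in range(n)]
y = moe_forward(rng.normal(size=d), rng.normal(size=(d, n)), experts, k=2)
print(y.shape)  # (8,)
```

With k=2 of 4 experts, only half the expert parameters are touched per token — the gap widens as n grows.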
Transformer

AlphaFold2 Evoformer: Architecture & How It Works

The Evoformer is the core neural network module of AlphaFold2, DeepMind's breakthrough protein structure prediction system. It processes evolutionary and pairwise residue information through a stack of attention-based blocks that exchange information between the two representations.

2 min read
Transformer

Decision Transformer: Architecture & How It Works

The Decision Transformer reframes reinforcement learning as a sequence modeling problem, using a GPT-style Transformer to generate actions by conditioning on desired return-to-go, past states, and past actions.

2 min read
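The input layout is the whole trick: interleave (return-to-go, state, action) triples into one flat sequence so that next-action prediction is conditioned on the return you still want. A toy sketch (names illustrative):

```python
def build_dt_tokens(returns_to_go, states, actions):
    """Interleave (R, s, a) triples into one flat sequence, the input
    layout Decision Transformer feeds to a GPT-style model."""
    tokens = []
    for R, s, a in zip(returns_to_go, states, actions):
        tokens += [("R", R), ("s", s), ("a", a)]
    return tokens

# Toy 3-step trajectory: return-to-go at step t is the reward still to come.
rewards = [1.0, 0.0, 2.0]
rtg = [sum(rewards[t:]) for t in range(len(rewards))]   # [3.0, 2.0, 2.0]
tokens = build_dt_tokens(rtg, states=["s0", "s1", "s2"],
                         actions=["a0", "a1", "a2"])
print(tokens[:3])  # [('R', 3.0), ('s', 's0'), ('a', 'a0')]
```

At test time you simply set the first return-to-go token to the score you want, and the model generates actions consistent with achieving it.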
Transformer

GPT-4 & Gemini: Architecture & How They Work

GPT-4 and Gemini represent the frontier of large language models—massive multimodal systems capable of processing text, images, audio, and video while demonstrating near-human performance across professional and academic benchmarks.

2 min read
Transformer

ImageBind: Architecture & How It Works

ImageBind is Meta's multimodal AI model that learns a joint embedding space across six modalities—images, text, audio, depth, thermal, and IMU data—using only image-paired data for training.

2 min read
Transformer

Flamingo, LLaVA & InternVL: Architecture & How They Work

Flamingo, LLaVA, and InternVL represent the evolution of vision-language models that combine pretrained vision encoders with large language models, enabling conversational AI that can see and reason about images.

2 min read
Transformer

Whisper: Architecture & How It Works

Whisper is OpenAI's general-purpose speech recognition model that approaches human-level robustness and accuracy by training on 680,000 hours of weakly supervised audio data collected from the web.

2 min read
Transformer

CLIP: Architecture & How It Works

CLIP (Contrastive Language-Image Pre-training) learns to connect images and text by training on 400 million image-text pairs from the internet, enabling zero-shot visual classification and serving as a foundation for modern multimodal models.

2 min read
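The training objective is a symmetric contrastive loss: normalize both embeddings, compute all pairwise similarities in the batch, and make each image pick out its own caption (and vice versa). A minimal numpy sketch in the spirit of CLIP — the function name and toy inputs are illustrative:

```python
import numpy as np

def clip_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of matched image-text pairs.
    Each row i of img_emb is paired with row i of txt_emb."""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature        # (B, B) similarity matrix
    labels = np.arange(len(img))              # true pairs sit on the diagonal

    def xent(l):
        l = l - l.max(axis=1, keepdims=True)  # stabilize the softmax
        p = np.exp(l) / np.exp(l).sum(axis=1, keepdims=True)
        return -np.log(p[labels, labels]).mean()

    # Average the image->text and text->image cross-entropies.
    return (xent(logits) + xent(logits.T)) / 2

e = np.eye(3)                 # 3 perfectly matched unit embeddings
print(clip_loss(e, e))        # near zero: every image picks its own caption
```

Swapping the caption rows (so every pair is mismatched) drives the loss up, which is exactly the signal that pulls matched pairs together in the joint space.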