xLSTM: Architecture & How It Works
xLSTM (Extended Long Short-Term Memory) modernizes the classic LSTM architecture with exponential gating and novel memory structures, challenging Transformers and SSMs on language modeling while retaining the linear-time, constant-memory inference of recurrent networks.
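The exponential gating mentioned above is the core mechanical change in xLSTM's sLSTM cell: input and forget gates use `exp` instead of `sigmoid`, paired with a normalizer state and a log-space stabilizer to keep the exponentials numerically safe. The sketch below shows a single-unit version under assumed variable names and shapes; it is an illustration of the gating scheme, not the reference implementation.

```python
import numpy as np

def slstm_step(x, state, w, r, b):
    """One step of a single-unit sLSTM-style cell with exponential gating.

    Minimal sketch (names and shapes are illustrative):
    x       scalar input
    state   (h, c, n, m) = hidden, cell, normalizer, stabilizer states
    w, r, b length-4 arrays of input / recurrent / bias weights,
            ordered (input gate, forget gate, output gate, cell input)
    """
    h, c, n, m = state
    i_t, f_t, o_t, z_t = w * x + r * h + b  # four pre-activations

    # Log-space stabilizer: subtracting m_new keeps exp() in a safe range
    # without changing the ratio c/n that the output depends on.
    m_new = max(f_t + m, i_t)
    i = np.exp(i_t - m_new)          # exponential input gate
    f = np.exp(f_t + m - m_new)      # exponential forget gate
    o = 1.0 / (1.0 + np.exp(-o_t))   # sigmoid output gate (unchanged)
    z = np.tanh(z_t)                 # candidate cell input

    c_new = f * c + i * z            # cell state update
    n_new = f * n + i                # normalizer accumulates gate mass
    h_new = o * (c_new / n_new)      # normalized hidden output
    return h_new, c_new, n_new, m_new
```

Because the normalizer `n` accumulates the same positive gate weights that scale the cell state, `c/n` stays a weighted average of the bounded candidates `z`, so the output remains bounded even though the gates themselves are unbounded exponentials.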