State Space Models (SSMs) including S4 and Mamba offer an alternative to Transformers for sequence modeling, achieving linear-time complexity during training and constant-time per-step inference, while matching or exceeding Transformer performance on long-sequence tasks.
Architecture Overview
SSMs are based on continuous-time linear dynamical systems: x'(t) = Ax(t) + Bu(t), y(t) = Cx(t) + Du(t), where x is the hidden state, u is the input, and y is the output. These continuous equations are discretized for digital computation using methods such as zero-order hold (ZOH), producing recurrent equations: x_k = Ā·x_{k-1} + B̄·u_k, y_k = C·x_k (the D·u_k term acts as a skip connection and is typically handled separately).
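The discretization and recurrence above can be sketched in a few lines of NumPy. This is an illustrative toy (the diagonal A, B, C values below are ours, not from any trained model), using the standard ZOH formulas for a diagonal state matrix.

```python
import numpy as np

def zoh_discretize(A, B, delta):
    """ZOH discretization for a diagonal state matrix A (shape (N,)):
    A_bar = exp(Δ·A),  B_bar = (A_bar − 1)/A · B."""
    A_bar = np.exp(delta * A)
    B_bar = (A_bar - 1.0) / A * B
    return A_bar, B_bar

def ssm_recurrence(A_bar, B_bar, C, u):
    """Run x_k = A_bar·x_{k-1} + B_bar·u_k,  y_k = C·x_k over a sequence."""
    x = np.zeros_like(A_bar)
    ys = []
    for u_k in u:
        x = A_bar * x + B_bar * u_k
        ys.append(C @ x)
    return np.array(ys)

# Toy parameters: N = 4 state dims, stable (negative real) diagonal A.
A = -np.array([1.0, 2.0, 3.0, 4.0])
B = np.ones(4)
C = np.ones(4) / 4
A_bar, B_bar = zoh_discretize(A, B, delta=0.1)
y = ssm_recurrence(A_bar, B_bar, C, np.sin(np.linspace(0, 3, 16)))
```

Note that a stable continuous A (negative real parts) discretizes to |Ā| < 1, so the recurrence cannot blow up regardless of sequence length.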
S4 (Structured State Space sequence model) introduced a specific parameterization of the A matrix using HiPPO initialization—a mathematical framework for optimal polynomial approximation of continuous signals. S4 showed that proper initialization of the state matrix is crucial for learning long-range dependencies.
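For concreteness, one common form of the HiPPO matrix (the LegS variant, built from Legendre polynomial approximation) can be constructed directly; this is a sketch of the initialization idea, not S4's full parameterization:

```python
import numpy as np

def hippo_legs(N):
    """HiPPO-LegS state matrix: A_nk = -sqrt(2n+1)·sqrt(2k+1) for n > k,
    -(n+1) on the diagonal, and 0 above the diagonal."""
    A = np.zeros((N, N))
    for n in range(N):
        for k in range(N):
            if n > k:
                A[n, k] = -np.sqrt(2 * n + 1) * np.sqrt(2 * k + 1)
            elif n == k:
                A[n, k] = -(n + 1)
    return A

A = hippo_legs(4)  # lower-triangular, negative diagonal
```

The structure (lower-triangular with a negative diagonal) is what lets the state compress the entire input history onto a polynomial basis rather than forgetting it exponentially, which is why this initialization helps with long-range dependencies.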
Mamba builds on S4 by introducing selective state spaces: instead of fixed A, B, C parameters, Mamba makes B, C, and the discretization step Δ input-dependent (computed from the input via linear projections). This selectivity allows the model to dynamically filter relevant information from the input sequence.
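A simplified single-channel sketch of this selectivity follows. The projection weights and sizes are hypothetical (ours, not Mamba's), and the real kernel scans all channels in parallel; the point is only that Δ, B, and C are recomputed from the input at every step before the ZOH discretization and state update.

```python
import numpy as np

rng = np.random.default_rng(0)

D_model, N = 8, 4                         # toy channel and state sizes
W_B = rng.normal(size=(N, D_model)) * 0.1  # hypothetical projection for B
W_C = rng.normal(size=(N, D_model)) * 0.1  # hypothetical projection for C
w_delta = rng.normal(size=(D_model,)) * 0.1  # hypothetical projection for Δ
A = -np.arange(1, N + 1, dtype=float)      # fixed diagonal A (per the text)

def softplus(z):
    return np.log1p(np.exp(z))

def selective_scan(u, channel=0):
    """u: (L, D_model). Scans one channel; Δ, B_k, C_k are recomputed
    from the full input vector at each step (input-dependent)."""
    x = np.zeros(N)
    ys = []
    for u_k in u:
        delta = softplus(w_delta @ u_k)    # input-dependent step size Δ
        B_k = W_B @ u_k                    # input-dependent B
        C_k = W_C @ u_k                    # input-dependent C
        A_bar = np.exp(delta * A)          # ZOH discretization, diagonal A
        B_bar = (A_bar - 1.0) / A * B_k
        x = A_bar * x + B_bar * u_k[channel]
        ys.append(float(C_k @ x))
    return np.array(ys)

y = selective_scan(rng.normal(size=(10, D_model)))
```

Intuitively, a small Δ keeps Ā close to 1 (the state persists and the current token is mostly ignored), while a large Δ shrinks Ā toward 0 (the state resets around the current token) — this is the "dynamic filtering" the text describes.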
Key Innovations
- Dual computation mode: with fixed parameters, an SSM can be computed either as a recurrence (O(1) per step, for inference) or as a convolution (parallelizable, for training), getting the best of both worlds
- HiPPO initialization: S4's structured A matrix based on polynomial approximation theory enables learning dependencies across thousands of timesteps
- Mamba's selectivity: Input-dependent parameters allow content-aware reasoning, closing the gap with attention-based models on language tasks
- Hardware-efficient implementation: because selectivity rules out the convolutional mode, Mamba uses a custom CUDA kernel with a parallel scan algorithm, computing in fast SRAM and avoiding materializing the expanded state in GPU HBM
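The dual computation mode in the first bullet can be verified numerically: for fixed parameters, unrolling the recurrence gives y_k = Σ_j (C·Ā^j·B̄)·u_{k-j}, i.e. a causal convolution with kernel K_j = C·Ā^j·B̄. A toy check (diagonal parameters are ours):

```python
import numpy as np

N, L = 4, 32
A_bar = np.exp(-0.1 * np.arange(1, N + 1))  # stable diagonal Ā
B_bar = np.ones(N)
C = np.ones(N) / N
u = np.sin(np.linspace(0, 6, L))

# Recurrent mode: O(1) state update per step (used at inference).
x = np.zeros(N)
y_rec = []
for u_k in u:
    x = A_bar * x + B_bar * u_k
    y_rec.append(C @ x)
y_rec = np.array(y_rec)

# Convolutional mode: precompute the kernel K_j = C·Ā^j·B̄, then one
# causal convolution over the whole sequence (parallelizable in training).
K = np.array([C @ (A_bar**j * B_bar) for j in range(L)])
y_conv = np.convolve(u, K)[:L]

assert np.allclose(y_rec, y_conv)  # the two modes agree
```

This equivalence is exactly what breaks once B, C, and Δ depend on the input — the kernel K would differ at every position — which is why Mamba needs the scan-based kernel from the last bullet instead.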
Common Use Cases
Long-sequence modeling (audio, DNA, time series), language modeling (as Transformer alternative), speech recognition, genomics, video understanding, and any task requiring efficient processing of very long sequences (100K+ tokens).
Notable Variants & Sizes
S4 (original), S4D (diagonal simplification), S5 (parallel scan), H3 (Hungry Hungry Hippos), Hyena (attention-free), Mamba (130M to 2.8B), Mamba-2 (improved with structured state space duality), Jamba (Mamba + Transformer hybrid by AI21). Vision variants: VMamba, Vim (Vision Mamba), PlainMamba.
Technical Details
Mamba 2.8B: 64 layers, model dim 2560, state dim 16, conv kernel size 4, expand factor 2 (inner dim 5120). Each Mamba block: linear projection → 1D conv → SiLU → selective SSM → output projection, with a parallel gating branch. State expansion N=16, so each channel maintains a 16-dim hidden state. Training: 300B tokens on the Pile, AdamW, cosine schedule, batch size 0.5M tokens. Mamba 2.8B matches Transformer++ 2.8B on language benchmarks while being up to 5x faster at inference on long sequences (3-5x higher generation throughput).
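The block layout described above can be sketched at the shape level. This is a simplified stand-in with toy sizes and random weights (the selective SSM step is replaced by an identity, marked below), intended only to show how the projections, depthwise causal conv, and gating branch fit together:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes; Mamba 2.8B uses d_model=2560, expand=2 -> d_inner=5120,
# conv width 4, state dim N=16.
d_model, expand, d_conv, L = 8, 2, 4, 16
d_inner = expand * d_model

def silu(z):
    return z / (1.0 + np.exp(-z))

W_in = rng.normal(size=(2 * d_inner, d_model)) * 0.1  # SSM branch + gate branch
conv_w = rng.normal(size=(d_inner, d_conv)) * 0.1     # depthwise causal conv
W_out = rng.normal(size=(d_model, d_inner)) * 0.1

def mamba_block(u):
    """u: (L, d_model) -> (L, d_model). Shape-level sketch of one block."""
    xz = u @ W_in.T                      # input projection, (L, 2*d_inner)
    x, z = np.split(xz, 2, axis=-1)      # SSM branch x, gating branch z
    # Depthwise causal 1-D conv along the sequence axis (left-padded)
    x_pad = np.pad(x, ((d_conv - 1, 0), (0, 0)))
    x = np.stack([np.convolve(x_pad[:, c], conv_w[c][::-1], mode="valid")
                  for c in range(d_inner)], axis=-1)
    x = silu(x)
    y = x                                # stand-in: real block runs the
                                         # selective SSM scan here
    y = y * silu(z)                      # parallel gating branch
    return y @ W_out.T                   # output projection, (L, d_model)

out = mamba_block(rng.normal(size=(L, d_model)))
```

In the real model each of the d_inner channels carries its own N=16 hidden state through the selective scan, so the per-token state is d_inner × N values — the constant-size "KV cache" equivalent that makes long-sequence generation cheap.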