xLSTM (Extended Long Short-Term Memory) modernizes the classic LSTM architecture with exponential gating and novel memory structures, challenging Transformers and state-space models (SSMs) on language modeling while retaining the LSTM's strengths in sequential processing.

Architecture Overview

xLSTM introduces two new LSTM variants that are composed into a residual network: sLSTM (scalar LSTM) and mLSTM (matrix LSTM). An xLSTM architecture alternates between sLSTM and mLSTM blocks, with each block wrapped in pre-LayerNorm and residual connections, similar to modern Transformer designs.
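The residual-block pattern described above can be sketched in a few lines. This is a minimal NumPy illustration of the wiring only: `layer_norm` omits learned scale/shift parameters, and the placeholder cells stand in for real sLSTM/mLSTM cells.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Pre-norm: normalize over the feature dimension (no learned affine here).
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def residual_block(x, cell):
    # Pre-LayerNorm -> cell -> residual add, as in modern Transformer blocks.
    return x + cell(layer_norm(x))

def xlstm_stack(x, cells):
    # Alternate sLSTM/mLSTM blocks; 'cells' is any list of callables here.
    for cell in cells:
        x = residual_block(x, cell)
    return x

# Toy usage: identity-like "cells" just to show the residual wiring.
x = np.random.randn(4, 16)                    # (seq_len, dim)
out = xlstm_stack(x, [lambda h: 0.5 * h] * 3)
```

Because each block is residual, the stack's output keeps the input shape, which is what lets sLSTM and mLSTM blocks be freely interleaved.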

sLSTM extends the classic LSTM with exponential gating: the input and forget gates can use exponential activation functions instead of sigmoid, dramatically widening the range of gate values; a log-domain stabilization technique prevents numerical overflow. sLSTM also supports multiple heads with several memory cells per head, mixing memory via recurrent connections within each head but not across heads.
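The stabilization trick can be shown concretely. A minimal sketch, assuming raw gate pre-activations as inputs: naively exponentiating would overflow for large values, so a running maximum is tracked in the log domain and subtracted before exponentiation.

```python
import numpy as np

def stabilized_gates(i_pre, f_pre, m_prev):
    # Naively i = exp(i_pre) and f = exp(f_pre) would overflow for large
    # pre-activations, so track the running max in the log domain (the
    # stabilizer state m) and subtract it before exponentiating.
    m = np.maximum(f_pre + m_prev, i_pre)   # new stabilizer state
    i = np.exp(i_pre - m)                   # stabilized input gate
    f = np.exp(f_pre + m_prev - m)          # stabilized forget gate
    return i, f, m

# Even extreme pre-activations stay finite after stabilization.
i, f, m = stabilized_gates(np.array([1000.0]), np.array([999.0]), m_prev=0.0)
```

The subtraction cancels in the cell's output normalization, so the stabilized recurrence computes the same hidden states as the unstabilized one, just without overflow.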

mLSTM replaces the scalar memory cell with a matrix memory C ∈ R^{d×d}, updated via an outer-product rule: C_t = f_t·C_{t-1} + i_t·(v_t·k_t^T), alongside a normalizer state n_t = f_t·n_{t-1} + i_t·k_t. Retrieval is a matrix-vector product: h_t = o_t ⊙ (C_t·q_t) / max(|n_t^T·q_t|, 1), analogous to attention's query-key-value mechanism but with a fixed-size recurrent state.
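One recurrence step of this update and retrieval can be written directly from the equations. A minimal sketch, assuming already-stabilized scalar gates i, f and a sigmoid output gate o; real implementations run this per head and in a parallelized form.

```python
import numpy as np

def mlstm_step(C, n, q, k, v, i, f, o):
    # One mLSTM step: matrix memory update, normalizer update, retrieval.
    C = f * C + i * np.outer(v, k)             # C_t = f*C_{t-1} + i*(v k^T)
    n = f * n + i * k                          # n_t = f*n_{t-1} + i*k
    h_tilde = C @ q / max(abs(n @ q), 1.0)     # normalized readout
    return o * h_tilde, C, n

d = 8
C, n = np.zeros((d, d)), np.zeros(d)
q = k = v = np.ones(d) / np.sqrt(d)
h, C, n = mlstm_step(C, n, q, k, v, i=1.0, f=0.9, o=0.5)
```

Note the constant-memory property: C and n have fixed size d×d and d, no matter how many steps are processed.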

Key Innovations

  • Exponential gating: Replacing sigmoid gates with exponential activations lets the model sharply revise earlier storage decisions, e.g. strongly up-weighting a new input over stale memory content, with log-domain stabilization to prevent overflow
  • Matrix memory (mLSTM): Storing key-value associations in a matrix enables Transformer-like retrieval capability within a recurrent framework, and is fully parallelizable
  • Covariance update rule: mLSTM's outer product storage and retrieval mirrors the key-value mechanism of attention but with bounded, constant memory
  • LSTM modernization: Pre-norm, residual connections, and block design from modern Transformers applied to LSTM architecture

Common Use Cases

Language modeling, sequence processing where constant-memory inference is important, time series forecasting, and as a research direction exploring alternatives to the Transformer's quadratic attention mechanism.

Notable Variants & Sizes

xLSTM models range from 125M to 1.3B parameters in published experiments. The architecture uses different ratios of sLSTM and mLSTM blocks. Common configurations place mLSTM blocks in the majority (e.g., 7:1 ratio of mLSTM to sLSTM). Vision-xLSTM adapts the architecture for image recognition tasks.

Technical Details

xLSTM 1.3B: ~48 blocks, embedding dim 2048, trained on 300B tokens from SlimPajama. mLSTM block: pre-norm → up-projection (2× expansion) → causal 1D convolution (kernel size 4) → mLSTM cell with d_head=64 → down-projection. sLSTM block: pre-norm → 4-head sLSTM with 4 cells per head → GeLU-gated FFN. The exponential-gate stabilization tracks the running maximum gate pre-activation in the log domain and subtracts it before exponentiation. Training uses AdamW with a cosine learning-rate schedule. At 1.3B scale, xLSTM matches or slightly exceeds Mamba and LLaMA-style baselines on perplexity benchmarks, with constant-memory inference regardless of sequence length.
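The mLSTM block pipeline above can be traced as a shape walkthrough. A sketch with scaled-down dimensions (dim 128 instead of the 1.3B model's 2048) and random stand-ins for learned weights; the mLSTM cell itself is elided, since only the surrounding projections and convolution are illustrated.

```python
import numpy as np

seq, dim, expand, d_head = 16, 128, 2, 64    # toy dims; expansion ~2x
rng = np.random.default_rng(0)

x = rng.standard_normal((seq, dim))
W_up = rng.standard_normal((dim, expand * dim)) * 0.01
up = x @ W_up                                 # up-projection: (seq, 2*dim)

# Causal 1D conv, kernel 4: position t sees inputs t-3..t only.
kernel = rng.standard_normal(4) * 0.1
conv = np.stack(
    [np.convolve(up[:, c], kernel)[:seq] for c in range(up.shape[1])], axis=1
)

heads = conv.reshape(seq, -1, d_head)         # split channels into heads
# ... mLSTM cell recurrence runs per head here (omitted) ...

W_down = rng.standard_normal((expand * dim, dim)) * 0.01
down = heads.reshape(seq, -1) @ W_down        # down-projection: (seq, dim)
```

The down-projection restores the input width, so the block's output can be added back residually to x.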