RWKV (Receptance Weighted Key Value) is a novel architecture that combines the parallelizable training of Transformers with the efficient O(1)-per-token inference of RNNs, achieving competitive language modeling performance with computational complexity linear in sequence length.
Architecture Overview
RWKV replaces self-attention with a linear attention mechanism based on the WKV (Weighted Key Value) operator. Each RWKV layer consists of two sub-blocks: a time-mixing block (replacing attention) and a channel-mixing block (replacing the FFN).
In the time-mixing block, the input is linearly projected into R (receptance), W (weight/decay), K (key), and V (value) vectors. The WKV computation uses an exponentially decaying weighting scheme: each position's output is a weighted sum of all previous values, where weights decay exponentially based on relative position plus a learned bonus for the current position. This creates an attention-like mechanism without the quadratic cost.
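The decaying weighted sum described above can be written out directly. Below is a minimal single-channel NumPy sketch, not the production kernel: the function name is illustrative, w and u are scalars here, and real implementations vectorize over channels and use a recurrent, numerically stable form. The weight on a past position i at time t is e^{-(t-1-i)w + k_i}, and the current token instead receives the bonus weight e^{u + k_t}.

```python
import numpy as np

def wkv_naive(k, v, w, u):
    """Naive O(T^2) WKV for one channel (illustrative only).

    k, v: arrays of shape (T,); w: decay rate >= 0; u: current-token bonus.
    """
    T = len(k)
    out = np.empty(T)
    for t in range(T):
        # weights on past positions decay exponentially with distance
        weights = np.exp(-(t - 1 - np.arange(t)) * w + k[:t])
        cur = np.exp(u + k[t])  # current token gets the learned bonus u
        num = (weights * v[:t]).sum() + cur * v[t]
        den = weights.sum() + cur
        out[t] = num / den
    return out
```

Note the output is a convex combination of values (weights are positive and normalized), which is what makes the mechanism attention-like despite the linear cost.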
The channel-mixing block acts as a gated FFN, using receptance (sigmoid gating) and key-value projections with squared ReLU activation. Both blocks use a token-shift mechanism: the current input is linearly interpolated with the previous timestep's input before projection, providing temporal context.
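A minimal sketch of one channel-mixing timestep follows, assuming per-channel interpolation weights mu_r, mu_k and projection matrices Wr, Wk, Wv (all names illustrative; biases and the exact shapes of the hidden projections are omitted):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_mix(x_t, x_prev, mu_r, mu_k, Wr, Wk, Wv):
    """Sketch of RWKV channel mixing for a single timestep.

    Token shift blends the current and previous inputs per channel;
    sigmoid(receptance) then gates a squared-ReLU key/value FFN.
    """
    xr = mu_r * x_t + (1 - mu_r) * x_prev  # token-shifted input for R
    xk = mu_k * x_t + (1 - mu_k) * x_prev  # token-shifted input for K
    r = sigmoid(Wr @ xr)                   # gate in (0, 1)
    k = np.maximum(Wk @ xk, 0.0) ** 2      # squared ReLU activation
    return r * (Wv @ k)                    # gated output
```

The sigmoid receptance acts like a forget/output gate in classic RNNs, letting each channel suppress or pass the FFN output.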
Key Innovations
- Linear attention formulation: The WKV operator achieves attention-like behavior with O(T) complexity instead of O(T²), enabling efficient training on long sequences
- RNN mode inference: During generation, RWKV maintains a fixed-size state and processes one token at a time with O(1) complexity—no growing KV cache
- Token shift: A simple linear interpolation with the previous token replaces positional embeddings, providing relative position information
- Exponential decay: Learned per-channel decay rates (W) create naturally diminishing attention to distant tokens, balancing local and global context
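The RNN-mode inference in the list above reduces to carrying a running numerator and denominator per channel, decayed by e^{-w} at each step. A hedged single-channel sketch (state names a, b are illustrative) consistent with the exponential-decay weighting:

```python
import numpy as np

def wkv_step(a, b, k_t, v_t, w, u):
    """One O(1) RNN-mode WKV step for a single channel (sketch).

    State (a, b) holds the decayed numerator and denominator sums
    over all past tokens; no KV cache grows with sequence length.
    """
    e_cur = np.exp(u + k_t)  # current token weighted with bonus u
    wkv = (a + e_cur * v_t) / (b + e_cur)
    # decay the state, then fold in the current token (without the bonus)
    a = np.exp(-w) * a + np.exp(k_t) * v_t
    b = np.exp(-w) * b + np.exp(k_t)
    return wkv, a, b
```

Initialized with a = b = 0, this recurrence reproduces the full decaying sum exactly, which is why generation needs only constant memory per layer.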
Common Use Cases
Language modeling, text generation, long-context processing, edge device deployment (constant-memory inference), real-time and streaming text processing, and as an efficient alternative to Transformers for sequential tasks generally.
Notable Variants & Sizes
RWKV-4: 169M, 430M, 1.5B, 3B, 7B, 14B. RWKV-5 (Eagle): improved time-mixing with multi-headed formulation. RWKV-6 (Finch): further improvements with data-dependent decay and enhanced receptance. The community maintains active development with models trained on trillions of tokens.
Technical Details
RWKV-4 7B: 32 layers, hidden dimension 4096, trained on the Pile (330B tokens). Time-mixing projections: R, K, V ∈ R^d, with a learned per-channel decay W ∈ R^d. WKV formula: wkv_t = (Σ_{i=1}^{t-1} e^{-(t-1-i)w+k_i} · v_i + e^{u+k_t} · v_t) / (Σ_{i=1}^{t-1} e^{-(t-1-i)w+k_i} + e^{u+k_t}), where u is a learned bonus that upweights the current token. Token shift: x'_t = μ·x_t + (1-μ)·x_{t-1} with a learned μ per channel. Training uses AdamW, DeepSpeed, BF16, and a data regime comparable to similarly sized Transformers. RWKV-4 14B achieves performance between GPT-NeoX 20B and Pythia 12B.
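The exponentials in the WKV formula can overflow in finite precision, so practical implementations track the running maximum exponent and keep the state in a rescaled form. A single-channel sketch of that numerically stable recurrence (state names p, q, o are illustrative; p and q are the numerator and denominator scaled by e^{-o}):

```python
import numpy as np

def wkv_step_stable(p, q, o, k_t, v_t, w, u):
    """Numerically stable O(1) WKV step for one channel (sketch).

    o tracks the largest exponent seen so far; p, q store the
    numerator/denominator divided by exp(o), so no exp() overflows.
    Initialize with p = q = 0.0, o = float('-inf').
    """
    # output: rescale both the stored state and the current-token term
    no = max(o, u + k_t)
    e1 = np.exp(o - no)          # scale factor for the stored state
    e2 = np.exp(u + k_t - no)    # scale factor for the current token
    wkv = (e1 * p + e2 * v_t) / (e1 * q + e2)
    # state update: decay shifts the stored max exponent by -w
    no2 = max(o - w, k_t)
    p = np.exp(o - w - no2) * p + np.exp(k_t - no2) * v_t
    q = np.exp(o - w - no2) * q + np.exp(k_t - no2)
    return wkv, p, q, no2
```

Every exp() argument here is non-positive, so the state stays bounded even for large keys or long sequences; this mirrors the rescaling trick used in RWKV-4 inference kernels.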