RWKV (Receptance Weighted Key Value) is a novel architecture that combines the parallelizable training of Transformers with the efficient O(1)-per-token inference of RNNs, achieving competitive language modeling performance with computational complexity linear in sequence length.
Architecture Overview
RWKV replaces self-attention with a linear attention mechanism based on the WKV (Weighted Key Value) operator. Each RWKV layer consists of two sub-blocks: a time-mixing block (replacing attention) and a channel-mixing block (replacing the FFN).
In the time-mixing block, the input is linearly projected into R (receptance), W (weight/decay), K (key), and V (value) vectors. The WKV computation uses an exponentially decaying weighting scheme: each position's output is a weighted sum of all previous values, where weights decay exponentially based on relative position plus a learned bonus for the current position. This creates an attention-like mechanism without the quadratic cost.
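The decaying weighted sum described above can be written out directly. Below is a minimal single-channel NumPy sketch, not the production kernel: the function name is illustrative, w and u are scalars here, and real implementations vectorize over channels and use a recurrent, numerically stable form. The weight on a past position i at time t is e^{-(t-1-i)w + k_i}, and the current token instead receives the bonus weight e^{u + k_t}.

```python
import numpy as np

def wkv_naive(k, v, w, u):
    """Naive O(T^2) WKV for one channel (illustrative only).

    k, v: arrays of shape (T,); w: decay rate >= 0; u: current-token bonus.
    """
    T = len(k)
    out = np.empty(T)
    for t in range(T):
        # weights on past positions decay exponentially with distance
        weights = np.exp(-(t - 1 - np.arange(t)) * w + k[:t])
        cur = np.exp(u + k[t])  # current token gets the learned bonus u
        num = (weights * v[:t]).sum() + cur * v[t]
        den = weights.sum() + cur
        out[t] = num / den
    return out
```

Note the output is a convex combination of values (weights are positive and normalized), which is what makes the mechanism attention-like despite the linear cost.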
The channel-mixing block acts as a gated FFN, using receptance (sigmoid gating) and key-value projections with squared ReLU activation. Both blocks use a token-shift mechanism: the current input is linearly interpolated with the previous timestep's input before projection, providing temporal context.
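A minimal sketch of one channel-mixing timestep follows, assuming per-channel interpolation weights mu_r, mu_k and projection matrices Wr, Wk, Wv (all names illustrative; biases and the exact shapes of the hidden projections are omitted):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_mix(x_t, x_prev, mu_r, mu_k, Wr, Wk, Wv):
    """Sketch of RWKV channel mixing for a single timestep.

    Token shift blends the current and previous inputs per channel;
    sigmoid(receptance) then gates a squared-ReLU key/value FFN.
    """
    xr = mu_r * x_t + (1 - mu_r) * x_prev  # token-shifted input for R
    xk = mu_k * x_t + (1 - mu_k) * x_prev  # token-shifted input for K
    r = sigmoid(Wr @ xr)                   # gate in (0, 1)
    k = np.maximum(Wk @ xk, 0.0) ** 2      # squared ReLU activation
    return r * (Wv @ k)                    # gated output
```

The sigmoid receptance acts like a forget/output gate in classic RNNs, letting each channel suppress or pass the FFN output.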
Key Innovations
- Linear attention formulation: The WKV operator achieves attention-like behavior with O(T) complexity instead of O(T²), enabling efficient training on long sequences
- RNN mode inference: During generation, RWKV maintains a fixed-size state and processes one token at a time with O(1) complexity—no growing KV cache
- Token shift: A simple linear interpolation with the previous token replaces positional embeddings, providing relative position information
- Exponential decay: Learned per-channel decay rates (W) create naturally diminishing attention to distant tokens, balancing local and global context
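The RNN-mode inference in the list above reduces to carrying a running numerator and denominator per channel, decayed by e^{-w} at each step. A hedged single-channel sketch (state names a, b are illustrative) consistent with the exponential-decay weighting:

```python
import numpy as np

def wkv_step(a, b, k_t, v_t, w, u):
    """One O(1) RNN-mode WKV step for a single channel (sketch).

    State (a, b) holds the decayed numerator and denominator sums
    over all past tokens; no KV cache grows with sequence length.
    """
    e_cur = np.exp(u + k_t)  # current token weighted with bonus u
    wkv = (a + e_cur * v_t) / (b + e_cur)
    # decay the state, then fold in the current token (without the bonus)
    a = np.exp(-w) * a + np.exp(k_t) * v_t
    b = np.exp(-w) * b + np.exp(k_t)
    return wkv, a, b
```

Initialized with a = b = 0, this recurrence reproduces the full decaying sum exactly, which is why generation needs only constant memory per layer.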
Common Use Cases
Language modeling, text generation, long-context processing, edge device deployment (constant-memory inference), real-time and streaming text processing, and as an efficient alternative to Transformers for sequential tasks generally.
Notable Variants & Sizes
RWKV-4: 169M, 430M, 1.5B, 3B, 7B, 14B. RWKV-5 (Eagle): improved time-mixing with multi-headed formulation. RWKV-6 (Finch): further improvements with data-dependent decay and enhanced receptance. The community maintains active development with models trained on trillions of tokens.
Technical Details
RWKV-4 7B: 32 layers, hidden dimension 4096, trained on the Pile (330B tokens). Time-mixing projections: R, K, V ∈ R^d, with a learned per-channel decay W ∈ R^d. WKV formula: wkv_t = (Σ_{i=1}^{t-1} e^{-(t-1-i)w+k_i} · v_i + e^{u+k_t} · v_t) / (Σ_{i=1}^{t-1} e^{-(t-1-i)w+k_i} + e^{u+k_t}), where u is a learned bonus that upweights the current token. Token shift: x'_t = μ·x_t + (1-μ)·x_{t-1} with a learned μ per channel. Training uses AdamW, DeepSpeed, BF16, and a data regime comparable to similarly sized Transformers. RWKV-4 14B achieves performance between GPT-NeoX 20B and Pythia 12B.
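The exponentials in the WKV formula can overflow in finite precision, so practical implementations track the running maximum exponent and keep the state in a rescaled form. A single-channel sketch of that numerically stable recurrence (state names p, q, o are illustrative; p and q are the numerator and denominator scaled by e^{-o}):

```python
import numpy as np

def wkv_step_stable(p, q, o, k_t, v_t, w, u):
    """Numerically stable O(1) WKV step for one channel (sketch).

    o tracks the largest exponent seen so far; p, q store the
    numerator/denominator divided by exp(o), so no exp() overflows.
    Initialize with p = q = 0.0, o = float('-inf').
    """
    # output: rescale both the stored state and the current-token term
    no = max(o, u + k_t)
    e1 = np.exp(o - no)          # scale factor for the stored state
    e2 = np.exp(u + k_t - no)    # scale factor for the current token
    wkv = (e1 * p + e2 * v_t) / (e1 * q + e2)
    # state update: decay shifts the stored max exponent by -w
    no2 = max(o - w, k_t)
    p = np.exp(o - w - no2) * p + np.exp(k_t - no2) * v_t
    q = np.exp(o - w - no2) * q + np.exp(k_t - no2)
    return wkv, p, q, no2
```

Every exp() argument here is non-positive, so the state stays bounded even for large keys or long sequences; this mirrors the rescaling trick used in RWKV-4 inference kernels.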