Stacked LSTMs and GRUs are deep recurrent neural network architectures that process sequential data by maintaining hidden states across time steps, with gating mechanisms that control information flow to handle long-range dependencies.

Architecture Overview

An LSTM (Long Short-Term Memory) cell maintains two states: the hidden state h and the cell state c. At each time step, three gates control information flow over the concatenated input [h_{t-1}, x_t]:

  • Forget gate (what to discard from the cell state): f_t = σ(W_f·[h_{t-1}, x_t] + b_f)
  • Input gate (what new information to store): i_t = σ(W_i·[h_{t-1}, x_t] + b_i)
  • Candidate cell state: c̃_t = tanh(W_c·[h_{t-1}, x_t] + b_c)
  • Cell state update: c_t = f_t ⊙ c_{t-1} + i_t ⊙ c̃_t
  • Output gate and hidden state: o_t = σ(W_o·[h_{t-1}, x_t] + b_o), h_t = o_t ⊙ tanh(c_t)
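The gate equations above can be sketched in pure Python (a minimal, framework-free illustration; the function and parameter names like lstm_step and the dict layout of p are ours, not from any library):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def matvec(W, v):
    # W: list of rows; returns W @ v
    return [sum(w * a for w, a in zip(row, v)) for row in W]

def lstm_step(x, h_prev, c_prev, p):
    """One LSTM time step. p holds W_f, W_i, W_c, W_o (each acting on the
    concatenation [h_{t-1}, x_t]) and the biases b_f, b_i, b_c, b_o."""
    z = h_prev + x  # concatenation [h_{t-1}, x_t]
    f = [sigmoid(a + b) for a, b in zip(matvec(p["W_f"], z), p["b_f"])]  # forget gate
    i = [sigmoid(a + b) for a, b in zip(matvec(p["W_i"], z), p["b_i"])]  # input gate
    c_tilde = [math.tanh(a + b) for a, b in zip(matvec(p["W_c"], z), p["b_c"])]
    # additive cell-state update: c_t = f ⊙ c_{t-1} + i ⊙ c̃
    c = [fk * ck + ik * ctk for fk, ck, ik, ctk in zip(f, c_prev, i, c_tilde)]
    o = [sigmoid(a + b) for a, b in zip(matvec(p["W_o"], z), p["b_o"])]  # output gate
    h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
    return h, c
```

With all weights and biases at zero, every gate evaluates to σ(0) = 0.5 and the candidate to tanh(0) = 0, so the cell state simply halves each step — a quick sanity check on the update equation.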

GRU (Gated Recurrent Unit) simplifies this with two gates: the reset gate r_t = σ(W_r·[h_{t-1}, x_t] + b_r) controls how much of the past state enters the candidate, and the update gate z_t = σ(W_z·[h_{t-1}, x_t] + b_z) controls the blend between old and new state: h_t = (1-z_t) ⊙ h_{t-1} + z_t ⊙ tanh(W·[r_t ⊙ h_{t-1}, x_t] + b).
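A matching pure-Python sketch of one GRU step (again illustrative; gru_step and the parameter dict are our naming, and biases are folded out by using zero defaults in the test):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def matvec(W, v):
    # W: list of rows; returns W @ v
    return [sum(w * a for w, a in zip(row, v)) for row in W]

def gru_step(x, h_prev, p):
    """One GRU time step. p holds W_r, W_z (over [h_{t-1}, x_t]), W (over
    [r ⊙ h_{t-1}, x_t]), and biases b_r, b_z, b."""
    z_in = h_prev + x  # concatenation [h_{t-1}, x_t]
    r = [sigmoid(a + b) for a, b in zip(matvec(p["W_r"], z_in), p["b_r"])]  # reset gate
    z = [sigmoid(a + b) for a, b in zip(matvec(p["W_z"], z_in), p["b_z"])]  # update gate
    cand_in = [rk * hk for rk, hk in zip(r, h_prev)] + x  # [r ⊙ h_{t-1}, x_t]
    h_tilde = [math.tanh(a + b) for a, b in zip(matvec(p["W"], cand_in), p["b"])]
    # blend old and new state: h_t = (1-z) ⊙ h_{t-1} + z ⊙ h̃
    return [(1 - zk) * hp + zk * ht for zk, hp, ht in zip(z, h_prev, h_tilde)]
```

Note there is no separate cell state: the single vector h plays both roles, which is where the parameter savings over the LSTM come from.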

Stacking means placing multiple LSTM/GRU layers on top of each other, where the hidden states of one layer become the inputs to the next, enabling hierarchical feature extraction across different time scales.
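The stacking wiring can be shown independently of the cell type — the only requirement is that layer l's hidden state at time t becomes layer l+1's input at time t (stacked_forward and the toy step function below are illustrative, not a library API):

```python
def stacked_forward(xs, layer_states, step):
    """Run a stack of recurrent layers over a sequence.
    xs: list of input vectors, one per time step.
    layer_states: list of per-layer hidden states (mutated in place).
    step(x, h) -> new hidden state for one cell (LSTM, GRU, ...).
    Returns the top layer's hidden state at each time step."""
    outputs = []
    for x in xs:
        inp = x
        for l, h in enumerate(layer_states):
            h = step(inp, h)
            layer_states[l] = h
            inp = h  # hidden state of this layer is the next layer's input
        outputs.append(inp)  # top of the stack
    return outputs
```

Deeper layers see a sequence that has already been transformed once per layer below them, which is what enables feature extraction at progressively coarser time scales.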

Key Innovations

  • LSTM cell state highway: The cell state acts as a conveyor belt for gradients, with additive updates (not multiplicative), mitigating vanishing gradients over hundreds of time steps
  • GRU simplification: Combining forget and input gates into a single update gate and merging cell/hidden state reduces parameters by ~25% with comparable performance
  • Bidirectional processing: Running two RNNs in opposite directions and concatenating outputs captures both past and future context
  • Attention mechanisms: Bahdanau/Luong attention over LSTM encoder states enabled seq2seq models before Transformers
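The bidirectional bullet above reduces to a simple pattern: run one RNN left-to-right, another right-to-left, and concatenate their hidden states per time step (a minimal sketch with cell-agnostic step functions; names are ours):

```python
def bidirectional(xs, step_fwd, step_bwd, h0_f, h0_b):
    """xs: list of input vectors. step_*(x, h) -> new hidden state.
    Returns, per time step, the forward and backward hidden states
    concatenated, so position t sees both past and future context."""
    hs_f, h = [], h0_f
    for x in xs:                 # left-to-right pass
        h = step_fwd(x, h)
        hs_f.append(h)
    hs_b, h = [], h0_b
    for x in reversed(xs):       # right-to-left pass
        h = step_bwd(x, h)
        hs_b.append(h)
    hs_b.reverse()               # re-align with forward time order
    return [f + b for f, b in zip(hs_f, hs_b)]
```

Because the backward pass needs the full sequence before it can start, bidirectional RNNs suit offline tasks (tagging, encoding) rather than streaming generation.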

Common Use Cases

Time series forecasting, speech recognition (pre-Transformer), machine translation (seq2seq), text generation, sentiment analysis, named entity recognition, music generation, handwriting recognition, and any sequential processing task with moderate sequence lengths.

Notable Variants & Sizes

Standard LSTM, Peephole LSTM (gates access cell state directly), Bidirectional LSTM, Seq2Seq with Attention (Bahdanau, Luong). OpenAI's 2017 language model used a single 4096-unit LSTM. Google's NMT used 8-layer LSTMs. ELMo used 2-layer bidirectional LSTMs for contextualized word embeddings. AWD-LSTM is a well-regularized variant for language modeling.

Technical Details

Typical language model: 2-3 stacked LSTM layers with 256-1024 hidden units each. An LSTM with hidden size h has 4(h² + h·input_dim + h) parameters per layer (four gate computations, each with an h×(h + input_dim) weight matrix and a bias); a GRU has 3(h² + h·input_dim + h) (three gates). Training: truncated backpropagation through time (TBPTT) with sequence length 35-70, SGD or Adam, gradient clipping (max norm 0.25-1.0), dropout between layers (0.2-0.5), and weight tying (sharing the embedding and output-projection weights). AWD-LSTM achieves 57.3 perplexity on Penn Treebank with 3 layers, 1150 hidden units, and extensive regularization (variational dropout, weight dropout, AR/TAR regularization).
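The parameter formulas above are easy to check directly (helper names are ours; the count assumes one bias vector per gate, as in the equations earlier — some implementations, e.g. cuDNN-style cells, use two biases per gate and differ slightly):

```python
def lstm_params(h, input_dim):
    # 4 gates, each: h x h recurrent weights, h x input_dim input weights, h biases
    return 4 * (h * h + h * input_dim + h)

def gru_params(h, input_dim):
    # 3 gates with the same shapes
    return 3 * (h * h + h * input_dim + h)
```

For equal sizes the GRU always has exactly 3/4 of the LSTM's per-layer parameters — the ~25% reduction cited under Key Innovations.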