Hyena is a sub-quadratic attention replacement that uses long convolutions and element-wise gating to achieve Transformer-quality performance with significantly reduced computational cost, particularly for long sequences.
Architecture Overview
Hyena replaces the standard self-attention mechanism with a hierarchy of two operations: long convolutions and element-wise multiplicative gating. The Hyena operator of order N interleaves N long convolutions with N element-wise gating operations, acting on N+1 linear projections of the input.
For a typical order-2 Hyena operator, the input is projected into three branches (analogous to Q, K, V in attention): one value and two gates. The computation proceeds as: apply a long convolution to the value, element-wise multiply with the first gate, apply a second long convolution to the result, and element-wise multiply with the second gate.
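The order-2 computation above can be sketched in NumPy. This is an illustrative sketch, not the reference implementation; the function names, shapes, and the use of a plain causal convolution are assumptions for clarity:

```python
import numpy as np

def causal_conv(h, v):
    """Causal long convolution: out[t] = sum_{s<=t} h[s] * v[t-s]."""
    L = len(v)
    return np.convolve(h, v)[:L]  # truncate to the input length

def hyena_order2(v, x1, x2, h1, h2):
    """Order-2 Hyena operator: y = x2 * (h2 conv (x1 * (h1 conv v))).

    v, x1, x2 -- input projections (value and two gates), each length L
    h1, h2   -- long convolution filters, each length L
    """
    z = x1 * causal_conv(h1, v)   # convolve the value, first gate
    y = x2 * causal_conv(h2, z)   # convolve the result, second gate
    return y

# Toy example on a length-8 sequence
L = 8
rng = np.random.default_rng(0)
v, x1, x2, h1, h2 = (rng.standard_normal(L) for _ in range(5))
y = hyena_order2(v, x1, x2, h1, h2)
print(y.shape)  # (8,)
```

Because both convolutions are causal, output position t depends only on inputs at positions 0..t, which is what makes the operator usable for autoregressive language modeling.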
The long convolutions are parameterized implicitly using a small neural network (typically a feed-forward network with sinusoidal positional features) that generates the convolution filter values, rather than storing the full filter explicitly. This enables filters of any length without proportional parameter growth. The convolutions are computed efficiently in the frequency domain using FFT.
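The implicit parameterization can be sketched as follows: a small fixed parameter set (an FFN over sinusoidal positional features) maps each position to a filter value, so the same parameters produce a filter of any length. The dimensions, the decay window, and the weight initialization here are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
D_POS, D_HID = 8, 16                         # feature and hidden dims (assumed)
freqs = rng.standard_normal(D_POS // 2)      # learned frequencies in the real model
W1 = rng.standard_normal((D_POS, D_HID)) * 0.1
W2 = rng.standard_normal((D_HID, 1)) * 0.1

def implicit_filter(L):
    """Generate a length-L convolution filter from a fixed, small parameter set."""
    t = np.linspace(0.0, 1.0, L)[:, None]    # normalized positions, shape (L, 1)
    feats = np.concatenate([np.sin(2 * np.pi * t * freqs),
                            np.cos(2 * np.pi * t * freqs)], axis=1)  # (L, D_POS)
    h = np.tanh(feats @ W1) @ W2             # 2-layer FFN -> (L, 1)
    return h[:, 0] * np.exp(-2.0 * t[:, 0])  # decay window, Hyena-style

# The same parameters yield filters of any length: no growth with L.
print(implicit_filter(64).shape, implicit_filter(1024).shape)  # (64,) (1024,)
```

The parameter count is fixed by D_POS and D_HID, not by the sequence length, which is what makes the model sequence-length-agnostic.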
Key Innovations
- Implicit convolution parameterization: A small MLP generates convolution filters of arbitrary length, enabling sequence-length-agnostic models with minimal parameters
- Sub-quadratic complexity: O(N log N) via FFT-based convolution, compared to O(N²) for standard attention
- Data-controlled gating: Element-wise multiplication with input-dependent gates provides the data-dependent filtering that static convolutions lack
- Drop-in replacement: Hyena operators can replace attention layers in existing Transformer architectures with minimal modification
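To make the complexity gap concrete, a back-of-the-envelope operation count (ignoring constants) for a one-million-token sequence:

```python
import math

N = 1_000_000
attention_ops = N ** 2              # O(N^2): pairwise attention scores
fft_conv_ops = N * math.log2(N)     # O(N log N): FFT-based convolution
print(f"attention ~{attention_ops:.1e} ops, FFT conv ~{fft_conv_ops:.1e} ops, "
      f"ratio ~{attention_ops / fft_conv_ops:.0f}x")
```

At this length the asymptotic gap is roughly four to five orders of magnitude, which is why sub-quadratic operators matter most for genomics-scale contexts.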
Common Use Cases
Language modeling with long contexts, genomics (DNA sequence modeling where sequences reach millions of bases), efficient processing of very long documents, and research into sub-quadratic alternatives to attention.
Notable Variants & Sizes
Hyena (original), HyenaDNA (trained on human genome, handles sequences up to 1M bases at single-nucleotide resolution), StripedHyena (7B parameter model by Together AI combining Hyena and attention layers), and SavannahHyena. Evo (genomics foundation model) uses StripedHyena architecture at 7B scale.
Technical Details
- Hyena operator: order 2 (standard), with 3 projections from input dimension d
- Implicit filter MLP: 2-layer FFN over sinusoidal positional features (learned frequencies), generating filters of length L
- FFT-based convolution: pad to the next power of 2 (at least 2L, so the circular convolution matches the linear one), FFT both signal and filter, multiply element-wise, inverse FFT
- A Hyena-based language model with 355M parameters achieves perplexity comparable to a GPT-style Transformer on the Pile
- StripedHyena 7B interleaves Hyena and attention layers (roughly 29 Hyena + 3 attention) for a hybrid approach, trained on roughly 2T tokens
- Training uses standard Transformer recipes: AdamW, cosine learning-rate schedule, BF16
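The FFT convolution step can be sketched and checked against a direct convolution. Padding to a power of 2 that is at least 2L is what prevents circular wraparound; the helper names here are assumptions:

```python
import numpy as np

def next_pow2(n):
    """Smallest power of 2 >= n."""
    return 1 << (n - 1).bit_length()

def fft_causal_conv(h, v):
    """Causal convolution via FFT: pad to a power of 2 >= 2L so the
    circular convolution the FFT computes equals the linear one."""
    L = len(v)
    n = next_pow2(2 * L)
    H = np.fft.rfft(h, n=n)               # zero-pads h to length n
    V = np.fft.rfft(v, n=n)
    return np.fft.irfft(H * V, n=n)[:L]   # multiply in frequency, invert, truncate

rng = np.random.default_rng(2)
L = 100
h, v = rng.standard_normal(L), rng.standard_normal(L)
direct = np.convolve(h, v)[:L]            # O(L^2) reference
fast = fft_causal_conv(h, v)              # O(L log L), same result
print(np.allclose(direct, fast))          # True
```

Without the 2L padding, contributions from the end of the sequence would wrap around to the beginning and break causality.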