Mixture of Experts (MoE) is an architectural paradigm that dramatically scales model capacity while keeping computational cost manageable: each input is routed to only a subset of specialized "expert" sub-networks, enabling trillion-parameter models with practical inference costs.

Architecture Overview

In a standard MoE Transformer, the dense feed-forward network (FFN) in each layer is replaced by multiple parallel FFN "experts" plus a gating/routing network. The router is typically a small linear layer that maps a token's hidden representation to a probability distribution over experts.

For each token, the router selects the top-k experts (typically k=1 or k=2) and routes the token to those experts only. The expert outputs are weighted by their routing probabilities and summed. The attention layers remain dense (shared across all tokens). This means total model parameters can be very large (as each expert adds parameters), but compute per token stays fixed (only k experts activate).
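The per-token routing described above can be sketched in plain Python. This is a toy illustration, not any production implementation: the experts are simple functions and the router weights are made up, but the structure (linear router, softmax, top-k selection, renormalized weighted sum) follows the description.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def moe_layer(token, experts, router_weights, k=2):
    # Router: linear projection of the token, one logit per expert.
    logits = [sum(w * x for w, x in zip(row, token)) for row in router_weights]
    probs = softmax(logits)
    # Select the top-k experts by routing probability.
    topk = sorted(range(len(experts)), key=lambda i: probs[i], reverse=True)[:k]
    # Renormalize the selected probabilities and mix the expert outputs.
    norm = sum(probs[i] for i in topk)
    out = [0.0] * len(token)
    for i in topk:
        expert_out = experts[i](token)
        weight = probs[i] / norm
        out = [o + weight * e for o, e in zip(out, expert_out)]
    return out
```

Note that renormalizing the top-k probabilities is equivalent to taking a softmax over just the selected logits, so the combined weights always sum to 1.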

A critical challenge is load balancing—without intervention, routing tends to collapse to using only a few experts. Auxiliary losses encourage uniform expert utilization: a load balancing loss penalizes uneven routing, and sometimes a router z-loss prevents routing logits from growing too large.
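Both auxiliary losses can be sketched minimally, following the Switch Transformer formulation (exact details vary by framework; this sketch assumes top-1 dispatch for the routed-fraction statistic):

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def load_balancing_loss(batch_logits, n_experts):
    """Switch-style aux loss: N * sum_i f_i * P_i, where f_i is the
    fraction of tokens dispatched to expert i and P_i is the mean router
    probability for expert i. Minimized (value 1.0) at uniform routing."""
    probs = [softmax(l) for l in batch_logits]
    n = len(batch_logits)
    f = [0.0] * n_experts
    for p in probs:
        f[max(range(n_experts), key=lambda i: p[i])] += 1.0 / n
    P = [sum(p[i] for p in probs) / n for i in range(n_experts)]
    return n_experts * sum(fi * Pi for fi, Pi in zip(f, P))

def router_z_loss(batch_logits):
    """Penalizes large router logits: mean squared log-sum-exp per token."""
    return sum(math.log(sum(math.exp(x) for x in l)) ** 2
               for l in batch_logits) / len(batch_logits)
```

If routing collapses so that one expert receives nearly all probability mass, f_i and P_i both concentrate on that expert and the balancing loss approaches N instead of its minimum of 1.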

Key Innovations

  • Conditional computation: Activating only a subset of parameters per input enables much larger models without proportional compute increase
  • Expert specialization: Different experts can specialize in different types of tokens or concepts, increasing model capacity per FLOP
  • Top-k routing: Sparse top-k selection keeps computation sparse while gradients flow through the selected gate weights, enabling end-to-end training
  • Load balancing losses: Auxiliary losses ensure all experts are utilized, preventing capacity waste from routing collapse

Common Use Cases

Large-scale language models (Mixtral, Switch Transformer, and, reportedly, GPT-4), machine translation, multi-task learning, and any scenario where model capacity needs to scale beyond what dense models can afford computationally.

Notable Variants & Sizes

  • Switch Transformer: 1.6T parameters, top-1 routing
  • GShard: 600B, top-2
  • Mixtral 8x7B: 46.7B total, 12.9B active, 8 experts, top-2
  • Mixtral 8x22B: 141B total, 39B active
  • DBRX: 132B total, 36B active, 16 experts, top-4
  • Grok-1: 314B, 8 experts, top-2
  • Arctic: 480B total
  • DeepSeek-MoE: 16B, fine-grained experts

Technical Details

Mixtral 8x7B: 32 layers, 32 attention heads, model dim 4096. Each layer has 8 expert FFNs (hidden dim 14336) with top-2 routing: for each token, 2 of the 8 experts activate, so active params ≈ 12.9B of 46.7B total. The router is a linear projection from d_model to n_experts, followed by softmax and top-k selection. The load balancing loss weight is typically 0.01-0.1, and an expert capacity factor caps the tokens each expert can receive per batch (typically 1.0-1.5×).

Training challenges include expert parallelism across GPUs, all-to-all communication for routing, and handling dropped tokens when experts overflow capacity. DeepSeek-MoE instead uses fine-grained experts (more, smaller experts) plus shared experts that always activate.
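The 46.7B-total / 12.9B-active figures can be reproduced from the published architecture. The sketch below assumes Mixtral's SwiGLU FFN (three weight matrices per expert), grouped-query attention with 8 KV heads of head dim 128, and a 32000-token vocabulary with untied input/output embeddings; layer norms are ignored as negligible.

```python
d_model, d_ff = 4096, 14336
n_layers, n_experts, top_k = 32, 8, 2
n_kv_heads, head_dim, vocab = 8, 128, 32000

# SwiGLU FFN: gate, up, and down projections per expert.
ffn_per_expert = 3 * d_model * d_ff
# GQA attention: full-size Q and O projections, smaller K and V.
attn_per_layer = 2 * d_model * d_model + 2 * d_model * (n_kv_heads * head_dim)
router_per_layer = d_model * n_experts
embed = 2 * vocab * d_model  # input embedding + output head (untied)

dense = n_layers * (attn_per_layer + router_per_layer) + embed
total = dense + n_layers * n_experts * ffn_per_expert
active = dense + n_layers * top_k * ffn_per_expert
print(f"total = {total / 1e9:.1f}B, active = {active / 1e9:.1f}B")
# prints "total = 46.7B, active = 12.9B"
```

The expert FFNs account for roughly 45.1B of the 46.7B total; attention, router, and embedding parameters (about 1.6B) are shared by every token regardless of routing.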