The Dense Transformer is the foundational architecture behind GPT, LLaMA, and most modern large language models. It processes text by attending to all tokens in its context window simultaneously, enabling powerful language understanding and generation capabilities.

Architecture Overview

A Dense Transformer consists of a stack of identical layers, each containing two sub-layers: a multi-head self-attention mechanism and a position-wise feed-forward network. Input tokens are first converted to embeddings and combined with positional encodings, then passed through the layer stack sequentially.

In decoder-only variants (GPT, LLaMA), causal masking ensures each token can only attend to previous tokens, enabling autoregressive generation. The attention mechanism computes queries, keys, and values from the input, calculates attention weights as softmax(QK^T / sqrt(d_k)), and applies those weights to V to produce the output.
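The masked attention computation above can be sketched in NumPy. This is a minimal single-head version with no learned projections; the shapes and inputs are illustrative, not a production implementation:

```python
import numpy as np

def causal_attention(Q, K, V):
    """Scaled dot-product attention with a causal mask.

    Q, K, V: arrays of shape (seq_len, d_k). Each position may attend
    only to itself and earlier positions.
    """
    seq_len, d_k = Q.shape
    scores = Q @ K.T / np.sqrt(d_k)                      # (seq_len, seq_len)
    # mask out future positions (strict upper triangle) with -inf
    mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores = np.where(mask, -np.inf, scores)
    # numerically stable softmax over each row
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
Q = rng.standard_normal((4, 8))
K = rng.standard_normal((4, 8))
V = rng.standard_normal((4, 8))
out = causal_attention(Q, K, V)
```

Because of the causal mask, the first token can attend only to itself, so the first output row is exactly the first row of V; a full multi-head implementation would additionally split d_k across heads and apply learned Q/K/V/output projections.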

Each layer wraps both sub-layers in residual connections, with layer normalization applied either before each sub-layer (pre-norm, the modern default) or after it (post-norm, as in the original Transformer). The feed-forward network typically expands the hidden dimension by 4x (e.g., 4096 → 16384) before projecting back down, using GeLU or SiLU activation functions.
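A single pre-norm layer can be sketched as follows. This is a simplified NumPy illustration: the identity function stands in for the attention sub-layer, the weights are random placeholders, and biases are omitted:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # normalize each token's features to zero mean, unit variance
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def gelu(x):
    # tanh approximation of the GeLU activation
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def pre_norm_block(x, attn_fn, W1, W2):
    # residual connection around each sub-layer; LayerNorm is applied first
    x = x + attn_fn(layer_norm(x))            # attention sub-layer
    x = x + gelu(layer_norm(x) @ W1) @ W2     # FFN: expand 4x, project back
    return x

rng = np.random.default_rng(0)
d_model, d_ff = 8, 32                         # 4x expansion, as in the text
x = rng.standard_normal((5, d_model))
W1 = rng.standard_normal((d_model, d_ff)) * 0.1
W2 = rng.standard_normal((d_ff, d_model)) * 0.1
out = pre_norm_block(x, lambda h: h, W1, W2)  # identity stands in for attention
```

Note that the block's output has the same shape as its input, which is what allows dozens of identical layers to be stacked.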

Key Innovations

The original Transformer (Vaswani et al., 2017) introduced self-attention as a replacement for recurrence, enabling full parallelization during training. Key innovations in modern variants include:

  • Rotary Position Embeddings (RoPE): Used in LLaMA and most modern models, encoding relative position information directly into the attention computation
  • Pre-norm vs post-norm: Modern models apply layer normalization before each sub-layer rather than after it, which keeps training stable in deep stacks
  • Grouped Query Attention (GQA): Shares key-value heads across query heads to reduce memory during inference
  • SwiGLU activation: Replaces standard FFN with a gated linear unit variant for improved performance
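Of the innovations above, RoPE is the most self-contained to illustrate. The sketch below shows one common formulation, rotating consecutive dimension pairs by position-dependent angles (the base of 10000 follows the usual convention; real implementations apply this to the query and key vectors inside attention):

```python
import numpy as np

def rope(x, base=10000.0):
    """Apply rotary position embeddings to x of shape (seq_len, d), d even.

    Each pair of dimensions (x[2i], x[2i+1]) is rotated by pos * theta_i,
    so attention scores come to depend on relative position.
    """
    seq_len, d = x.shape
    pos = np.arange(seq_len)[:, None]              # (seq_len, 1)
    theta = base ** (-np.arange(0, d, 2) / d)      # (d/2,) per-pair frequencies
    angles = pos * theta                           # (seq_len, d/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))
rotated = rope(x)
```

Two properties worth noticing: position 0 is rotated by a zero angle and is left unchanged, and rotations preserve vector norms, so RoPE encodes position without changing the magnitude of queries and keys.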

Common Use Cases

Dense Transformers power virtually all modern language AI: text generation, code completion, question answering, summarization, translation, reasoning, and instruction following. They form the backbone of ChatGPT, Claude, Gemini, and open-source models like LLaMA and Mistral.

Notable Variants & Sizes

GPT-2 (117M to 1.5B parameters), GPT-3 (175B), LLaMA 2 (7B to 70B), the LLaMA 3 family (8B to 405B, the largest arriving with LLaMA 3.1), and Mistral 7B represent the evolution of this architecture. Models range from small enough to run on consumer hardware to massive models requiring distributed inference across many GPUs.

Technical Details

Typical configurations include hidden dimensions of 4096-8192, 32-80 attention heads, and 32-80 layers. Training uses the AdamW optimizer with cosine learning rate scheduling, and context lengths range from 1024 tokens (GPT-2) to 128K+ tokens (modern models). Training data scales from hundreds of billions to tens of trillions of tokens. FP16/BF16 mixed precision training is standard, with quantization (4-bit, 8-bit) commonly used for inference.
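The configuration numbers above can be turned into a rough back-of-envelope parameter count. The sketch below ignores biases, normalization parameters, and the gated (SwiGLU) FFN variant, and the example config is an assumption chosen to resemble a 7B-class model rather than any specific released checkpoint:

```python
def transformer_params(d_model, n_layers, vocab_size, d_ff=None):
    """Rough dense-Transformer parameter count.

    Per layer: 4 * d_model^2 for attention (Q, K, V, output projections)
    plus 2 * d_model * d_ff for the feed-forward up/down projections.
    Embeddings add vocab_size * d_model (assuming tied input/output).
    """
    d_ff = d_ff or 4 * d_model          # the typical 4x expansion
    attn = 4 * d_model * d_model
    ffn = 2 * d_model * d_ff
    embed = vocab_size * d_model
    return n_layers * (attn + ffn) + embed

# hypothetical 7B-class config: 4096 hidden, 32 layers, 32K vocabulary
total = transformer_params(d_model=4096, n_layers=32, vocab_size=32000)
print(f"{total / 1e9:.2f}B parameters")   # ≈ 6.6B
```

This kind of estimate also shows where the parameters live: at these sizes roughly two-thirds sit in the feed-forward networks, which is why FFN design choices like SwiGLU and the expansion ratio matter so much for model cost.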