LLaMA, Mistral, Gemma, and Qwen are the leading families of open-weight large language models. All are built on the decoder-only dense Transformer architecture, with targeted optimizations that push the boundaries of efficiency and capability.

Architecture Overview

All four model families share the core decoder-only Transformer design: token embeddings with positional encoding feed into a stack of layers, each containing masked multi-head self-attention followed by a feed-forward network. Autoregressive generation produces one token at a time, attending to all previous tokens.
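The autoregressive loop described above can be sketched as a greedy decoder. The `toy_logits` function below is a hypothetical stand-in for a real model's forward pass (any of these models would return a logit per vocabulary item); the loop structure is the part that matches the text.

```python
import numpy as np

def toy_logits(tokens, vocab_size=10):
    # Hypothetical stand-in for a model forward pass: deterministic toy
    # scores derived from the token history, one logit per vocab entry.
    rng = np.random.default_rng(sum(tokens))
    return rng.standard_normal(vocab_size)

def greedy_generate(prompt, steps, eos_id=0):
    tokens = list(prompt)
    for _ in range(steps):
        logits = toy_logits(tokens)       # attends to all previous tokens
        next_id = int(np.argmax(logits))  # greedy: pick the top logit
        tokens.append(next_id)
        if next_id == eos_id:             # stop on end-of-sequence token
            break
    return tokens

out = greedy_generate([3, 1, 4], steps=5)
```

Real deployments usually sample from the softmax distribution (temperature, top-p) rather than taking the argmax, but the one-token-at-a-time structure is identical.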

Input tokens are embedded and passed through N transformer layers; Rotary Position Embeddings (RoPE) are applied to the query and key vectors inside each attention layer rather than added to the input embeddings. Each layer applies pre-RMSNorm, then grouped-query or multi-head attention with a residual connection, followed by another pre-RMSNorm and a SwiGLU feed-forward network with a residual connection. The final hidden state passes through RMSNorm and a linear head to produce next-token logits.
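The pre-norm residual block can be sketched in a few lines of numpy. This is a minimal illustration, not any model's actual implementation: the attention sublayer is passed in as a callable (here a placeholder), and the parameter names are invented for the example.

```python
import numpy as np

def rms_norm(x, weight, eps=1e-6):
    # RMSNorm: rescale by reciprocal root-mean-square; unlike LayerNorm,
    # no mean subtraction and no bias term.
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return (x / rms) * weight

def swiglu_ffn(x, w_gate, w_up, w_down):
    # SwiGLU feed-forward: a SiLU-gated linear unit, as used in
    # LLaMA, Mistral, and Qwen.
    silu = lambda z: z / (1.0 + np.exp(-z))
    return (silu(x @ w_gate) * (x @ w_up)) @ w_down

def layer(x, params, attention):
    # Pre-norm residual block: norm -> sublayer -> add, twice per layer.
    x = x + attention(rms_norm(x, params["norm1"]))
    x = x + swiglu_ffn(rms_norm(x, params["norm2"]),
                       params["w_gate"], params["w_up"], params["w_down"])
    return x

# Tiny smoke test with an identity stand-in for the attention sublayer.
rng = np.random.default_rng(0)
d, hidden = 8, 16
params = {
    "norm1": np.ones(d), "norm2": np.ones(d),
    "w_gate": rng.standard_normal((d, hidden)) * 0.1,
    "w_up":   rng.standard_normal((d, hidden)) * 0.1,
    "w_down": rng.standard_normal((hidden, d)) * 0.1,
}
x = rng.standard_normal((4, d))               # 4 token positions
y = layer(x, params, attention=lambda h: h)   # identity in place of GQA/MHA
```

A real layer would replace the identity callable with masked grouped-query or multi-head attention, with RoPE applied to queries and keys inside it.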

Key Innovations

  • LLaMA: Pioneered the efficient open-source recipe—RMSNorm, SwiGLU, RoPE, and no bias terms. LLaMA 3 scaled to 405B parameters with 15T training tokens and 128K context via RoPE frequency scaling
  • Mistral: Introduced Sliding Window Attention (SWA) for efficient long-context handling and brought grouped-query attention to smaller model sizes. Mistral 7B outperformed LLaMA 2 13B across benchmarks
  • Gemma: Google's contribution, with multi-query attention (in the smaller variants), GeGLU activation, and RoPE. Gemma 2 added logit soft-capping and alternating local/global attention layers
  • Qwen: Alibaba's series featuring SwiGLU, RoPE with YaRN extension for long context, and bias in QKV projections. Qwen 2.5 models match or exceed similarly-sized competitors across benchmarks
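RoPE, which every family above relies on, is simple enough to show directly: consecutive feature pairs of each query and key vector are rotated by position-dependent angles, so relative offsets fall out of the dot products. A minimal numpy sketch (base frequency 10000, as in the original RoPE formulation):

```python
import numpy as np

def rope(x, pos, base=10000.0):
    # Rotate consecutive feature pairs (x[0::2], x[1::2]) by angles that
    # grow with position and shrink geometrically across pairs.
    d = x.shape[-1]
    inv_freq = base ** (-np.arange(0, d, 2) / d)
    angles = pos * inv_freq                  # one angle per feature pair
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out
```

Because each pair undergoes a pure rotation, vector norms are preserved, and the long-context tricks mentioned above (LLaMA 3's frequency scaling, Qwen's YaRN) work by rescaling the `inv_freq` schedule rather than changing this structure.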

Common Use Cases

These models serve as foundation models for chat assistants, code generation, reasoning, summarization, translation, and domain-specific fine-tuning. Their open weights enable local deployment, custom fine-tuning with LoRA/QLoRA, and integration into production systems without API dependencies.

Notable Variants & Sizes

LLaMA 3: 8B, 70B, 405B. Mistral: 7B, 8x7B (Mixtral MoE), 8x22B, Large (123B). Gemma: 2B, 7B; Gemma 2: 2B, 9B, 27B. Qwen 2.5: 0.5B, 1.5B, 3B, 7B, 14B, 32B, 72B. Each family offers instruction-tuned variants alongside base models.

Technical Details

LLaMA 3 70B: 80 layers, 64 heads (8 KV heads via GQA), dim 8192, vocab 128K with a tiktoken-based tokenizer. Mistral 7B: 32 layers, 32 heads (8 KV heads), dim 4096, sliding window 4096 tokens. Training uses AdamW, a cosine schedule with warmup, BF16 mixed precision, and typically 2-15 trillion training tokens. All models support KV caching for efficient inference and are commonly quantized to 4-bit (GGUF/GPTQ/AWQ) for consumer hardware deployment.
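The figures above make it easy to estimate why GQA matters for the KV cache. The cache stores one key and one value vector per KV head, per layer, per token; a back-of-the-envelope sketch:

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_elt=2):
    # KV cache size: 2 tensors (K and V) per layer, each of shape
    # (seq_len, kv_heads * head_dim), at bytes_per_elt per value (BF16 = 2).
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elt

# LLaMA 3 70B figures from above: 80 layers, 8 KV heads,
# head_dim = 8192 / 64 heads = 128.
llama3_70b = kv_cache_bytes(layers=80, kv_heads=8, head_dim=128, seq_len=8192)
print(llama3_70b / 2**30)  # GiB for an 8K-token context in BF16 -> 2.5
```

With full multi-head attention (64 KV heads instead of 8), the same cache would be eight times larger, roughly 20 GiB per 8K-token sequence, which is why GQA is standard at this scale.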