T5 (Text-to-Text Transfer Transformer) and BART (Bidirectional and Auto-Regressive Transformers) are encoder-decoder Transformer models. T5 frames every NLP task as a text-to-text problem, while BART is pretrained as a denoising autoencoder; both excel at tasks that require understanding the input as well as generating output.

Architecture Overview

Both models use the full encoder-decoder Transformer architecture. The encoder processes the input sequence with bidirectional self-attention (like BERT), producing contextualized representations. The decoder generates output tokens autoregressively, using both causal self-attention and cross-attention to the encoder's output.
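The contrast between the two attention patterns can be sketched with boolean masks (a minimal illustration, not either model's actual implementation): the encoder allows every position to attend everywhere, while the decoder's self-attention is restricted to earlier positions.

```python
import numpy as np

def encoder_mask(seq_len):
    """Bidirectional self-attention: every position may attend to every position."""
    return np.ones((seq_len, seq_len), dtype=bool)

def decoder_causal_mask(seq_len):
    """Causal self-attention: position i may attend only to positions j <= i,
    so generation cannot peek at future tokens."""
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))
```

Row `i` of the mask lists the positions token `i` is allowed to attend to; the causal mask is what makes autoregressive decoding consistent between training and inference.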

In T5, every NLP task is converted to a text-to-text format with a task-specific prefix. For example, translation becomes "translate English to German: [text]" and the model generates the German output. BART uses a similar structure but was pretrained with a denoising objective that corrupts input text and trains the model to reconstruct the original.
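The prefix convention is easy to illustrate with a small helper (the helper itself is hypothetical, but the prefix strings shown are the ones used in the T5 paper):

```python
# T5's text-to-text convention: every task becomes "prefix: input text",
# and the model's answer is always generated as plain text.
TASK_PREFIXES = {
    "translate_en_de": "translate English to German: ",
    "summarize": "summarize: ",
    "cola": "cola sentence: ",  # grammatical-acceptability classification
}

def to_text2text(task, text):
    """Format an input for a T5-style model by prepending its task prefix."""
    return TASK_PREFIXES[task] + text
```

Even classification labels ("acceptable" / "not acceptable" for CoLA) are emitted as literal text by the decoder, which is what lets one model and one loss cover every task.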

Cross-attention layers in the decoder attend to the full encoder output, allowing the decoder to focus on relevant parts of the input when generating each output token.
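A single-head sketch of this mechanism (projections and multi-head splitting omitted for brevity): queries come from the decoder states, while keys and values come from the encoder output, so each decoder position computes a weighted mixture of source representations.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(decoder_states, encoder_states):
    """Illustrative single-head cross-attention without learned projections.

    decoder_states: (tgt_len, d) queries from the decoder.
    encoder_states: (src_len, d) keys/values from the encoder output.
    Returns (tgt_len, d) mixed source representations and the attention weights.
    """
    d = decoder_states.shape[-1]
    scores = decoder_states @ encoder_states.T / np.sqrt(d)  # (tgt_len, src_len)
    weights = softmax(scores, axis=-1)  # each decoder position sums to 1 over the source
    return weights @ encoder_states, weights
```

Because the weights are normalized over source positions, each generated token can "focus" on the parts of the input most relevant at that decoding step.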

Key Innovations

  • T5's text-to-text framework: Unifies classification, translation, summarization, and question answering under a single interface, so one model with one objective and one decoding procedure can handle many different tasks
  • T5's systematic study: The T5 paper explored dozens of design decisions (architectures, pretraining objectives, data, scale), providing a comprehensive recipe for training language models
  • BART's denoising pretraining: Combines multiple corruption strategies—token masking, deletion, infilling, sentence permutation, and document rotation—for robust pretraining
  • Relative position biases: T5 uses learned relative position biases instead of absolute positional embeddings, improving generalization to different sequence lengths
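T5's relative position biases are shared across layers and bucketed so that nearby offsets get exact buckets while distant offsets are grouped logarithmically. A scalar sketch of the bucketing scheme, using T5's default constants (32 buckets, max distance 128):

```python
import math

def relative_position_bucket(rel_pos, bidirectional=True, num_buckets=32, max_distance=128):
    """Map a relative position (key_pos - query_pos) to a learned-bias bucket.

    Small offsets get their own exact bucket; larger offsets share
    logarithmically sized buckets up to max_distance.
    """
    bucket = 0
    if bidirectional:
        num_buckets //= 2
        if rel_pos > 0:          # positions after the query use the upper half
            bucket += num_buckets
        rel_pos = abs(rel_pos)
    else:
        rel_pos = max(-rel_pos, 0)  # causal attention only looks backward
    max_exact = num_buckets // 2
    if rel_pos < max_exact:
        bucket += rel_pos        # exact bucket per offset
    else:                        # logarithmic buckets for distant offsets
        val = max_exact + int(
            math.log(rel_pos / max_exact)
            / math.log(max_distance / max_exact)
            * (num_buckets - max_exact)
        )
        bucket += min(val, num_buckets - 1)
    return bucket
```

Because the bias depends only on the bucket, not on absolute indices, sequences longer than those seen in training still map onto learned buckets, which is the source of the length generalization noted above.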

Common Use Cases

Summarization (BART excels here), machine translation, question answering, text classification, data-to-text generation, dialogue systems, and as the base for instruction-tuned models like FLAN-T5.

Notable Variants & Sizes

T5-Small (60M), T5-Base (220M), T5-Large (770M), T5-3B, T5-11B. FLAN-T5 adds instruction tuning across 1800+ tasks. BART-Base (140M, 6+6 layers), BART-Large (400M, 12+12 layers). mBART extends to 25+ languages. UL2 (20B) unifies multiple pretraining objectives.

Technical Details

T5-Base: 12 encoder + 12 decoder layers, 12 heads, model dim 768, FFN dim 3072. Uses a SentencePiece tokenizer with a 32K vocabulary. Pretrained on C4 (Colossal Clean Crawled Corpus, ~750GB). T5's span-corruption objective masks 15% of tokens in contiguous spans (mean span length 3), replacing each span with a sentinel token. BART-Large: 12+12 layers, 16 heads, dim 1024, trained on ~160GB of text (the RoBERTa corpus). T5 was pretrained with the Adafactor optimizer and an inverse square root learning-rate schedule; BART was trained with Adam.
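The span-corruption format is concrete enough to sketch. The function below is a simplified illustration (real T5 samples span positions randomly; here they are passed in explicitly), but the sentinel convention `<extra_id_0>`, `<extra_id_1>`, ... matches T5's:

```python
def span_corrupt(tokens, spans):
    """T5-style span corruption with explicit (start, end) spans.

    tokens: list of token strings.
    spans:  sorted, non-overlapping (start, end) pairs, end exclusive.
    Returns (input_tokens, target_tokens): each corrupted span is replaced
    by one sentinel in the input; the target lists each sentinel followed
    by the tokens it replaced, ending with a final sentinel.
    """
    inp, tgt = [], []
    prev = 0
    for i, (s, e) in enumerate(spans):
        sentinel = f"<extra_id_{i}>"
        inp.extend(tokens[prev:s])
        inp.append(sentinel)
        tgt.append(sentinel)
        tgt.extend(tokens[s:e])
        prev = e
    inp.extend(tokens[prev:])
    tgt.append(f"<extra_id_{len(spans)}>")  # end-of-targets sentinel
    return inp, tgt
```

For the sentence "Thank you for inviting me to your party last week" with "for inviting" and "last" corrupted, the input becomes "Thank you <extra_id_0> me to your party <extra_id_1> week" and the target "<extra_id_0> for inviting <extra_id_1> last <extra_id_2>", which is the worked example from the T5 paper. Note the target is much shorter than the input, one reason span corruption trains efficiently.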