T5 (Text-to-Text Transfer Transformer) and BART (Bidirectional and Auto-Regressive Transformers) are encoder-decoder Transformer models. T5 frames every NLP task as a text-to-text problem, while BART is pretrained as a denoising autoencoder; both excel at tasks that require understanding the input as well as generating output.

Architecture Overview

Both models use the full encoder-decoder Transformer architecture. The encoder processes the input sequence with bidirectional self-attention (like BERT), producing contextualized representations. The decoder generates output tokens autoregressively, using both causal self-attention and cross-attention to the encoder's output.
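The contrast between the two attention patterns can be sketched with boolean masks (a minimal illustration, not either model's actual implementation): the encoder allows every position to attend everywhere, while the decoder's self-attention is restricted to earlier positions.

```python
import numpy as np

def encoder_mask(seq_len):
    """Bidirectional self-attention: every position may attend to every position."""
    return np.ones((seq_len, seq_len), dtype=bool)

def decoder_causal_mask(seq_len):
    """Causal self-attention: position i may attend only to positions j <= i,
    so generation cannot peek at future tokens."""
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))
```

Row `i` of the mask lists the positions token `i` is allowed to attend to; the causal mask is what makes autoregressive decoding consistent between training and inference.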

In T5, every NLP task is converted to a text-to-text format with a task-specific prefix. For example, translation becomes "translate English to German: [text]" and the model generates the German output. BART uses a similar structure but was pretrained with a denoising objective that corrupts input text and trains the model to reconstruct the original.
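The prefix convention is easy to illustrate with a small helper (the helper itself is hypothetical, but the prefix strings shown are the ones used in the T5 paper):

```python
# T5's text-to-text convention: every task becomes "prefix: input text",
# and the model's answer is always generated as plain text.
TASK_PREFIXES = {
    "translate_en_de": "translate English to German: ",
    "summarize": "summarize: ",
    "cola": "cola sentence: ",  # grammatical-acceptability classification
}

def to_text2text(task, text):
    """Format an input for a T5-style model by prepending its task prefix."""
    return TASK_PREFIXES[task] + text
```

Even classification labels ("acceptable" / "not acceptable" for CoLA) are emitted as literal text by the decoder, which is what lets one model and one loss cover every task.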

Cross-attention layers in the decoder attend to the full encoder output, allowing the decoder to focus on relevant parts of the input when generating each output token.
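A single-head sketch of this mechanism (projections and multi-head splitting omitted for brevity): queries come from the decoder states, while keys and values come from the encoder output, so each decoder position computes a weighted mixture of source representations.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(decoder_states, encoder_states):
    """Illustrative single-head cross-attention without learned projections.

    decoder_states: (tgt_len, d) queries from the decoder.
    encoder_states: (src_len, d) keys/values from the encoder output.
    Returns (tgt_len, d) mixed source representations and the attention weights.
    """
    d = decoder_states.shape[-1]
    scores = decoder_states @ encoder_states.T / np.sqrt(d)  # (tgt_len, src_len)
    weights = softmax(scores, axis=-1)  # each decoder position sums to 1 over the source
    return weights @ encoder_states, weights
```

Because the weights are normalized over source positions, each generated token can "focus" on the parts of the input most relevant at that decoding step.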

Key Innovations

  • T5's text-to-text framework: Unifies classification, translation, summarization, and question answering under a single interface, so one model with one objective and one decoding procedure can handle many different tasks
  • T5's systematic study: The T5 paper explored dozens of design decisions (architectures, pretraining objectives, data, scale), providing a comprehensive recipe for training language models
  • BART's denoising pretraining: Combines multiple corruption strategies—token masking, deletion, infilling, sentence permutation, and document rotation—for robust pretraining
  • Relative position biases: T5 uses learned relative position biases instead of absolute positional embeddings, improving generalization to different sequence lengths
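T5's relative position biases are shared across layers and bucketed so that nearby offsets get exact buckets while distant offsets are grouped logarithmically. A scalar sketch of the bucketing scheme, using T5's default constants (32 buckets, max distance 128):

```python
import math

def relative_position_bucket(rel_pos, bidirectional=True, num_buckets=32, max_distance=128):
    """Map a relative position (key_pos - query_pos) to a learned-bias bucket.

    Small offsets get their own exact bucket; larger offsets share
    logarithmically sized buckets up to max_distance.
    """
    bucket = 0
    if bidirectional:
        num_buckets //= 2
        if rel_pos > 0:          # positions after the query use the upper half
            bucket += num_buckets
        rel_pos = abs(rel_pos)
    else:
        rel_pos = max(-rel_pos, 0)  # causal attention only looks backward
    max_exact = num_buckets // 2
    if rel_pos < max_exact:
        bucket += rel_pos        # exact bucket per offset
    else:                        # logarithmic buckets for distant offsets
        val = max_exact + int(
            math.log(rel_pos / max_exact)
            / math.log(max_distance / max_exact)
            * (num_buckets - max_exact)
        )
        bucket += min(val, num_buckets - 1)
    return bucket
```

Because the bias depends only on the bucket, not on absolute indices, sequences longer than those seen in training still map onto learned buckets, which is the source of the length generalization noted above.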

Common Use Cases

Summarization (BART excels here), machine translation, question answering, text classification, data-to-text generation, dialogue systems, and as the base for instruction-tuned models like FLAN-T5.

Notable Variants & Sizes

T5-Small (60M), T5-Base (220M), T5-Large (770M), T5-3B, T5-11B. FLAN-T5 adds instruction tuning across 1800+ tasks. BART-Base (140M, 6+6 layers), BART-Large (400M, 12+12 layers). mBART extends to 25+ languages. UL2 (20B) unifies multiple pretraining objectives.

Technical Details

T5-Base: 12 encoder + 12 decoder layers, 12 heads, model dim 768, FFN dim 3072. Uses a SentencePiece tokenizer with a 32K vocabulary. Pretrained on C4 (Colossal Clean Crawled Corpus, ~750GB). T5's span-corruption objective masks 15% of tokens in contiguous spans (mean span length 3), replacing each span with a sentinel token. BART-Large: 12+12 layers, 16 heads, dim 1024, trained on ~160GB of text (the RoBERTa corpus). T5 was pretrained with the Adafactor optimizer and an inverse square root learning-rate schedule; BART was trained with Adam.
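The span-corruption format is concrete enough to sketch. The function below is a simplified illustration (real T5 samples span positions randomly; here they are passed in explicitly), but the sentinel convention `<extra_id_0>`, `<extra_id_1>`, ... matches T5's:

```python
def span_corrupt(tokens, spans):
    """T5-style span corruption with explicit (start, end) spans.

    tokens: list of token strings.
    spans:  sorted, non-overlapping (start, end) pairs, end exclusive.
    Returns (input_tokens, target_tokens): each corrupted span is replaced
    by one sentinel in the input; the target lists each sentinel followed
    by the tokens it replaced, ending with a final sentinel.
    """
    inp, tgt = [], []
    prev = 0
    for i, (s, e) in enumerate(spans):
        sentinel = f"<extra_id_{i}>"
        inp.extend(tokens[prev:s])
        inp.append(sentinel)
        tgt.append(sentinel)
        tgt.extend(tokens[s:e])
        prev = e
    inp.extend(tokens[prev:])
    tgt.append(f"<extra_id_{len(spans)}>")  # end-of-targets sentinel
    return inp, tgt
```

For the sentence "Thank you for inviting me to your party last week" with "for inviting" and "last" corrupted, the input becomes "Thank you <extra_id_0> me to your party <extra_id_1> week" and the target "<extra_id_0> for inviting <extra_id_1> last <extra_id_2>", which is the worked example from the T5 paper. Note the target is much shorter than the input, one reason span corruption trains efficiently.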