BERT (Bidirectional Encoder Representations from Transformers) and its robustly optimized successor RoBERTa reshaped NLP by pretraining deep bidirectional representations, letting the model condition on both left and right context at every layer rather than reading text in one direction.

Architecture Overview

BERT uses the encoder portion of the Transformer architecture. Input text is tokenized using WordPiece, and each token embedding is summed with a positional embedding and segment embedding. A special [CLS] token is prepended, and [SEP] tokens separate sentence pairs.
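The input construction described above can be sketched as follows. This is a toy illustration, not BERT's actual code: the embedding dimension is shrunk to 8 and the vectors are random, but the [CLS]/[SEP] layout, segment ids, and the elementwise sum of token + position + segment embeddings follow the scheme in the text.

```python
import random

random.seed(0)
DIM = 8  # toy hidden size; real BERT-Base uses 768

def embed(key, table):
    # Lazily create a random toy embedding vector for each new key.
    if key not in table:
        table[key] = [random.uniform(-0.1, 0.1) for _ in range(DIM)]
    return table[key]

tok_table, pos_table, seg_table = {}, {}, {}

def encode_pair(sent_a, sent_b):
    # Layout: [CLS] tokens_a [SEP] tokens_b [SEP]
    tokens = ["[CLS]"] + sent_a + ["[SEP]"] + sent_b + ["[SEP]"]
    # Segment ids: 0 for the first sentence (incl. [CLS] and its [SEP]), 1 after.
    segments = [0] * (len(sent_a) + 2) + [1] * (len(sent_b) + 1)
    # Input embedding = token + position + segment embedding, summed elementwise.
    embeddings = []
    for pos, (tok, seg) in enumerate(zip(tokens, segments)):
        t = embed(tok, tok_table)
        p = embed(pos, pos_table)
        s = embed(seg, seg_table)
        embeddings.append([a + b + c for a, b, c in zip(t, p, s)])
    return tokens, segments, embeddings

tokens, segments, embs = encode_pair(["the", "dog", "barks"], ["it", "is", "loud"])
```

Real BERT would first run WordPiece to split rare words into subword units before this step; the sketch assumes already-tokenized input.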

The token representations pass through L layers of bidirectional multi-head self-attention and feed-forward networks. Unlike GPT's causal masking, BERT uses no attention mask (except for padding), so every token can attend to every other token in both directions. Each layer applies post-LayerNorm with residual connections around both the attention and FFN sub-layers.
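The masking difference between BERT and GPT can be made concrete with a small sketch. A padding mask depends only on which key positions are real tokens, so every query position sees the full sequence in both directions; a causal mask additionally forbids attending to future positions. (Toy Python, using 1 = "may attend", 0 = "masked out".)

```python
def padding_mask(lengths, max_len):
    # BERT-style: query i may attend to key j iff j is a real (non-pad) token.
    # The allowed set is the same for every query row -- fully bidirectional.
    return [[[1 if j < n else 0 for j in range(max_len)]
             for _ in range(max_len)] for n in lengths]

def causal_mask(max_len):
    # GPT-style: query i may only attend to positions j <= i (no lookahead).
    return [[1 if j <= i else 0 for j in range(max_len)]
            for i in range(max_len)]

pad = padding_mask([3], 4)[0]   # one sequence of length 3, padded to 4
causal = causal_mask(4)
```

Note that in `pad` the first row and the last row are identical, while in `causal` each row unlocks one more position, which is exactly why BERT sees both directions and GPT does not.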

For classification tasks, the [CLS] token's final hidden state is used. For token-level tasks (NER, QA), individual token representations are used with task-specific heads.
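A minimal sketch of such a head, under toy assumptions (hidden size 8, two classes, random weights; the head itself is hypothetical, not BERT's exact pooler): a linear layer applied to the [CLS] vector at position 0.

```python
import random

random.seed(0)
HIDDEN, NUM_CLASSES = 8, 2  # toy sizes; real BERT-Base uses hidden dim 768

# Hypothetical classification head: logits = W @ h_cls + b.
W = [[random.gauss(0, 0.02) for _ in range(HIDDEN)] for _ in range(NUM_CLASSES)]
b = [0.0] * NUM_CLASSES

def classify(hidden_states):
    # hidden_states: one final-layer vector per token; [CLS] is position 0.
    h_cls = hidden_states[0]
    return [sum(w_i * h_i for w_i, h_i in zip(row, h_cls)) + b_k
            for row, b_k in zip(W, b)]

# For token-level tasks (NER, QA span prediction), the same kind of linear
# head is applied at every position instead of only at hidden_states[0].
states = [[0.1] * HIDDEN for _ in range(5)]  # 5 tokens with toy activations
logits = classify(states)
```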

Key Innovations

  • Masked Language Modeling (MLM): BERT randomly selects 15% of input tokens as prediction targets; of these, 80% are replaced with [MASK], 10% with a random token, and 10% left unchanged. Training the model to recover the originals forces it to use bidirectional context
  • Next Sentence Prediction (NSP): BERT was trained to predict whether two sentences are consecutive, though RoBERTa later showed this was unnecessary
  • RoBERTa improvements: Removed NSP, used dynamic masking (different mask each epoch), trained with larger batches (8K sequences), more data (160GB), and longer training, substantially improving performance
  • Fine-tuning paradigm: Established the pretrain-then-finetune approach that dominated NLP for years
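The MLM corruption procedure from the list above can be sketched directly (toy vocabulary; the 15% selection rate and 80/10/10 split are the values from the BERT paper):

```python
import random

random.seed(42)
VOCAB = ["the", "dog", "barks", "loudly", "cat", "runs"]  # toy vocabulary
MASK = "[MASK]"

def mlm_mask(tokens, mask_prob=0.15):
    # Select ~15% of positions as prediction targets, then apply the
    # 80/10/10 rule to each selected position.
    inputs, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if random.random() < mask_prob:
            labels[i] = tok                       # model must predict this
            r = random.random()
            if r < 0.8:
                inputs[i] = MASK                  # 80%: replace with [MASK]
            elif r < 0.9:
                inputs[i] = random.choice(VOCAB)  # 10%: random token
            # remaining 10%: keep the original token unchanged
    return inputs, labels

inputs, labels = mlm_mask(["the", "dog", "barks", "loudly"] * 10)
```

RoBERTa's dynamic masking amounts to re-running this sampling every time a sequence is seen, rather than fixing one static mask during preprocessing.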

Common Use Cases

Text classification (sentiment analysis, spam detection), named entity recognition, question answering (SQuAD), semantic similarity, natural language inference, information retrieval, and as feature extractors for downstream NLP tasks.

Notable Variants & Sizes

BERT-Base (110M, 12 layers, 768-dim), BERT-Large (340M, 24 layers, 1024-dim). RoBERTa uses the same sizes with improved training. Domain-specific variants include SciBERT, BioBERT, ClinicalBERT, CodeBERT, and multilingual mBERT (104 languages). DistilBERT (66M) offers a compressed version retaining 97% of performance.
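A back-of-envelope check that these sizes really add up: using only the dimensions quoted in this document (12 layers, hidden 768, FFN 3072, ~30K vocab, 512 positions), counting the embedding tables, the per-layer attention and FFN weights, and the [CLS] pooler lands on the advertised ~110M. (Bias and LayerNorm bookkeeping below is a sketch of the standard implementation, not an official tally.)

```python
# Back-of-envelope parameter count for BERT-Base.
V, P, S = 30522, 512, 2      # vocab, position, and segment table sizes
H, F, L = 768, 3072, 12      # hidden dim, FFN dim, number of layers

embeddings = (V + P + S) * H + 2 * H   # three tables + embedding LayerNorm
per_layer = (
    4 * (H * H + H)          # Q, K, V, and attention output projections (+ biases)
    + (H * F + F)            # FFN up-projection
    + (F * H + H)            # FFN down-projection
    + 2 * 2 * H              # two LayerNorms (scale + shift each)
)
pooler = H * H + H           # [CLS] pooler used for the NSP head

total = embeddings + L * per_layer + pooler
print(f"{total / 1e6:.1f}M parameters")
```

Doubling the layers to 24 and the hidden dim to 1024 (with 16 heads and FFN 4096) scales the same arithmetic up to BERT-Large's ~340M.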

Technical Details

BERT-Base: 12 layers, 12 attention heads, hidden dim 768, FFN dim 3072, 512 max sequence length, 30K WordPiece vocabulary. Trained on BooksCorpus (800M words) + English Wikipedia (2.5B words). RoBERTa trained on 160GB of text including CC-News, OpenWebText, and Stories. Pretraining uses Adam with linear warmup and decay; BERT was trained with batch size 256 for 1M steps (roughly four days on 4-16 Cloud TPU pods), while RoBERTa used batches of 8K sequences on 1024 V100 GPUs for about a day.
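The learning-rate schedule mentioned above is simple enough to write down exactly. The sketch below uses the BERT paper's published values (peak learning rate 1e-4, 10K warmup steps, 1M total steps):

```python
def lr_at(step, peak_lr=1e-4, warmup_steps=10_000, total_steps=1_000_000):
    # Linear warmup from 0 to peak_lr over warmup_steps,
    # then linear decay from peak_lr back to 0 at total_steps.
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr * (total_steps - step) / (total_steps - warmup_steps)

print(lr_at(5_000))      # halfway through warmup
print(lr_at(10_000))     # peak learning rate
print(lr_at(1_000_000))  # end of training: decayed to 0
```

RoBERTa uses the same shape of schedule with a higher peak rate to match its much larger batches.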