Graph Convolutional Networks (GCN), Graph Attention Networks (GAT), and GraphSAGE are foundational graph neural network architectures that learn representations of nodes, edges, and graphs by aggregating information from local neighborhoods in graph-structured data.
Architecture Overview
All three architectures follow a message-passing framework where each node iteratively updates its representation by aggregating features from its neighbors. A K-layer GNN allows each node to incorporate information from its K-hop neighborhood.
GCN applies a spectral convolution approximated as a localized first-order filter: H^(l+1) = σ(D̃^(-1/2) Ã D̃^(-1/2) H^(l) W^(l)), where Ã = A + I is the adjacency matrix with self-loops, D̃ is the diagonal degree matrix of Ã, H^(l) is the feature matrix at layer l, and W^(l) is a learnable weight matrix. This effectively averages neighbor features with symmetric normalization.
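The propagation rule above can be sketched in a few lines of NumPy. This is a minimal illustration, not a library implementation; the function name `gcn_layer` and the toy 3-node path graph are chosen here for demonstration, and ReLU stands in for σ:

```python
import numpy as np

def gcn_layer(A, H, W):
    """One GCN propagation step: ReLU(D̃^(-1/2) Ã D̃^(-1/2) H W)."""
    A_tilde = A + np.eye(A.shape[0])           # Ã = A + I (add self-loops)
    d = A_tilde.sum(axis=1)                    # degrees of Ã
    D_inv_sqrt = np.diag(d ** -0.5)            # D̃^(-1/2)
    A_hat = D_inv_sqrt @ A_tilde @ D_inv_sqrt  # symmetric normalization
    return np.maximum(A_hat @ H @ W, 0)        # propagate, transform, ReLU

# Toy graph: 3 nodes in a path 0-1-2
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
H = np.eye(3)                                  # one-hot input features
rng = np.random.default_rng(0)
W = rng.standard_normal((3, 4))                # random weights for illustration
H1 = gcn_layer(A, H, W)                        # shape (3, 4)
```

Each output row mixes a node's own features with its neighbors', weighted by 1/√(d̃_i d̃_j).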
GAT replaces fixed normalization with learned attention coefficients: e_ij = LeakyReLU(a^T [Wh_i || Wh_j]), α_ij = softmax over j ∈ N(i) of e_ij, h_i' = σ(Σ_{j∈N(i)} α_ij W h_j). Multi-head attention concatenates (hidden layers) or averages (output layer) outputs from K independent attention heads.
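A single-head GAT layer can be sketched as below. This is an illustrative NumPy version, not the reference implementation: the explicit double loop is for clarity rather than speed, self-attention over the node itself is included (as in the paper), the LeakyReLU slope of 0.2 matches the paper's default, and ReLU stands in for σ:

```python
import numpy as np

def gat_layer(A, H, W, a):
    """Single-head GAT: e_ij = LeakyReLU(a^T [Wh_i || Wh_j]),
    softmax over each node's neighborhood (self-loop included)."""
    N = A.shape[0]
    Z = H @ W                                   # transformed features (N, F')
    A_hat = A + np.eye(N)                       # attend over neighbors and self
    e = np.full((N, N), -np.inf)                # -inf masks non-edges in softmax
    for i in range(N):
        for j in range(N):
            if A_hat[i, j] > 0:
                s = a @ np.concatenate([Z[i], Z[j]])
                e[i, j] = s if s > 0 else 0.2 * s   # LeakyReLU, slope 0.2
    # row-wise softmax over each neighborhood
    alpha = np.exp(e - e.max(axis=1, keepdims=True))
    alpha /= alpha.sum(axis=1, keepdims=True)
    return np.maximum(alpha @ Z, 0)             # ReLU stands in for σ

rng = np.random.default_rng(0)
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
H = np.eye(3)
W = rng.standard_normal((3, 4))
a = rng.standard_normal(8)                      # attention vector, size 2*F'
out = gat_layer(A, H, W, a)                     # shape (3, 4)
```

A multi-head layer would run K copies of this with independent W and a, then concatenate or average the results.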
GraphSAGE decouples aggregation from transformation: h_N(v) = AGGREGATE({h_u, ∀u ∈ N(v)}), h_v' = σ(W·[h_v || h_N(v)]). It supports multiple aggregators (mean, LSTM, max-pooling) and uses neighborhood sampling for scalability.
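The concatenate-and-transform step with a mean aggregator can be sketched as follows. This is a simplified illustration (the name `sage_layer` is not from any library); it includes the row-wise L2 normalization used in the original algorithm, with ReLU standing in for σ:

```python
import numpy as np

def sage_layer(A, H, W):
    """GraphSAGE-style layer with a mean aggregator:
    h_v' = ReLU(W [h_v || mean({h_u : u in N(v)})]), then L2-normalized."""
    deg = A.sum(axis=1, keepdims=True)
    deg[deg == 0] = 1                            # guard against isolated nodes
    h_neigh = (A @ H) / deg                      # mean of neighbor features
    h_cat = np.concatenate([H, h_neigh], axis=1) # [h_v || h_N(v)]
    out = np.maximum(h_cat @ W, 0)               # shared weights + ReLU
    norms = np.linalg.norm(out, axis=1, keepdims=True)
    return out / np.maximum(norms, 1e-12)        # L2-normalize each row

rng = np.random.default_rng(0)
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
H = rng.standard_normal((3, 4))
W = rng.standard_normal((8, 5))                  # shape (2*F_in, F_out)
out = sage_layer(A, H, W)                        # shape (3, 5)
```

Swapping the mean for an LSTM or max-pooling aggregator changes only the `h_neigh` computation; the concatenation and transform stay the same.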
Key Innovations
- GCN's spectral-to-spatial bridge: Simplified spectral graph convolution into a practical spatial operation (neighborhood averaging), making GNNs accessible and efficient
- GAT's learned attention: Adaptive weighting of neighbors based on feature similarity, handling graphs with heterogeneous node importance
- GraphSAGE's inductive learning: Sampling fixed-size neighborhoods enables generalization to unseen nodes and graphs, unlike the transductive full-graph training of the original GCN (GAT's attention is also inductive in principle and was evaluated inductively on PPI)
- Mini-batch training (GraphSAGE): Neighborhood sampling enables training on massive graphs that don't fit in memory
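The fixed-size neighborhood sampling behind the last two points can be sketched as a simple multi-hop expansion. This is a simplified illustration (the real GraphSAGE pipeline builds per-layer computation blocks and may sample with replacement); the function name and the adjacency-dict representation are assumptions for the example:

```python
import random

def sample_neighborhood(adj, seeds, fanouts, seed=0):
    """Sample a fixed-size multi-hop neighborhood for a mini-batch.
    adj: dict node -> list of neighbors; fanouts: per-hop caps, e.g. [25, 10]."""
    rng = random.Random(seed)
    layers = [sorted(seeds)]                 # hop 0: the batch nodes themselves
    frontier = set(seeds)
    for fanout in fanouts:
        nxt = set()
        for v in frontier:
            nbrs = adj.get(v, [])
            k = min(fanout, len(nbrs))
            nxt.update(rng.sample(nbrs, k))  # at most `fanout` neighbors per node
        layers.append(sorted(nxt))
        frontier = nxt
    return layers

# Toy graph: node 0 connected to 1, 2, 3; a few back-edges
adj = {0: [1, 2, 3], 1: [0], 2: [0, 3], 3: [0, 2]}
layers = sample_neighborhood(adj, seeds=[0], fanouts=[2, 2])
```

Because each hop touches at most `fanout` neighbors per node, the cost of a mini-batch is bounded regardless of how large (or skewed) the full graph is.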
Common Use Cases
Social network analysis, recommendation systems (Pinterest uses GraphSAGE), molecular property prediction, drug discovery, citation network classification, traffic prediction, knowledge graph reasoning, fraud detection, and chip/circuit design optimization.
Notable Variants & Sizes
GCN (Kipf & Welling, 2017), GAT (Veličković et al., 2018), GATv2 (fixes static attention), GraphSAGE (Hamilton et al., 2017), GIN (Graph Isomorphism Network), PNA (Principal Neighbourhood Aggregation), and GPS (General Powerful Scalable graph Transformer). Models typically have 2-8 layers due to oversmoothing.
Technical Details
- GCN: typically 2-3 layers, hidden dim 16-256, ReLU activation, dropout 0.5. On the Cora citation network (2,708 nodes, 5,429 edges): 2 layers, 16 hidden units, 81.5% test accuracy.
- GAT: 2 layers; first layer 8 attention heads × 8 hidden units each, second layer 1 head with output dimension equal to the number of classes (7 for Cora); ELU activation, attention dropout 0.6.
- GraphSAGE: 2 layers, sampling 25 neighbors at layer 1 and 10 at layer 2, hidden dim 256, mean or max-pooling aggregator.
- Training: Adam, lr 0.01 (GCN) or 0.005 (GAT); full-batch for GCN/GAT, mini-batch with neighborhood sampling for GraphSAGE.
- Oversmoothing limits depth: node representations converge to indistinguishable states beyond roughly 4-8 layers.
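Oversmoothing can be made concrete with a small experiment: repeatedly apply the symmetric-normalized propagation (with weights and nonlinearities stripped out, an assumption made here for illustration) and track how similar node representations become. The helper names below are illustrative:

```python
import numpy as np

def sym_norm(A):
    """Symmetric normalization D̃^(-1/2) Ã D̃^(-1/2) with self-loops."""
    A_t = A + np.eye(A.shape[0])
    d_inv_sqrt = np.diag(A_t.sum(axis=1) ** -0.5)
    return d_inv_sqrt @ A_t @ d_inv_sqrt

def mean_pairwise_cosine(H):
    """Average cosine similarity between all distinct pairs of node rows."""
    Hn = H / np.linalg.norm(H, axis=1, keepdims=True)
    S = Hn @ Hn.T
    n = H.shape[0]
    return (S.sum() - n) / (n * (n - 1))

# Path graph 0-1-2-3-4 with one-hot features
A = np.zeros((5, 5))
for i in range(4):
    A[i, i + 1] = A[i + 1, i] = 1.0
A_hat = sym_norm(A)
H = np.eye(5)
sims = {k: mean_pairwise_cosine(np.linalg.matrix_power(A_hat, k) @ H)
        for k in (2, 8, 64)}
```

As the number of propagation steps grows, the rows collapse toward a single dominant direction and the average pairwise cosine similarity climbs toward 1, which is why stacking many such layers erases the distinctions a classifier needs.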