ResNet (Residual Networks) and DenseNet (Densely Connected Convolutional Networks) are landmark CNN architectures that solved the degradation problem in deep networks through skip connections, enabling training of networks with hundreds or even thousands of layers.
Architecture Overview
ResNet introduces residual connections (skip connections) that add the input of a block directly to its output: y = F(x) + x. This means each block only needs to learn the residual mapping F(x) = H(x) - x rather than the full mapping H(x), making it easier to train very deep networks. A typical ResNet block contains two or three convolutional layers with batch normalization and ReLU activation.
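The residual block described above can be sketched in PyTorch as a minimal "basic" block (identity skip only; the strided/projection variant used at stage boundaries is omitted here):

```python
import torch
import torch.nn as nn

class BasicBlock(nn.Module):
    """ResNet basic block: y = F(x) + x, where F is conv-BN-ReLU-conv-BN."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = torch.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        # Identity skip: the block only has to learn the residual F(x) = H(x) - x.
        return torch.relu(out + x)

block = BasicBlock(64)
y = block(torch.randn(1, 64, 56, 56))  # output shape matches the input shape
```

Because the skip path is a pure identity, the output shape always equals the input shape in this variant; when the stage downsamples or changes width, a 1×1 strided projection replaces the identity on the skip path.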
DenseNet takes connectivity further: in a dense block, every layer receives feature maps from all preceding layers and passes its own feature maps to all subsequent layers. Layer l receives concatenated features from layers 0, 1, ..., l-1, resulting in L(L+1)/2 connections in an L-layer block. Each layer produces k feature maps (the "growth rate"), so the number of channels grows linearly within each block.
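The linear channel growth can be checked with simple arithmetic; the 64-channel starting width below matches DenseNet-121's first dense block:

```python
def dense_block_channels(k0, k, num_layers):
    """Channel count entering each layer of a dense block with k0 input
    channels and growth rate k; the last entry is the block's output width."""
    return [k0 + i * k for i in range(num_layers + 1)]

# First dense block of DenseNet-121: 64 input channels, k=32, 6 layers.
widths = dense_block_channels(64, 32, 6)
print(widths)  # [64, 96, 128, 160, 192, 224, 256]

# Number of direct connections in an L-layer block: L(L+1)/2.
connections = 6 * (6 + 1) // 2
print(connections)  # 21
```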
Both architectures use a stem (initial convolution + pooling), followed by stages of blocks with downsampling between stages, and end with global average pooling and a fully connected classifier.
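The shared macro-layout can be sketched as a PyTorch skeleton; the stage contents (residual vs. dense blocks) are what distinguish the two families, so they are left as a placeholder here:

```python
import torch
import torch.nn as nn

# Hedged skeleton of the shared layout: stem -> stages -> global avg pool -> classifier.
net = nn.Sequential(
    nn.Conv2d(3, 64, 7, stride=2, padding=3, bias=False),  # stem convolution
    nn.BatchNorm2d(64),
    nn.ReLU(inplace=True),
    nn.MaxPool2d(3, stride=2, padding=1),                  # stem pooling
    # ... stages of residual or dense blocks, with downsampling between stages ...
    nn.AdaptiveAvgPool2d(1),                               # global average pooling
    nn.Flatten(),
    nn.Linear(64, 1000),                                   # fully connected classifier
)
out = net(torch.randn(1, 3, 224, 224))  # 1000-way logits
```

In a real model the `Linear` input width would be the final stage's channel count (2048 for ResNet-50, 1024 for DenseNet-121); 64 is used here only because the stage placeholder leaves the stem width unchanged.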
Key Innovations
- Residual learning (ResNet): Skip connections let gradients flow through the identity path, easing optimization of very deep networks, countering the degradation problem, and making 100+ layer networks trainable
- Bottleneck blocks (ResNet): 1×1 → 3×3 → 1×1 convolution pattern reduces computation while increasing depth
- Dense connectivity (DenseNet): Feature reuse through concatenation reduces redundancy and parameters—on ImageNet, DenseNet-201 (~20M parameters) reaches accuracy comparable to ResNet-101 (~45M) with fewer than half the parameters
- Growth rate (DenseNet): Each layer adds only k new feature maps (typically k=32), keeping the network narrow while feature maps accumulate through concatenation
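The saving from the bottleneck pattern can be quantified with a back-of-the-envelope parameter count (weights only, ignoring BN and biases); the 256→64 reduction below matches ResNet-50's first stage:

```python
def params_basic(c):
    """Two 3x3 convs at full width c (basic block), weights only."""
    return 2 * c * c * 9

def params_bottleneck(c, r):
    """1x1 reduce c->r, 3x3 at width r, 1x1 expand r->c, weights only."""
    return c * r + r * r * 9 + r * c

full = params_basic(256)           # 1,179,648 weights
bottle = params_bottleneck(256, 64)  # 69,632 weights
print(full, bottle)
```

With the usual 4× reduction (256 → 64), the bottleneck block uses roughly 17× fewer weights than two full-width 3×3 convolutions, which is what makes the deep 50/101/152-layer variants affordable.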
Common Use Cases
Image classification, object detection (as backbone in Faster R-CNN), semantic segmentation, medical image analysis, feature extraction, transfer learning (pretrained ResNets are the most widely used vision backbones), and as baselines for vision research.
Notable Variants & Sizes
ResNet-18 (11M), ResNet-34 (22M), ResNet-50 (25M), ResNet-101 (45M), ResNet-152 (60M). ResNeXt adds grouped convolutions. Wide ResNet increases channel width. SE-ResNet adds channel attention. DenseNet-121 (8M), DenseNet-169 (14M), DenseNet-201 (20M), DenseNet-264 (34M).
Technical Details
ResNet-50: stem (7×7 conv, stride 2, 3×3 maxpool, stride 2) → 4 stages with [3, 4, 6, 3] bottleneck blocks, output channels [256, 512, 1024, 2048]. Bottleneck block: 1×1 (reduce) → 3×3 → 1×1 (expand), with BN after each conv, ReLU after the first two, and the final ReLU applied after the residual addition. DenseNet-121: stem (7×7 conv, stride 2, 3×3 maxpool, stride 2) → 4 dense blocks with [6, 12, 24, 16] layers, growth rate k=32. Transition layers between blocks: 1×1 conv (compression θ=0.5) + 2×2 avg pool. Training: SGD with momentum 0.9, weight decay 1e-4, step LR decay, 90-300 epochs on ImageNet. BN placement differs: the original ResNet uses post-activation (conv → BN → ReLU), while DenseNet and the pre-activation ResNet-v2 apply BN → ReLU before each convolution.
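The "121" in DenseNet-121 follows directly from the configuration above, counting each bottleneck dense layer as a 1×1 + 3×3 conv pair, one 1×1 conv per transition, plus the stem conv and the classifier:

```python
# Sanity check of DenseNet-121's layer count from its block configuration.
block_layers = [6, 12, 24, 16]
stem_conv = 1
dense_convs = 2 * sum(block_layers)        # each dense layer = 1x1 + 3x3 conv
transition_convs = len(block_layers) - 1   # one 1x1 conv per transition
classifier = 1                             # final fully connected layer
total = stem_conv + dense_convs + transition_convs + classifier
print(total)  # 121
```

The same counting convention (convs plus the final FC layer, pooling excluded) gives the depth numbers in the other ResNet/DenseNet variant names as well.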