The Convolutional Neural Network (CNN) is the foundational architecture for computer vision, using learnable spatial filters to automatically extract hierarchical visual features from images, from low-level edges to high-level semantic concepts.
Architecture Overview
A standard CNN consists of alternating convolutional layers and pooling layers, followed by fully connected layers for classification. Each convolutional layer applies a set of learnable filters (kernels) across the spatial dimensions of the input, producing feature maps that highlight the presence of specific patterns.
A convolution operation slides a small filter (typically 3×3 or 5×5) across the input, computing an element-wise multiply-and-sum at each position. Because the same filter weights are used at every position, each filter detects its feature (an edge, texture, or shape) regardless of where it appears in the image (translation equivariance). The output feature map is passed through a nonlinear activation function (typically ReLU) and optionally a pooling layer (e.g., max pooling) that reduces spatial dimensions.
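The slide-multiply-sum operation, ReLU, and max pooling described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a production implementation: the image, the Sobel-style vertical-edge kernel, and the function names are all chosen here for the example.

```python
def conv2d(image, kernel):
    """Valid convolution (no padding), stride 1: slide the kernel and
    compute the element-wise product-and-sum at each position."""
    h, w = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for i in range(h - kh + 1):
        row = []
        for j in range(w - kw + 1):
            s = sum(image[i + di][j + dj] * kernel[di][dj]
                    for di in range(kh) for dj in range(kw))
            row.append(s)
        out.append(row)
    return out

def relu(fmap):
    """Nonlinearity: clamp negative responses to zero."""
    return [[max(0.0, v) for v in row] for row in fmap]

def max_pool2x2(fmap):
    """2x2 max pooling: halve each spatial dimension, keep the max."""
    return [[max(fmap[i][j], fmap[i][j + 1],
                 fmap[i + 1][j], fmap[i + 1][j + 1])
             for j in range(0, len(fmap[0]) - 1, 2)]
            for i in range(0, len(fmap) - 1, 2)]

# A 6x6 image whose right half is bright, and a Sobel-like
# vertical-edge kernel: the filter fires only at the edge columns.
image = [[0, 0, 0, 9, 9, 9] for _ in range(6)]
kernel = [[-1, 0, 1],
          [-2, 0, 2],
          [-1, 0, 1]]

fmap = relu(conv2d(image, kernel))  # 4x4 feature map
pooled = max_pool2x2(fmap)          # 2x2 after pooling
print(fmap[0])   # [0, 36.0, 36.0, 0] - strong response at the edge
print(pooled)    # [[36.0, 36.0], [36.0, 36.0]]
```

Note how the same kernel responds wherever the edge happens to sit, which is exactly the translation equivariance the text describes.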
Early layers learn low-level features (edges, colors, textures), middle layers combine these into parts (eyes, wheels, letters), and deeper layers recognize high-level concepts (faces, cars, words). The final feature maps are flattened and passed through fully connected layers to produce class predictions.
Key Innovations
- Local connectivity: Each neuron connects only to a small spatial region, dramatically reducing parameters compared to fully connected networks
- Weight sharing: The same filter is applied across all spatial positions, providing translation equivariance and further parameter reduction
- Hierarchical features: Stacking layers creates a hierarchy from simple features to complex concepts, mimicking the visual cortex
- Pooling for invariance: Max/average pooling provides some translation invariance and progressively reduces spatial resolution
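The parameter savings from local connectivity and weight sharing are easy to quantify. The sketch below compares a fully connected layer with a convolutional layer on a 224×224×3 input producing 64 outputs; the specific sizes are illustrative, not taken from any particular network.

```python
# Illustrative input: a 224x224 RGB image, 64 output channels/units.
in_h, in_w, in_c = 224, 224, 3
out_channels = 64
k = 3  # 3x3 filters

# Fully connected: every output unit has a weight for every input value.
fc_params = (in_h * in_w * in_c) * out_channels

# Convolutional: one 3x3x3 filter per output channel, shared across
# all spatial positions (weight sharing + local connectivity).
conv_params = (k * k * in_c) * out_channels

print(fc_params)    # 9633792 weights
print(conv_params)  # 1728 weights - a ~5500x reduction
```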
Common Use Cases
Image classification, object detection, facial recognition, medical image analysis, autonomous driving perception, optical character recognition (OCR), video analysis, and as feature extractors for many computer vision tasks.
Notable Variants & Sizes
LeNet-5 (1998, 60K params, handwritten digits), AlexNet (2012, 60M, ImageNet breakthrough), VGGNet (2014, 138M, uniform 3×3 filters), GoogLeNet/Inception (2014, 6.8M, inception modules with parallel filter sizes), ResNet (2015, skip connections), and EfficientNet (2019, compound scaling). Modern CNNs like ConvNeXt (2022) incorporate Transformer-era design principles.
Technical Details
AlexNet (landmark architecture): 5 conv layers + 3 FC layers. Conv1: 96 filters of 11×11, stride 4. Conv2: 256 filters of 5×5. Conv3-5: 384, 384, 256 filters of 3×3. FC: 4096→4096→1000. Total: 60M parameters, trained on 2 GPUs.

VGG-16: 13 conv layers (all 3×3, stride 1) + 3 FC layers, channels 64→128→256→512→512, max pool 2×2 between stages. 138M parameters.

Key hyperparameters: filter size (3×3 standard), stride (1-2), padding ("same" or "valid"), number of filters per layer (64-512), pooling size (2×2).

Training: SGD with momentum 0.9, weight decay 5e-4, step learning rate decay, batch size 128-256, data augmentation (random crops, horizontal flips, color jittering).
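The layer specifications above follow standard arithmetic: output spatial size is (W − F + 2P)/S + 1, and a conv layer's parameter count is F·F·C_in·C_out plus biases. A quick check against two of the layers mentioned (the 227×227 AlexNet input size is an assumption; the commonly quoted 224 does not divide evenly with an 11×11 filter at stride 4):

```python
def conv_out_size(w, f, p, s):
    """Spatial output size: (W - F + 2P) / S + 1."""
    return (w - f + 2 * p) // s + 1

def conv_params(f, c_in, c_out):
    """Weights plus biases for a conv layer with f x f filters."""
    return f * f * c_in * c_out + c_out

# AlexNet conv1: 96 filters of 11x11, stride 4, on a 227x227x3 input.
print(conv_out_size(227, 11, 0, 4))  # 55 -> a 55x55x96 feature map
print(conv_params(11, 3, 96))        # 34944 parameters

# VGG-16's first FC layer: 7x7x512 flattened -> 4096 units.
fc1 = 7 * 7 * 512 * 4096 + 4096
print(fc1)  # 102764544 - most of VGG-16's 138M parameters sit here
```

This arithmetic also explains why later architectures (GoogLeNet, ResNet) replaced large FC layers with global average pooling: the FC layers, not the conv stacks, dominate the parameter budget.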