The Multilayer Perceptron (MLP) is the simplest and most fundamental neural network architecture: a stack of fully connected layers that approximates arbitrary functions through repeated nonlinear transformations of the input.

Architecture Overview

An MLP consists of an input layer, one or more hidden layers, and an output layer. Each layer is fully connected: every neuron connects to every neuron in the next layer. At each layer, the computation is: h = σ(Wx + b), where W is the weight matrix, b is the bias vector, and σ is a nonlinear activation function.

Data flows forward through the network: input x → hidden layer 1 (h_1 = σ(W_1·x + b_1)) → hidden layer 2 (h_2 = σ(W_2·h_1 + b_2)) → ... → output layer (y = W_out·h_L + b_out). The output activation depends on the task: softmax for classification, linear for regression, sigmoid for binary classification.
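The forward pass described above can be sketched in plain NumPy (the layer sizes, ReLU choice, and initialization scale here are illustrative assumptions, not prescribed by the text):

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def mlp_forward(x, weights, biases):
    """Forward pass: h = sigma(W @ h + b) at each hidden layer,
    followed by a linear output layer."""
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = relu(W @ h + b)                   # hidden layers: nonlinear
    return weights[-1] @ h + biases[-1]       # output layer: linear

rng = np.random.default_rng(0)
sizes = [4, 8, 8, 3]  # input -> two hidden layers -> output
Ws = [rng.standard_normal((m, n)) * 0.1 for n, m in zip(sizes, sizes[1:])]
bs = [np.zeros(m) for m in sizes[1:]]

x = rng.standard_normal(4)
y = mlp_forward(x, Ws, bs)
print(y.shape)  # (3,)
```

A softmax or sigmoid would be applied to the final linear output for classification tasks, as noted above.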

Training uses backpropagation: the loss gradient is computed from the output layer backward through each hidden layer using the chain rule, and weights are updated via gradient descent (SGD, Adam, etc.). The Universal Approximation Theorem guarantees that even a single hidden layer can approximate any continuous function on a compact domain to arbitrary accuracy, given enough neurons.
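To make the backward pass concrete, here is a minimal sketch of backpropagation and full-batch gradient descent for a one-hidden-layer MLP fitting a toy regression target (the target function, tanh activation, learning rate, and step count are all illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(42)
# Toy regression: learn y = sin(x) on [-pi, pi]
X = rng.uniform(-np.pi, np.pi, (256, 1))
Y = np.sin(X)

# One hidden layer of 32 tanh units, linear output
W1 = rng.standard_normal((1, 32)) * 0.5
b1 = np.zeros(32)
W2 = rng.standard_normal((32, 1)) * 0.5
b2 = np.zeros(1)

lr = 0.05
for step in range(2000):
    # Forward pass
    h = np.tanh(X @ W1 + b1)
    pred = h @ W2 + b2
    err = pred - Y
    loss = np.mean(err ** 2)

    # Backward pass: chain rule from the output layer inward
    n = len(X)
    g_pred = 2.0 * err / n            # dL/dpred for mean-squared error
    g_W2 = h.T @ g_pred
    g_b2 = g_pred.sum(axis=0)
    g_h = g_pred @ W2.T
    g_z = g_h * (1.0 - h ** 2)        # tanh'(z) = 1 - tanh(z)^2
    g_W1 = X.T @ g_z
    g_b1 = g_z.sum(axis=0)

    # Vanilla gradient descent update
    W1 -= lr * g_W1; b1 -= lr * g_b1
    W2 -= lr * g_W2; b2 -= lr * g_b2

print(f"final MSE: {loss:.4f}")
```

Optimizers like Adam replace the plain update step with per-parameter adaptive learning rates, but the gradient computation is unchanged.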

Key Innovations

  • Backpropagation (1986): Efficient gradient computation through the chain rule enabled training of multi-layer networks, sparking the neural network revolution
  • Activation functions: Progression from sigmoid/tanh → ReLU (2011) → GELU/SiLU (modern) mitigated vanishing-gradient problems and improved training speed
  • Dropout regularization: Randomly zeroing neurons during training prevents co-adaptation and reduces overfitting
  • Batch normalization: Normalizing layer inputs stabilizes training and allows higher learning rates
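As a sketch of how dropout is typically implemented (this is the standard "inverted dropout" formulation; exact details vary by framework):

```python
import numpy as np

def dropout(h, p, rng, training=True):
    """Inverted dropout: zero each unit with probability p during training,
    and scale survivors by 1/(1-p) so the expected activation is unchanged.
    At inference time the layer is an identity."""
    if not training or p == 0.0:
        return h
    mask = rng.random(h.shape) >= p
    return h * mask / (1.0 - p)

rng = np.random.default_rng(0)
h = np.ones((4, 10))
out = dropout(h, p=0.5, rng=rng)
# Training mode: entries are either 0.0 (dropped) or 2.0 (kept and rescaled)
```

The rescaling means no change is needed at inference time, which is why frameworks expose dropout as a mode-dependent layer rather than a fixed transformation.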

Common Use Cases

Tabular data classification and regression, function approximation, as the feed-forward sub-layer within Transformers and other architectures, reinforcement learning policy and value networks, physics-informed neural networks (PINNs), neural radiance fields (NeRF), and as baseline models for benchmarking.

Notable Variants & Sizes

Standard MLP (fully connected), MLP-Mixer (patch-based vision MLP), gMLP (gated MLP for NLP), ResMLP (residual MLP for vision), KAN (Kolmogorov-Arnold Networks, learnable activation functions on edges instead of nodes). In practice, modern MLPs within Transformers use SwiGLU: FFN(x) = (xW_1 ⊙ SiLU(xW_gate))W_2.
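The SwiGLU feed-forward block above can be sketched in NumPy (dimensions are illustrative, and the bias-free linear layers are an assumption following Llama-style Transformers):

```python
import numpy as np

def silu(z):
    return z / (1.0 + np.exp(-z))  # SiLU(z) = z * sigmoid(z)

def swiglu_ffn(x, W1, W_gate, W2):
    """FFN(x) = (x @ W1 * SiLU(x @ W_gate)) @ W2, with elementwise gating."""
    return (x @ W1 * silu(x @ W_gate)) @ W2

d_model, d_ff = 768, 2048  # 2048 = (8/3) * 768
rng = np.random.default_rng(0)
W1 = rng.standard_normal((d_model, d_ff)) * 0.02
W_gate = rng.standard_normal((d_model, d_ff)) * 0.02
W2 = rng.standard_normal((d_ff, d_model)) * 0.02

x = rng.standard_normal((4, d_model))  # a batch of 4 token embeddings
y = swiglu_ffn(x, W1, W_gate, W2)
print(y.shape)  # (4, 768)
```

Note the three weight matrices (up-projection, gate, down-projection) versus two in a classic FFN, which is why the expansion factor is reduced from 4× to roughly 8/3×.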

Technical Details

An MLP with two hidden layers of sizes [512, 256] for 784-dim input (MNIST):

  • Layer 1: 784×512 + 512 = 401,920 params
  • Layer 2: 512×256 + 256 = 131,328 params
  • Output: 256×10 + 10 = 2,570 params
  • Total: ~536K params

Training: Adam (lr=1e-3), batch size 128, ReLU activations, dropout 0.2-0.5; achieves ~98.5% test accuracy on MNIST. The MLP within a Transformer layer typically has hidden dim = 4× model dim (e.g., 768→3072→768) and accounts for ~2/3 of the layer's parameters. A modern SwiGLU FFN uses an ~8/3× expansion to roughly match the parameter count of a 4× two-matrix FFN despite its extra gate matrix: dim → (8/3)·dim with gating → dim. Xavier or Kaiming initialization is standard for the weight matrices.
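The parameter counts above can be checked directly (each fully connected layer contributes in×out weights plus out biases):

```python
def linear_params(n_in, n_out):
    return n_in * n_out + n_out  # weight matrix + bias vector

# MNIST MLP: 784 -> 512 -> 256 -> 10
layers = [(784, 512), (512, 256), (256, 10)]
counts = [linear_params(i, o) for i, o in layers]
print(counts, sum(counts))  # [401920, 131328, 2570] 535818

# Transformer FFN at the cited 4x expansion: 768 -> 3072 -> 768
ffn = linear_params(768, 3072) + linear_params(3072, 768)
print(ffn)  # 4722432
```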