MobileNet and EfficientNet are efficiency-focused CNN architectures designed for deployment on mobile devices and edge hardware, achieving strong accuracy with dramatically fewer parameters and computations than standard CNNs through innovative building blocks and scaling strategies.
Architecture Overview
MobileNet's key building block is the depthwise separable convolution, which factorizes a standard convolution into two steps: a depthwise convolution (one filter per input channel, capturing spatial patterns) followed by a pointwise 1×1 convolution (mixing information across channels). This reduces computation by a factor approaching K² for a K×K kernel — roughly 8-9× for the 3×3 convolutions MobileNet uses.
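The cost reduction can be checked with a small multiply-accumulate (MAC) count, a sketch using an illustrative layer size (3×3 kernel, 256→256 channels on a 14×14 map):

```python
def conv_macs(k, c_in, c_out, h, w):
    """MACs for a standard k x k convolution (stride 1, 'same' padding)."""
    return k * k * c_in * c_out * h * w

def separable_macs(k, c_in, c_out, h, w):
    """MACs for a depthwise k x k conv followed by a pointwise 1x1 conv."""
    depthwise = k * k * c_in * h * w       # one k x k filter per input channel
    pointwise = c_in * c_out * h * w       # 1x1 conv mixes channels
    return depthwise + pointwise

std = conv_macs(3, 256, 256, 14, 14)
sep = separable_macs(3, 256, 256, 14, 14)
print(round(std / sep, 2))  # 8.69 -- the ratio approaches k^2 = 9 as c_out grows
```

The ratio simplifies to 1 / (1/c_out + 1/k²), which is why 3×3 layers with many output channels land near 9×.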
MobileNetV2 introduces the inverted residual block with linear bottleneck: 1×1 expand (increase channels 6×) → 3×3 depthwise conv → 1×1 project (reduce channels back, with no activation). Unlike ResNet's bottleneck, which narrows then widens, MobileNetV2 widens then narrows, placing the residual connection between the narrow representations — and only when the block has stride 1 and matching input/output channels.
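The block's channel flow can be sketched as a layer plan; this is an illustrative helper (`inverted_residual_spec` is not a library function), not MobileNetV2's actual implementation:

```python
def inverted_residual_spec(c_in, c_out, stride, expand=6):
    """Layer plan for one MobileNetV2-style inverted residual block.

    Returns (layers, use_residual). The shortcut is only used when the
    block keeps spatial size (stride 1) and channel count (c_in == c_out).
    """
    c_mid = c_in * expand
    layers = [
        ("1x1 expand", c_in, c_mid),       # widen, ReLU6 activation
        ("3x3 depthwise", c_mid, c_mid),   # per-channel spatial filtering
        ("1x1 project", c_mid, c_out),     # linear bottleneck, no activation
    ]
    use_residual = (stride == 1 and c_in == c_out)
    return layers, use_residual

layers, skip = inverted_residual_spec(24, 24, stride=1)
print(layers)  # expand to 144 channels, filter, project back to 24
print(skip)    # True: stride 1 and c_in == c_out
```

Note the inversion relative to ResNet: the wide 144-channel representation lives inside the block, while the skip connection joins the narrow 24-channel endpoints.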
EfficientNet uses Neural Architecture Search (NAS) to find an optimal base architecture (EfficientNet-B0), then scales it up uniformly across depth, width, and resolution using a compound scaling method: depth d = α^φ, width w = β^φ, resolution r = γ^φ, subject to α·β²·γ² ≈ 2.
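The compound scaling rule is simple enough to verify numerically with the paper's searched coefficients:

```python
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15  # EfficientNet's searched base coefficients

def compound_scale(phi):
    """Depth, width, and resolution multipliers for compound coefficient phi."""
    return ALPHA ** phi, BETA ** phi, GAMMA ** phi

# FLOPs scale roughly with depth * width^2 * resolution^2, so the
# constraint alpha * beta^2 * gamma^2 ~ 2 means each unit increase of
# phi roughly doubles compute.
flops_factor = ALPHA * BETA ** 2 * GAMMA ** 2
print(round(flops_factor, 2))  # 1.92, close to the target of 2
```

This is why the B0-B7 family forms a compute ladder: each step up chooses a larger φ and inherits the same balanced ratio of depth, width, and resolution.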
Key Innovations
- Depthwise separable convolutions: Factoring spatial and channel operations reduces computation from O(K²·C_in·C_out·H·W) to O(K²·C_in·H·W + C_in·C_out·H·W)
- Inverted residuals: Expanding in the middle and residual-connecting the narrow bottlenecks preserves information while being memory-efficient
- Compound scaling (EfficientNet): Principled simultaneous scaling of depth, width, and resolution balances network dimensions for optimal accuracy-efficiency tradeoff
- Squeeze-and-Excitation (EfficientNet): Channel attention that adaptively recalibrates feature responses, boosting useful channels and suppressing less relevant ones
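The squeeze-and-excitation mechanism above can be sketched in NumPy; weight shapes and the random inputs here are illustrative, with the reduction ratio set to EfficientNet's SE ratio of 0.25:

```python
import numpy as np

def squeeze_excite(x, w1, w2):
    """Squeeze-and-Excitation over a (C, H, W) feature map.

    Squeeze: global average pool to one descriptor per channel.
    Excite: bottleneck MLP (ReLU, then sigmoid) yields per-channel gates.
    """
    s = x.mean(axis=(1, 2))                    # squeeze: (C,)
    z = np.maximum(w1 @ s, 0.0)                # reduce: (C/r,), ReLU
    gate = 1.0 / (1.0 + np.exp(-(w2 @ z)))     # expand + sigmoid: (C,) in (0, 1)
    return x * gate[:, None, None]             # recalibrate each channel

rng = np.random.default_rng(0)
c, r = 16, 4                                   # SE ratio 1/r = 0.25
x = rng.standard_normal((c, 8, 8))
w1 = rng.standard_normal((c // r, c)) * 0.1    # illustrative random weights
w2 = rng.standard_normal((c, c // r)) * 0.1
y = squeeze_excite(x, w1, w2)
print(y.shape)  # (16, 8, 8) -- same shape, channels rescaled by learned gates
```

Because every gate lies in (0, 1), the block can only attenuate channels relative to their input magnitude; in a trained network the MLP learns which channels to keep near 1 and which to suppress.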
Common Use Cases
Mobile and edge device deployment, real-time object detection (SSD-MobileNet), on-device image classification, AR applications, drone/robot vision, efficient transfer learning, and as lightweight backbones for detection and segmentation on resource-constrained hardware.
Notable Variants & Sizes
MobileNetV1 (4.2M params), MobileNetV2 (3.4M), MobileNetV3-Small (2.5M), MobileNetV3-Large (5.4M, NAS-optimized with h-swish activation). EfficientNet-B0 (5.3M, 390M FLOPs) to EfficientNet-B7 (66M, 37B FLOPs). EfficientNetV2 improves training speed with progressive learning and Fused-MBConv blocks. MNASNet, FBNet, and Once-for-All are other NAS-based efficient architectures.
Technical Details
MobileNetV2: inverted residual blocks with expansion factor 6. Architecture: conv2d 3×3 (32 filters) → 17 inverted residual blocks across 7 stages (channels: 16→24→32→64→96→160→320) → conv2d 1×1 (1280) → global avg pool → FC 1000. Width multiplier α ∈ {0.35, 0.5, 0.75, 1.0} and input resolution (96-224) tune the efficiency-accuracy tradeoff. EfficientNet-B0: MBConv blocks with SE ratio 0.25 across 7 stages (channels 16→24→40→80→112→192→320). Compound scaling coefficients: α=1.2, β=1.1, γ=1.15. Training: RMSProp optimizer (both families), AutoAugment, label smoothing 0.1, dropout 0.2-0.5, stochastic depth 0.2. EfficientNet-B0 achieves 77.1% top-1 on ImageNet with only 5.3M params and 390M FLOPs.
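When the width multiplier is applied, reference implementations typically round each scaled channel count to a multiple of 8 for hardware efficiency; a sketch of that commonly used rounding helper (often named `_make_divisible` in MobileNet codebases):

```python
def make_divisible(v, divisor=8, min_value=None):
    """Round a width-multiplied channel count to a hardware-friendly multiple.

    Rounds to the nearest multiple of `divisor`, but never rounds down by
    more than 10% of the original value (bumping up instead).
    """
    if min_value is None:
        min_value = divisor
    new_v = max(min_value, int(v + divisor / 2) // divisor * divisor)
    if new_v < 0.9 * v:
        new_v += divisor
    return new_v

# MobileNetV2 stage widths under width multiplier alpha = 0.75
widths = [16, 24, 32, 64, 96, 160, 320]
print([make_divisible(w * 0.75) for w in widths])
# [16, 24, 24, 48, 72, 120, 240]
```

Note that small layers can round up past the multiplier (16 × 0.75 = 12 becomes 16), so the realized parameter savings are slightly less than α² would suggest.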