ImageGPT: Architecture & How It Works
ImageGPT (iGPT) applies the autoregressive GPT architecture directly to image generation by treating images as sequences of pixels or color clusters, demonstrating that language-model techniques carry over to the visual domain with almost no architectural changes.
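iGPT's preprocessing step can be sketched in a few lines: assign each RGB pixel to its nearest entry in a small color palette, then flatten the image in raster order into a 1-D token sequence for the autoregressive transformer. The palette and image below are toy values, not the 512-entry k-means palette the actual model uses.

```python
import numpy as np

def tokenize_image(img, palette):
    """Map each RGB pixel to its nearest palette entry and flatten
    in raster order, yielding the 1-D token sequence iGPT models
    autoregressively (the real model uses a 512-color k-means palette)."""
    flat = img.reshape(-1, 3).astype(np.float32)                 # (H*W, 3)
    d = ((flat[:, None, :] - palette[None, :, :]) ** 2).sum(-1)  # (H*W, K)
    return d.argmin(axis=1)                                      # token ids

# Toy example: a 2x2 image and a 4-color palette (hypothetical values).
palette = np.array([[0, 0, 0], [255, 0, 0], [0, 255, 0], [0, 0, 255]],
                   dtype=np.float32)
img = np.array([[[250, 5, 5], [2, 2, 2]],
                [[0, 250, 0], [0, 0, 250]]], dtype=np.uint8)
tokens = tokenize_image(img, palette)
print(tokens)  # [1 0 2 3]; the transformer predicts token t from tokens < t
```

Generation then reverses the mapping: sample tokens one at a time and look each up in the palette to recover pixels.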
MobileNet and EfficientNet are efficiency-focused CNN architectures designed for deployment on mobile devices and edge hardware, achieving strong accuracy with dramatically fewer parameters and computations than standard convolutional networks.
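The savings come largely from factorizing convolutions. A quick back-of-the-envelope comparison of parameter counts for a standard convolution versus the depthwise-separable convolution MobileNet is built on (toy layer sizes, biases omitted):

```python
def conv_params(k, c_in, c_out):
    # Standard convolution: one k x k x c_in filter per output channel.
    return k * k * c_in * c_out

def depthwise_separable_params(k, c_in, c_out):
    # Depthwise (one k x k filter per input channel) followed by a
    # pointwise 1 x 1 convolution -- MobileNet's core factorization.
    return k * k * c_in + c_in * c_out

std = conv_params(3, 256, 256)                 # 589,824 parameters
sep = depthwise_separable_params(3, 256, 256)  # 67,840 parameters
print(std, sep, round(std / sep, 1))           # roughly 8.7x fewer parameters
```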
The Convolutional Neural Network (CNN) is the foundational architecture for computer vision, using learnable spatial filters to automatically extract hierarchical visual features from images, from low-level edges and textures up to high-level object parts and shapes.
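The core operation is easy to write down: slide a small kernel over the image and compute a weighted sum at each position. A minimal valid-mode sketch, with a hand-set vertical-edge kernel standing in for what a trained CNN would learn:

```python
import numpy as np

def conv2d(x, kernel):
    """Valid-mode 2-D cross-correlation -- the core op a CNN layer
    applies with many learned kernels in parallel."""
    kh, kw = kernel.shape
    h, w = x.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = (x[i:i + kh, j:j + kw] * kernel).sum()
    return out

# A vertical-edge detector (hand-set here; a CNN would learn it).
edge_kernel = np.array([[1., 0., -1.],
                        [1., 0., -1.],
                        [1., 0., -1.]])
img = np.zeros((5, 6))
img[:, 3:] = 1.0   # dark left half, bright right half
response = conv2d(img, edge_kernel)
print(response)    # strongest response exactly at the dark-to-bright boundary
```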
VQ-VAE (Vector Quantized Variational Autoencoder) and VQGAN (Vector Quantized GAN) learn discrete codebook representations of images, enabling powerful image generation by converting the continuous pixel space into sequences of discrete tokens.
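The quantization step itself is just a nearest-neighbor lookup into the learned codebook. A sketch with toy sizes (K=8 codes of dimension 4; real models use far larger codebooks):

```python
import numpy as np

def quantize(z, codebook):
    """Replace each encoder output vector with its nearest codebook
    entry -- the vector-quantization step shared by VQ-VAE and VQGAN.
    z: (N, D) continuous latents, codebook: (K, D)."""
    d = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)  # (N, K)
    idx = d.argmin(axis=1)                                     # discrete codes
    return idx, codebook[idx]

rng = np.random.default_rng(0)
codebook = rng.normal(size=(8, 4))   # K=8 codes, D=4 (toy sizes)
z = rng.normal(size=(3, 4))          # three continuous encoder outputs
codes, z_q = quantize(z, codebook)
print(codes)  # integer ids a downstream autoregressive model can consume
```

Because `codes` is a sequence of integers, a GPT-style prior can be trained over it exactly as over text tokens.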
Latent Diffusion Models (LDMs), commercialized as Stable Diffusion, generate high-quality images by performing the diffusion process in a compressed latent space rather than pixel space, dramatically reducing the computational cost of both training and sampling.
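The savings are easy to quantify. Stable Diffusion's VAE maps a 512 x 512 x 3 image to a 64 x 64 x 4 latent (8x spatial downsampling, 4 channels), so every denoising step touches far fewer values:

```python
# Element counts per denoising step, pixel space vs. latent space.
pixel_elems = 512 * 512 * 3   # 786,432 values in pixel space
latent_elems = 64 * 64 * 4    # 16,384 values in latent space
print(pixel_elems // latent_elems)  # 48x fewer elements to denoise per step
```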
U-Net is an encoder-decoder architecture with skip connections designed for biomedical image segmentation, producing pixel-precise segmentation masks by combining high-level semantic features with fine-grained spatial details.
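The skip connection at the heart of U-Net is a channel-wise concatenation: a feature map saved on the contracting path is stitched onto the upsampled map at the same resolution on the expanding path. A minimal sketch with toy shapes:

```python
import numpy as np

def skip_merge(encoder_feat, decoder_feat):
    """U-Net skip connection: concatenate an encoder feature map
    (fine spatial detail) with the upsampled decoder map (semantic
    context) along the channel axis. Arrays are (C, H, W)."""
    assert encoder_feat.shape[1:] == decoder_feat.shape[1:]
    return np.concatenate([encoder_feat, decoder_feat], axis=0)

enc = np.random.rand(64, 32, 32)  # saved on the contracting path
dec = np.random.rand(64, 32, 32)  # upsampled on the expanding path
merged = skip_merge(enc, dec)
print(merged.shape)  # (128, 32, 32): the next conv sees detail and context
```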
ResNet (Residual Networks) and DenseNet (Densely Connected Networks) are landmark CNN architectures that solved the degradation problem in deep networks through skip connections, enabling training of networks hundreds of layers deep.
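ResNet's core idea fits in one line: the block outputs x + F(x), so its layers only learn a residual and gradients flow through the identity path unchanged. A toy sketch (the residual function here is a stand-in for the real conv-BN-ReLU stack):

```python
import numpy as np

def residual_block(x, f):
    """ResNet building block: output x + F(x). If F collapses to
    zero, the block degrades gracefully to the identity."""
    return x + f(x)

f = lambda x: 0.1 * x   # hypothetical residual transformation
x = np.ones(4)
y = residual_block(x, f)
print(y)  # [1.1 1.1 1.1 1.1]
```

DenseNet pushes the same idea further by concatenating, rather than adding, the features of every preceding layer.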
The Segment Anything Model (SAM) is Meta's foundation model for image segmentation that can segment any object in any image given a prompt (a point, box, or rough mask).
DINO (Self-Distillation with No Labels) and DINOv2 are self-supervised learning methods that train Vision Transformers to learn powerful visual features without any labeled data, producing representations that rival supervised features on many downstream tasks.
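The "self-distillation" in DINO pairs a student network, trained by gradients, with a teacher whose weights are an exponential moving average of the student's. The update itself is a one-liner (toy one-tensor "network", illustrative momentum value):

```python
import numpy as np

def ema_update(teacher, student, m=0.996):
    """DINO's teacher update: teacher weights track an exponential
    moving average of the student's; the teacher itself is never
    trained by gradient descent."""
    return [m * t + (1 - m) * s for t, s in zip(teacher, student)]

teacher = [np.zeros(3)]   # toy single-tensor "network"
student = [np.ones(3)]
teacher = ema_update(teacher, student)
print(teacher[0])  # [0.004 0.004 0.004]: the teacher drifts slowly toward the student
```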
DeiT (Data-efficient Image Transformers) and MAE (Masked Autoencoders) are two breakthrough approaches to training Vision Transformers effectively: DeiT through advanced training strategies and knowledge distillation, MAE through self-supervised pretraining that reconstructs heavily masked image patches.
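MAE's masking step can be sketched directly: hide a large fraction of the patches (75% in the paper) and hand only the visible ones to the encoder, leaving reconstruction of the hidden ones to a lightweight decoder. Toy patch values below:

```python
import numpy as np

def random_mask(patches, mask_ratio=0.75, seed=0):
    """MAE pretraining setup: keep a random subset of patches for the
    encoder; the decoder must reconstruct the hidden ones.
    patches: (N, D) array of flattened image patches."""
    n = patches.shape[0]
    n_keep = int(n * (1 - mask_ratio))
    perm = np.random.default_rng(seed).permutation(n)
    keep, hide = perm[:n_keep], perm[n_keep:]
    return patches[keep], keep, hide

patches = np.arange(16 * 4, dtype=float).reshape(16, 4)  # 16 toy patches
visible, keep_idx, hide_idx = random_mask(patches)
print(visible.shape)  # (4, 4): the encoder sees only 25% of the image
```

Because the encoder processes only the visible quarter of the patches, pretraining is several times cheaper than running the full sequence.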