03. How It Works
Image classification
The simplest CV task: given an image, output a label. A model trained on ImageNet learns to distinguish 1,000 categories, from "tabby cat" to "container ship." Classification does not say where in the image the object is, only what it is.
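As a rough sketch of the final step of a classifier: the network emits one raw score (logit) per category, softmax turns the scores into probabilities, and the highest-probability category becomes the label. The three labels and the logit values below are made up for illustration; an ImageNet model would emit 1,000 scores.

```python
import numpy as np

def softmax(logits):
    """Convert raw class scores into probabilities that sum to 1."""
    shifted = logits - np.max(logits)  # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()

# Hypothetical labels and logits, standing in for a real model's output.
labels = ["tabby cat", "container ship", "golden retriever"]
logits = np.array([4.1, 0.3, 1.7])

probs = softmax(logits)
predicted = labels[int(np.argmax(probs))]  # label with the highest probability
print(predicted, round(float(probs.max()), 3))
```

Note that the output carries no position information, only a label and a probability.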
Object detection
Detection adds localization. The model outputs a bounding box around each detected object, along with a class label and a confidence score. Two-stage architectures such as Faster R-CNN brought detection close to real-time speeds, and single-stage detectors such as YOLO (You Only Look Once) made it practical for video. YOLO divides the image into a grid and predicts boxes and classes for all grid cells in a single forward pass, which is what makes it fast.
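To make the output format concrete, here is a minimal sketch of what a detector returns for one frame and how low-confidence results are typically discarded. The `Detection` class, the `keep_confident` helper, the threshold of 0.5, and the sample values are all hypothetical and not taken from any particular library.

```python
from dataclasses import dataclass

@dataclass
class Detection:
    box: tuple    # (x_min, y_min, x_max, y_max) in pixels
    label: str    # predicted class
    score: float  # confidence in [0, 1]

def keep_confident(detections, threshold=0.5):
    """Drop detections whose confidence falls below the threshold."""
    return [d for d in detections if d.score >= threshold]

# Hypothetical raw output for a single frame.
raw = [
    Detection((34, 50, 210, 300), "person", 0.92),
    Detection((220, 80, 400, 260), "bicycle", 0.61),
    Detection((10, 10, 40, 40), "dog", 0.12),
]

for d in keep_confident(raw):
    print(d.label, d.box, d.score)
```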
Semantic and instance segmentation
Segmentation assigns a class label to every pixel in the image. In semantic segmentation, all pixels belonging to "road" get the same label regardless of whether they are part of one road or many. In instance segmentation (used by models like Mask R-CNN), the model also distinguishes between separate instances of the same class, so "car 1" and "car 2" get different labels even though both are cars.
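The difference between the two kinds of output can be shown with toy label maps. The 4x6 arrays and the class IDs below are invented for illustration.

```python
import numpy as np

# Semantic map: 0 = background, 1 = car. Both cars share the same label.
semantic = np.array([
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 1, 1],
    [0, 1, 1, 0, 1, 1],
    [0, 0, 0, 0, 0, 0],
])

# Instance map: 0 = background, 1 = "car 1", 2 = "car 2".
instance = np.array([
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 2, 2],
    [0, 1, 1, 0, 2, 2],
    [0, 0, 0, 0, 0, 0],
])

car_pixels = int((semantic == 1).sum())             # car pixels, instances merged
num_cars = len(np.unique(instance[instance > 0]))   # number of distinct instance IDs
print(car_pixels, num_cars)
```

The semantic map can say how much of the image is "car", but only the instance map can say how many cars there are.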
Convolutional neural networks (CNNs)
CNNs were the dominant architecture for most CV work from roughly 2012 to 2020. The core operation is the convolution: a small filter (e.g., 3x3 pixels) slides across the input image, computing a dot product between the filter weights and the pixels under it at each position. This produces a feature map that highlights where in the image the pattern matched by that filter (an edge, a corner, a texture) appears. The filter weights are learned during training rather than designed by hand. Stacking many such layers allows the network to learn progressively more abstract features: edges in early layers, shapes in middle layers, object parts in later layers.
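The sliding-filter operation can be sketched in a few lines of NumPy. This is a deliberately slow, loop-based illustration with no padding and a stride of 1. Like most deep learning frameworks, it does not flip the kernel, so strictly speaking it computes a cross-correlation. The toy image and the hand-written vertical-edge filter are assumptions for the example; in a trained CNN the filter weights are learned.

```python
import numpy as np

def conv2d(image, kernel):
    """Slide a 2D filter over a 2D image (no padding, stride 1)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i:i + kh, j:j + kw]
            out[i, j] = np.sum(patch * kernel)  # dot product of patch and filter
    return out

# Toy 5x5 image: dark on the left, bright on the right.
image = np.array([[0, 0, 1, 1, 1]] * 5, dtype=float)

# Hand-written filter that responds to dark-to-bright vertical edges.
kernel = np.array([[-1, 0, 1],
                   [-1, 0, 1],
                   [-1, 0, 1]], dtype=float)

feature_map = conv2d(image, kernel)
print(feature_map)
```

The feature map is large where the window straddles the edge and zero where the window covers a uniform region.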
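Pooling layers (typically max pooling) downsample feature maps, reducing spatial resolution while retaining the strongest activations. Convolution on its own is translation equivariant: shifting the input shifts the feature map by the same amount. Pooling adds a degree of translation invariance on top of that, so a cat in the top-left corner is detected by the same learned filters as a cat in the bottom-right corner, and small shifts leave the pooled output largely unchanged.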
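A minimal sketch of 2x2 max pooling, assuming the feature map's height and width divide evenly by the pool size. The example also shows that the invariance is local: a shift that stays inside one pooling window leaves the output unchanged, while a larger shift moves the response to a different output cell.

```python
import numpy as np

def max_pool2d(x, size=2):
    """Non-overlapping max pooling over a 2D feature map."""
    h, w = x.shape
    # Group pixels into (size x size) windows, then take the max of each window.
    return x.reshape(h // size, size, w // size, size).max(axis=(1, 3))

a = np.zeros((4, 4))
a[0, 0] = 9.0          # strong activation in the top-left pixel

b = np.zeros((4, 4))
b[1, 1] = 9.0          # same activation shifted by one pixel, same 2x2 window

c = np.zeros((4, 4))
c[2, 2] = 9.0          # shifted into a different window

print(max_pool2d(a))
print(max_pool2d(b))
print(max_pool2d(c))
```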
Key CNN architectures and what they contributed:
- LeNet (1998): Yann LeCun's landmark design for digit recognition. Proved the concept.
- AlexNet (2012): Won ImageNet with a top-5 error of 15.3%, versus 26.2% for the runner-up. Used ReLU activations, dropout regularization, and trained on two GPUs. Triggered the deep learning explosion.
- VGG (2014): Showed that depth (16-19 layers) with very small 3x3 filters outperforms shallow networks with large filters.
- ResNet (2015): Introduced residual connections (skip connections), which add a layer block's input directly to its output. This let networks be trained reliably at 50-152 layers by easing the optimization problems, such as vanishing gradients, that had made very deep nets hard to train.
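The residual connection from the ResNet entry above can be sketched as y = relu(x + F(x)), where F is the block's learned transform. The version below is a simplification: it uses fully connected layers in place of convolutions and omits batch normalization, and the function name and shapes are chosen for illustration only.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, w1, w2):
    """Simplified residual block: y = relu(x + F(x))."""
    fx = relu(x @ w1) @ w2   # residual branch F(x), two layers
    return relu(x + fx)      # skip connection adds the input back

rng = np.random.default_rng(0)
x = rng.normal(size=(1, 8))

# With all-zero weights the branch contributes nothing,
# so the block simply passes relu(x) through.
w_zero = np.zeros((8, 8))
y = residual_block(x, w_zero, w_zero)
print(np.allclose(y, relu(x)))
```

Because the block only has to learn a correction to its input, a layer that has nothing useful to add can fall back to passing its input along, which is one intuition for why stacking many such blocks remains trainable.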
Transfer learning
Training a CNN from scratch on millions of labeled examples is expensive. Transfer learning reuses a model pre-trained on a large dataset (typically ImageNet) as a starting point. The earlier layers, which have learned general edge and texture detectors, are either frozen or fine-tuned with a small learning rate, while the later, task-specific layers (often just a new classification head) are retrained on the new dataset. This allows good performance with far fewer labeled examples and much less compute.
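The frozen-backbone variant can be sketched in NumPy as follows. Everything here is a stand-in: the "backbone" is a fixed random projection rather than a real pre-trained CNN, the dataset is synthetic, and the head is a plain logistic-regression layer trained with gradient descent. The point is only the division of labor: backbone weights stay untouched while the new head is trained.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a pre-trained backbone (assumption for illustration only).
backbone_w = rng.normal(size=(16, 32)) / 4.0
backbone_before = backbone_w.copy()

def extract_features(x):
    """Frozen feature extractor: its weights are never updated below."""
    return np.maximum(x @ backbone_w, 0.0)

def head_loss(feats, y, w, b):
    """Mean binary cross-entropy of a logistic-regression head."""
    p = 1.0 / (1.0 + np.exp(-(feats @ w + b)))
    return float(-np.mean(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12)))

# Small synthetic "new task": 200 samples, two classes.
y = rng.integers(0, 2, size=200).astype(float)
x = rng.normal(size=(200, 16)) + (2 * y - 1)[:, None]

feats = extract_features(x)          # computed once, since the backbone is frozen
head_w, head_b = np.zeros(32), 0.0   # new task-specific head, trained from scratch
initial_loss = head_loss(feats, y, head_w, head_b)

for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(feats @ head_w + head_b)))
    grad = p - y                                 # loss gradient w.r.t. the logits
    head_w -= 0.01 * feats.T @ grad / len(y)     # update the head only
    head_b -= 0.01 * grad.mean()

final_loss = head_loss(feats, y, head_w, head_b)
print(initial_loss, final_loss)
```

Fine-tuning differs only in that the backbone weights would also receive gradient updates, usually with a smaller learning rate than the head.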
Vision Transformers (ViT)
In 2020, researchers at Google Brain showed that a pure Transformer architecture, applied directly to images, can match or exceed the best CNNs on image classification when pre-trained on sufficiently large datasets. The key insight in the paper "An Image is Worth 16x16 Words" (Dosovitskiy et al., arXiv 2010.11929) is to treat an image as a sequence of patches. A 224x224 image is split into a 14x14 grid of 16x16-pixel patches, giving a sequence of 196 patches. Each patch is flattened and linearly projected into an embedding vector, a position embedding is added, and the resulting sequence is fed into a standard Transformer encoder.
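The patch-to-sequence step can be sketched as follows. The projection and position-embedding matrices are random stand-ins for weights that would be learned, the embedding width of 768 matches the ViT-Base configuration, and the learnable class token that the paper prepends to the sequence is omitted for brevity.

```python
import numpy as np

def patchify(image, patch=16):
    """Split an (H, W, C) image into flattened, non-overlapping patch vectors."""
    h, w, c = image.shape
    gh, gw = h // patch, w // patch
    x = image.reshape(gh, patch, gw, patch, c)
    x = x.transpose(0, 2, 1, 3, 4)              # (gh, gw, patch, patch, c)
    return x.reshape(gh * gw, patch * patch * c)

rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))               # random stand-in for an RGB image

patches = patchify(image)                       # 14 * 14 patches, each 16 * 16 * 3 values
embed_dim = 768                                 # ViT-Base width

# Random stand-ins for the learned projection and position embeddings.
w_proj = rng.normal(size=(patches.shape[1], embed_dim)) * 0.02
pos_embed = rng.normal(size=(patches.shape[0], embed_dim)) * 0.02

tokens = patches @ w_proj + pos_embed           # sequence fed to the Transformer encoder
print(patches.shape, tokens.shape)
```

From this point on the model no longer sees a 2D grid, only a sequence of vectors, which is why the position embedding is needed to preserve spatial layout.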
ViTs have less built-in inductive bias than CNNs, which assume local structure and translation equivariance by design. As a result, ViTs need more training data to reach their potential and tend to underperform comparable CNNs on small datasets, but with enough pre-training data they match or outperform them. They are also easier to integrate with language models, since both use the same Transformer backbone. Many modern multimodal models use a ViT as their image encoder.