Computer Vision

In Short

Computer vision is the field of AI that teaches machines to interpret and understand visual information from images and video. It powers everything from facial recognition to autonomous vehicles, and today's vision models increasingly share the same architecture as language models.

01. What It Is

Computer vision (CV) is a branch of artificial intelligence concerned with giving machines the ability to extract meaningful information from digital images, video, and other visual inputs. The goal is not simply to store pixels but to produce structured understanding: what objects are present, where they are, what they mean in context, and how the scene changes over time.

The field sits at the intersection of computer science, mathematics, and optics. It draws heavily from signal processing, linear algebra, and, increasingly, deep learning. Modern CV systems are trained end-to-end on large labeled datasets rather than hand-coded with rules.

02. Why It Matters

Visual perception is central to how humans navigate the world, so equipping machines with that capability unlocks an enormous range of applications. Autonomous vehicles must detect pedestrians and road signs in real time. Radiologists use CV-assisted tools to catch tumors in X-rays and MRI scans that a human eye might miss. Manufacturing lines use it for defect detection. Smartphones use it for portrait blur and face unlock. Search engines use it to index the content of billions of images.

CV also sits at the center of the current multimodal AI moment. Large models like GPT-4V and Google Gemini accept both text and images as input, which requires a CV encoder to translate visual content into representations the language model can reason over. Understanding CV is therefore essential to understanding how multimodal systems work (see Multimodal Models).

03. How It Works

Image classification

The simplest CV task: given an image, output a label. A model trained on ImageNet learns to distinguish 1,000 categories, from "tabby cat" to "container ship." Classification does not say where in the image the object is, only what it is.
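
As a rough illustration, the last step of a classifier can be sketched as turning raw per-class scores (logits) into a single label. The three class names and the logit values below are made up for the example; a real ImageNet model would emit 1,000 logits, and the helper names are hypothetical.

```python
import math

def softmax(logits):
    """Convert raw scores into non-negative values that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classify(logits, class_names):
    """Return the highest-scoring label and its softmax score."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return class_names[best], probs[best]

# Hypothetical 3-class example (illustrative values only).
CLASS_NAMES = ["tabby cat", "container ship", "golden retriever"]
label, score = classify([2.0, 0.5, -1.0], CLASS_NAMES)
print(label, round(score, 3))
```

Note that the output is one label for the whole image; nothing in it says where the object is.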

Object detection

Detection adds localization. The model outputs bounding boxes around each detected object along with class labels and confidence scores. Architectures like Faster R-CNN and YOLO (You Only Look Once) pushed detection toward real-time speeds. Faster R-CNN is a two-stage detector that first proposes candidate regions and then classifies them. YOLO divides the image into a grid and predicts boxes and classes in a single forward pass, making it fast enough for video.
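
The shape of a detector's output, and the grid idea behind YOLO, can be sketched as follows. This is a simplified illustration, not any library's API: `Detection`, `filter_detections`, and `responsible_cell` are hypothetical names, the detections are invented, and the 7x7 grid on a 448x448 input is an assumption borrowed from the original YOLO design.

```python
from dataclasses import dataclass

@dataclass
class Detection:
    box: tuple    # (x, y, width, height) in pixels, top-left origin
    label: str
    score: float  # model confidence in [0, 1]

def filter_detections(detections, min_score=0.5):
    """Keep detections at or above the threshold, highest score first."""
    kept = [d for d in detections if d.score >= min_score]
    return sorted(kept, key=lambda d: d.score, reverse=True)

def responsible_cell(box, image_size, grid=7):
    """Return (row, col) of the grid cell that contains the box center."""
    x, y, w, h = box
    img_w, img_h = image_size
    col = min(int((x + w / 2) / img_w * grid), grid - 1)
    row = min(int((y + h / 2) / img_h * grid), grid - 1)
    return row, col

# Invented raw outputs for a 448x448 frame.
raw = [
    Detection((34, 50, 120, 80), "car", 0.91),
    Detection((200, 40, 30, 90), "pedestrian", 0.62),
    Detection((10, 10, 15, 15), "road sign", 0.18),
]
kept = filter_detections(raw, min_score=0.5)
for d in kept:
    print(d.label, d.score, responsible_cell(d.box, (448, 448)))
```

The threshold of 0.5 is arbitrary here; in practice it is tuned per application.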

Semantic segmentation

Segmentation assigns a class label to every pixel in the image. In semantic segmentation, all pixels belonging to "road" get the same label regardless of whether they are part of one road or many. In instance segmentation (used by models like Mask R-CNN), the model also distinguishes between separate instances of the same class, so "car 1" and "car 2" get different labels even though both are cars.
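
The difference between the two outputs can be shown with a toy 3x6 image. The class ids and masks below are invented for illustration; real models produce masks at image resolution.

```python
# Semantic mask: one class id per pixel.
# Hypothetical ids: 0 = background, 1 = road, 2 = car.
semantic = [
    [0, 2, 2, 0, 2, 2],
    [0, 2, 2, 0, 2, 2],
    [1, 1, 1, 1, 1, 1],
]

# Instance mask for the "car" class: one instance id per pixel.
# 0 = no car, 1 = "car 1", 2 = "car 2".
instance = [
    [0, 1, 1, 0, 2, 2],
    [0, 1, 1, 0, 2, 2],
    [0, 0, 0, 0, 0, 0],
]

def pixel_counts(mask):
    """Count how many pixels carry each id."""
    counts = {}
    for row in mask:
        for value in row:
            counts[value] = counts.get(value, 0) + 1
    return counts

semantic_counts = pixel_counts(semantic)  # both cars share class id 2
instance_counts = pixel_counts(instance)  # each car has its own id
print(semantic_counts)
print(instance_counts)
```

In the semantic mask the two cars are indistinguishable (eight pixels of class 2); the instance mask splits them into two groups of four pixels.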

Convolutional neural networks (CNNs)

CNNs were the dominant architecture for most CV work from roughly 2012 to 2020. The core operation is the convolution: a small filter (e.g., 3x3 pixels) slides across the input image, computing a dot product at each position. This produces a feature map that highlights where in the image certain low-level patterns (edges, corners, textures) appear. Stacking many such layers allows the network to learn progressively more abstract features: edges in early layers, shapes in middle layers, object parts in later layers.
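
A minimal sketch of the sliding dot product, in plain Python for readability (real frameworks use optimized tensor kernels). The 5x5 image and the hand-written vertical-edge filter are illustrative assumptions; in a trained CNN the filter values are learned.

```python
def conv2d(image, kernel):
    """Valid-mode sliding dot product (cross-correlation, no padding, stride 1)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            acc = 0
            for di in range(kh):
                for dj in range(kw):
                    acc += image[i + di][j + dj] * kernel[di][dj]
            row.append(acc)
        out.append(row)
    return out

# 5x5 image: dark (0) on the left, bright (1) on the right.
image = [[0, 0, 1, 1, 1] for _ in range(5)]

# Hand-written filter that responds to dark-to-bright vertical edges.
kernel = [
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
]

feature_map = conv2d(image, kernel)
for row in feature_map:
    print(row)  # each row is [3, 3, 0]
```

The feature map is large where the window straddles the edge and zero where the window sees uniform brightness.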

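Pooling layers (typically max pooling) downsample feature maps, reducing spatial resolution while retaining the most active signals. Convolution itself is translation-equivariant (shifting the input shifts the feature map by the same amount), and pooling adds a degree of invariance to small shifts. Together with a final global pooling or classification layer, this lets a cat in the top-left corner drive largely the same "cat" features as a cat in the bottom-right corner, although the invariance is approximate rather than exact.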
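
A small sketch of 2x2 max pooling shows both the downsampling and the limits of the shift tolerance from a single pooling layer. The feature maps are invented, and `max_pool2d` is a hypothetical helper, not a library function.

```python
def max_pool2d(fmap, size=2):
    """Non-overlapping max pooling (stride equals window size).
    Trailing rows or columns that do not fill a window are dropped."""
    out = []
    for i in range(0, len(fmap) - size + 1, size):
        row = []
        for j in range(0, len(fmap[0]) - size + 1, size):
            window = [fmap[i + di][j + dj]
                      for di in range(size) for dj in range(size)]
            row.append(max(window))
        out.append(row)
    return out

def single_activation(col):
    """4x4 map with one strong activation in row 0 at the given column."""
    fmap = [[0] * 4 for _ in range(4)]
    fmap[0][col] = 9
    return fmap

pooled_a = max_pool2d(single_activation(0))  # original position
pooled_b = max_pool2d(single_activation(1))  # shifted by one pixel
pooled_c = max_pool2d(single_activation(2))  # shifted by two pixels

print(pooled_a)  # [[9, 0], [0, 0]]
print(pooled_b)  # [[9, 0], [0, 0]]  same output: shift stayed inside the window
print(pooled_c)  # [[0, 9], [0, 0]]  output moved: shift crossed a window boundary
```

A one-pixel shift is absorbed, a two-pixel shift is not, which is why the tolerance to large displacements comes from stacking layers rather than from any single pooling step.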

Key CNN architectures and what they contributed:

LeNet (1998):
Yann LeCun's landmark design for digit recognition. Proved the concept.
AlexNet (2012):
Won ImageNet with a top-5 error of 15.3%, versus 26.2% for the runner-up. Used ReLU activations, dropout regularization, and trained on two GPUs. Triggered the deep learning explosion.
VGG (2014):
Showed that depth (16-19 layers) with very small 3x3 filters outperforms shallow networks with large filters.
ResNet (2015):
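Introduced residual connections (skip connections), allowing networks to be trained reliably at 50-152 layers. Addressed the optimization difficulties that made very deep nets hard to train: often described as vanishing gradients, and framed in the ResNet paper as a "degradation" problem in which deeper plain networks train worse than shallower ones.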
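
The residual idea from ResNet can be sketched on plain vectors. This is a deliberate simplification: actual ResNet blocks use convolutions and batch normalization, whereas the block below uses small dense layers, and all names and weights are illustrative.

```python
def relu(v):
    return [max(0.0, x) for x in v]

def linear(v, weights):
    """Multiply vector v by a weight matrix given as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) for row in weights]

def residual_block(x, w1, w2):
    """y = relu(x + F(x)), where F(x) = W2 * relu(W1 * x)."""
    fx = linear(relu(linear(x, w1)), w2)
    return relu([a + b for a, b in zip(x, fx)])

x = [1.0, 2.0, 3.0]

# With all-zero weights, F(x) is zero and the block passes x through unchanged.
zeros = [[0.0] * 3 for _ in range(3)]
y = residual_block(x, zeros, zeros)
print(y)  # [1.0, 2.0, 3.0]
```

Because the block starts out close to an identity mapping, adding more blocks does not have to make the network worse, and the skip path gives gradients a direct route to earlier layers.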

Transfer learning

Training a CNN from scratch on millions of labeled examples is expensive. Transfer learning reuses a model pre-trained on a large dataset (typically ImageNet) as a starting point. The earlier layers, which have learned general edge and texture detectors, are either frozen or fine-tuned with a small learning rate. The later, task-specific layers are replaced or retrained on the new dataset with a normal learning rate. This allows good performance with far fewer labeled examples and much less compute.
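
The freeze-or-fine-tune choice can be sketched as a per-layer configuration. This is framework-agnostic pseudostructure rather than a real training API: `Layer`, `configure_for_transfer`, the layer names, and the learning rates are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Layer:
    name: str
    trainable: bool = True
    lr: float = 0.0

def configure_for_transfer(layers, n_head_layers=1, head_lr=1e-3, backbone_lr=0.0):
    """Mark the last n_head_layers as the trainable head.
    backbone_lr == 0 freezes the earlier layers; a small positive
    value fine-tunes them gently instead."""
    split = len(layers) - n_head_layers
    for idx, layer in enumerate(layers):
        if idx >= split:
            layer.trainable, layer.lr = True, head_lr
        else:
            layer.trainable = backbone_lr > 0
            layer.lr = backbone_lr
    return layers

def make_model():
    return [Layer("conv1"), Layer("conv2"), Layer("conv3"), Layer("classifier")]

# Option 1: freeze the pre-trained backbone, train only the new head.
frozen_model = configure_for_transfer(make_model())
frozen = [l.name for l in frozen_model if not l.trainable]

# Option 2: fine-tune everything, with a much smaller rate for the backbone.
tuned_model = configure_for_transfer(make_model(), backbone_lr=1e-5)

print(frozen)
print([(l.name, l.lr) for l in tuned_model])
```

Freezing is the cheaper option and usually the safer one when the new dataset is small; gentle fine-tuning tends to help when the new domain differs noticeably from the pre-training data.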

Vision Transformers (ViT)

In 2020, researchers at Google Brain showed that a pure Transformer architecture, applied directly to images, can match or exceed CNN performance when pre-trained on sufficiently large datasets. The key insight in the paper "An Image is Worth 16x16 Words" (Dosovitskiy et al., arXiv 2010.11929) is to treat an image as a sequence of patches. A 224x224 image is split into a 14x14 grid of 16x16-pixel patches, giving 196 patches. Each patch is flattened and linearly projected into an embedding vector, a position embedding is added, and the resulting sequence is fed into a standard Transformer encoder.
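
The patch-splitting step can be sketched on a toy single-channel image. `patchify` is a hypothetical helper, and the 4x4 image with 2x2 patches stands in for the 224x224 image with 16x16 patches; the linear projection and position embeddings are omitted.

```python
def patchify(image, patch):
    """Split an H x W image (list of rows) into flattened patch x patch
    blocks, ordered left to right, top to bottom."""
    h, w = len(image), len(image[0])
    if h % patch or w % patch:
        raise ValueError("image size must be divisible by patch size")
    patches = []
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            patches.append([image[top + i][left + j]
                            for i in range(patch) for j in range(patch)])
    return patches

# Toy 4x4 image whose pixel values are 0..15 in reading order.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
patches = patchify(image, 2)

for p in patches:
    print(p)
# [0, 1, 4, 5]
# [2, 3, 6, 7]
# [8, 9, 12, 13]
# [10, 11, 14, 15]

# Same arithmetic at the scale described in the text.
n_patches = (224 // 16) ** 2
print(n_patches)  # 196
```

Each flattened patch plays the role that a word token plays in a language model.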

ViTs have less built-in inductive bias than CNNs (CNNs assume local structure and translation equivariance by design). This means ViTs need more data to reach their potential, but at scale they tend to match or outperform CNNs and are easier to integrate with language models, since both use the same Transformer backbone. Many modern multimodal models use a ViT as their image encoder.

04. Key Terms / Milestones

Term	Definition
ImageNet	A dataset of 14 million labeled images across 20,000+ categories. A 1,000-category subset was used for the annual ILSVRC benchmark. The 2012 competition result with AlexNet is considered the starting gun of the deep learning era.
ILSVRC	ImageNet Large Scale Visual Recognition Challenge. Annual competition 2010-2017 that drove rapid progress in image classification.
Feature map	The output of applying a convolutional filter to an image or to a previous layer's output.
Receptive field	The region of the input image that influences a given neuron's activation. Deeper layers have larger receptive fields.
Bounding box	A rectangle described by (x, y, width, height) that localizes an object in an image.
Anchor boxes	Pre-defined box shapes used by detection models as reference templates for predicting object locations.
Mean Average Precision (mAP)	The standard metric for evaluating object detection models, averaging precision across recall levels and object categories.
COCO	Common Objects in Context. The dominant benchmark dataset for object detection, segmentation, and keypoint estimation (330,000 images, 80 object categories).
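
Several of these terms meet in one small calculation: intersection over union (IoU), the overlap score that detection benchmarks commonly use to decide whether a predicted bounding box matches a ground-truth box before mAP is computed. The sketch below assumes the (x, y, width, height) format from the table with a top-left origin; the function name and the example boxes are illustrative.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x, y, width, height) boxes."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

predicted = (0, 0, 10, 10)
ground_truth = (5, 5, 10, 10)

overlap = iou(predicted, ground_truth)
print(round(overlap, 3))  # 0.143  (25 shared pixels / 175 in the union)
```

A prediction is typically counted as correct only if its IoU with a ground-truth box of the same class clears a threshold such as 0.5, so this pair would be scored as a miss.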

05. Examples

Medical imaging:
CV models detect diabetic retinopathy from retinal photographs, identify cancerous nodules in CT scans, and flag anomalies in histology slides. The FDA has cleared several AI-assisted diagnostic tools that use CNN or ViT architectures.

Autonomous driving:
Many self-driving systems combine CV (camera-based lane and object detection) with LiDAR and radar, while Tesla's Autopilot relies primarily on cameras. In either case, the perception stack must classify objects, estimate their 3D positions, and predict their future trajectories in real time.

Optical character recognition (OCR):
Reading text in images, from scanned documents to street signs to receipts. Modern OCR pipelines use a CNN to extract features, then a sequence model to decode the character sequence.
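
The text does not say how the sequence model decodes characters. One common scheme, shown here as an assumption rather than as the method of any particular OCR system, is greedy CTC decoding: the model emits one symbol per image column, repeated symbols are collapsed, and a special blank symbol is removed. The per-column predictions below are invented.

```python
BLANK = "-"  # hypothetical blank symbol; real systems use a reserved index

def ctc_greedy_decode(frame_labels, blank=BLANK):
    """Collapse consecutive repeats, then drop blanks."""
    out = []
    prev = None
    for ch in frame_labels:
        if ch != prev and ch != blank:
            out.append(ch)
        prev = ch
    return "".join(out)

# Invented best-symbol-per-column output for an image of the word "hello".
frames = "hh-e-ll-lo-"
decoded = ctc_greedy_decode(frames)
print(decoded)  # hello
```

The blank between the two runs of "l" is what allows a genuine double letter to survive the collapsing step.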

Multimodal AI:
GPT-4V, Claude 3/3.5 (Sonnet, Opus), and Gemini all accept image inputs. The internals of these systems are not fully public, but in openly documented multimodal designs the image is typically encoded by a ViT into token-like embeddings that are concatenated with text token embeddings before being processed by the language model's Transformer decoder.
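
The concatenation step can be sketched with placeholder vectors. Everything here is illustrative: the embedding width of 4, the 196 image tokens, the five text tokens, and the choice to place image tokens first are assumptions, and `build_input_sequence` is a hypothetical helper.

```python
def build_input_sequence(image_embeddings, text_embeddings):
    """Join image and text embeddings into one sequence.
    All vectors must share the same width so the decoder can
    treat them uniformly."""
    widths = {len(v) for v in image_embeddings + text_embeddings}
    if len(widths) != 1:
        raise ValueError("image and text embeddings must have the same dimension")
    return image_embeddings + text_embeddings

# Placeholder embeddings: 196 image patches and 5 text tokens, width 4.
image_embeddings = [[0.1] * 4 for _ in range(196)]
text_embeddings = [[0.2] * 4 for _ in range(5)]

sequence = build_input_sequence(image_embeddings, text_embeddings)
print(len(sequence))  # 201
```

From the decoder's point of view the result is just a longer sequence, which is why a shared Transformer backbone makes the integration straightforward.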

06. Common Pitfalls / Misconceptions

Accuracy on benchmarks does not equal real-world reliability:
A model that achieves around 90% top-1 accuracy on ImageNet can still fail badly on images from slightly different distributions, unusual lighting, or adversarial perturbations. Benchmark performance is necessary but not sufficient.

More pixels is not always better:
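Many CV models downsample inputs to a fixed resolution (224x224 is common). Very high-resolution inputs can actually hurt if the architecture was not designed for them. In a ViT, for example, learned positional embeddings do not automatically generalize to patch counts that were never seen in training, and the number of patches grows quadratically with image side length.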
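
A quick calculation shows how fast the input grows for a patch-based model. The 16-pixel patch size and the three resolutions are example values, and the last column is only a rough proxy for self-attention cost (one entry per pair of patches).

```python
def vit_sequence_length(height, width, patch=16):
    """Number of patches a ViT-style model sees for a given image size."""
    return (height // patch) * (width // patch)

lengths = {}
for side in (224, 384, 1024):
    n = vit_sequence_length(side, side)
    lengths[side] = n
    print(side, n, n * n)
# 224 196 38416
# 384 576 331776
# 1024 4096 16777216
```

Going from 224 to 1024 pixels per side multiplies the patch count by about 21 and the pairwise attention entries by about 437.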

Object detection models output confidence scores, not calibrated probabilities:
A score of 0.9 does not guarantee that 90% of detections with that score are correct. Calibration is a separate concern, and uncalibrated scores are routinely misread as probabilities in production systems.
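
One way to check calibration is to compare average confidence with observed accuracy inside score bins, a quantity often called expected calibration error (ECE). The sketch below is a minimal version with invented data; for a detector, whether a detection counts as correct would come from matching it against ground truth.

```python
def expected_calibration_error(scores, correct, n_bins=5):
    """Weighted mean gap between average confidence and accuracy per bin."""
    total = len(scores)
    if total == 0:
        return 0.0
    bins = [[] for _ in range(n_bins)]
    for s, c in zip(scores, correct):
        idx = min(int(s * n_bins), n_bins - 1)  # a score of 1.0 goes in the last bin
        bins[idx].append((s, c))
    ece = 0.0
    for items in bins:
        if not items:
            continue
        avg_conf = sum(s for s, _ in items) / len(items)
        accuracy = sum(1 for _, c in items if c) / len(items)
        ece += (len(items) / total) * abs(avg_conf - accuracy)
    return ece

# Invented example: ten detections scored 0.9, of which only six are correct.
scores = [0.9] * 10
correct = [True] * 6 + [False] * 4

ece = expected_calibration_error(scores, correct)
print(round(ece, 3))  # 0.3
```

A gap of 0.3 means the scores overstate reliability by about 30 percentage points in this toy case; a well-calibrated model would have a gap near zero.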

CNNs are not dead:
ViTs dominate at scale, but CNN-based architectures like ConvNeXt remain competitive and are often faster and more efficient at smaller scales.