03. How It Works
The neuron up close
A weight encodes how much an input matters, so a large weight lets that input strongly sway the neuron. The bias encodes how easy the neuron is to switch on, what Nielsen calls "a measure of how easy it is to get the perceptron to fire." Picture a neuron as weighing evidence for a yes-or-no decision, leaning harder on the factors that count for more.
This unit began as the perceptron in the 1950s and 60s. Frank Rosenblatt built it on earlier work by Warren McCulloch and Walter Pitts. It took yes/no inputs and fired a 0 or a 1 depending on whether its weighted sum crossed a threshold.
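The threshold rule above fits in a few lines. This is a minimal sketch, not Rosenblatt's original formulation: the function name `perceptron`, the festival scenario, and the specific weights and bias are illustrative assumptions, and the threshold is folded into the bias so the neuron fires when the weighted sum plus bias is positive.

```python
def perceptron(inputs, weights, bias):
    """Fire 1 if the weighted sum plus bias is positive, else 0."""
    weighted_sum = sum(w * x for w, x in zip(weights, inputs))
    return 1 if weighted_sum + bias > 0 else 0

# Hypothetical yes-or-no decision: go to the festival?
# Inputs (each 0 or 1): good weather, a friend is coming, near transit.
weights = [6, 2, 2]   # weather counts for the most
bias = -5             # same as requiring a weighted sum above 5

print(perceptron([1, 0, 0], weights, bias))  # 1: good weather alone is enough
print(perceptron([0, 1, 1], weights, bias))  # 0: the two minor factors are not
```

Raising the bias toward zero makes the neuron easier to switch on, and lowering it makes the neuron harder to convince.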
The activation function, the bend that matters
A perceptron learns poorly because its output can jump from 0 to 1 on a tiny weight change, which blocks gradual improvement. The smoother sigmoid neuron replaced it because, in Nielsen's words, "small changes in their weights and bias cause only a small change in their output." Sigmoid squashes any number into the range 0 to 1.
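To make the contrast concrete, the sketch below nudges a single weight across zero and compares how far each kind of neuron's output moves. The input, bias, and weight values are assumptions chosen only to sit right at the step's threshold.

```python
import math

def step(z):
    """Perceptron-style output: a hard jump at zero."""
    return 1.0 if z > 0 else 0.0

def sigmoid(z):
    """Smooth output strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

x, bias = 1.0, 0.0
w_before, w_after = -0.001, 0.001   # a tiny nudge to one weight

step_change = step(w_after * x + bias) - step(w_before * x + bias)
sigmoid_change = sigmoid(w_after * x + bias) - sigmoid(w_before * x + bias)

print("step output moved by   ", step_change)      # the full jump from 0 to 1
print("sigmoid output moved by", sigmoid_change)   # a small fraction of that
```

The step neuron gives the learning process no hint about whether a small change helped a little or hurt a little, while the sigmoid neuron's output shifts in proportion to the nudge.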
The activation used most today is ReLU, which outputs zero for negatives and passes positives through unchanged. CS231n's blunt advice is to "use the ReLU non-linearity ... Never use sigmoid," partly for speed: the AlexNet authors reported a small ReLU network reaching the same training error about six times faster than an equivalent tanh network.
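ReLU itself is about as simple as a function gets. A minimal sketch, with the sample inputs chosen only for illustration:

```python
def relu(z):
    """Zero for negative inputs, the input itself otherwise."""
    return max(0.0, z)

samples = [-2.0, -0.5, 0.0, 0.5, 2.0]
print([relu(z) for z in samples])  # [0.0, 0.0, 0.0, 0.5, 2.0]
```

There is no exponential to evaluate, only a comparison, which makes each evaluation cheap.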
That bend is what gives a network its power. Strip the non-linearity and a stack of layers collapses into a single linear layer, no more capable than plain linear regression. CS231n puts it plainly: "the non-linearity is where we get the wiggle."
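The collapse can be checked by hand on a toy case. In the sketch below, two weight matrices applied one after the other give exactly the same answer as their single product, and putting a ReLU between them breaks that equivalence. The matrices and input are made-up values, biases are left out for brevity, and `matvec` and `matmul` are small helper functions written for this example.

```python
def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def matmul(A, B):
    """Multiply matrix A by matrix B."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def relu(z):
    return max(0.0, z)

W1 = [[1.0, 2.0], [0.0, -1.0], [3.0, 1.0]]   # 2 inputs -> 3 hidden units
W2 = [[1.0, 0.0, 2.0], [-1.0, 1.0, 0.0]]     # 3 hidden units -> 2 outputs
x = [0.5, -2.0]

two_layers = matvec(W2, matvec(W1, x))        # no activation in between
one_layer = matvec(matmul(W2, W1), x)         # a single merged layer
with_relu = matvec(W2, [relu(h) for h in matvec(W1, x)])

print(two_layers)  # [-4.5, 5.5]
print(one_layer)   # [-4.5, 5.5], identical to the two-layer stack
print(with_relu)   # [0.0, 2.0], no single matrix reproduces this for every x
```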
Layers and the forward pass
When the network reads a digit, the pixels enter the input layer and flow one direction toward the output, with no loops back. Each layer's output is the next layer's input. A forward pass is just repeated rounds of multiply-and-add, each followed by the activation function. Early layers might catch strokes and loops, and later layers combine them into a verdict like "this looks like a 7."
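Here is the forward pass as a rough sketch. The network is a hypothetical toy with 3 inputs, 2 hidden neurons, and 1 output, nothing like the size needed for digits, and its weights are arbitrary numbers rather than trained values. Sigmoid is used as the activation purely for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def layer(inputs, weights, biases):
    """One round of multiply-and-add per neuron, followed by the activation."""
    return [
        sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]

def forward(inputs, layers):
    """Feed each layer's output into the next, in one direction only."""
    activation = inputs
    for weights, biases in layers:
        activation = layer(activation, weights, biases)
    return activation

# Each entry is (weights, biases); one row of weights per neuron.
network = [
    ([[0.2, -0.4, 0.1], [0.5, 0.3, -0.2]], [0.0, 0.1]),  # 3 inputs -> 2 hidden
    ([[1.0, -1.0]], [0.05]),                             # 2 hidden -> 1 output
]

output = forward([1.0, 0.5, -1.0], network)
print(output)  # a single value between 0 and 1
```

A digit recognizer follows the same pattern with far more inputs (one per pixel) and one output per digit class.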
How a network learns
Learning runs in a loop of three beats: guess, score, adjust. The forward pass produces a guess. A loss function, also called a cost function, scores how wrong the guess is. Nielsen's quadratic cost nears zero only when the output is close to the desired answer, and training means driving it as low as it will go.
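A quadratic cost can be sketched as follows, assuming the usual form: half the squared distance between output and target, averaged over the examples. The function name and the sample outputs are illustrative.

```python
def quadratic_cost(outputs, targets):
    """Average over examples of half the squared output-to-target distance."""
    total = 0.0
    for output, target in zip(outputs, targets):
        total += sum((t - o) ** 2 for o, t in zip(output, target))
    return total / (2 * len(outputs))

target = [[0.0, 1.0]]      # the desired answer for one example
good_guess = [[0.1, 0.9]]  # close to the target
bad_guess = [[0.9, 0.2]]   # far from the target

print(quadratic_cost(good_guess, target))  # small
print(quadratic_cost(bad_guess, target))   # much larger
print(quadratic_cost(target, target))      # 0.0 for a perfect guess
```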
The network then adjusts its weights with gradient descent. It measures the slope of the loss where it currently sits, takes a small step in the downhill direction, and repeats. The size of each step is the learning rate.
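The same downhill walk works on any smooth curve. The sketch below uses a stand-in loss with a single weight, a bowl whose lowest point is at 3, rather than a real network loss. The starting point, learning rate, and step count are arbitrary choices.

```python
def loss(w):
    """Stand-in loss: a bowl with its minimum at w = 3."""
    return (w - 3.0) ** 2

def slope(w):
    """Derivative of the loss with respect to w."""
    return 2.0 * (w - 3.0)

w = 0.0              # starting guess
learning_rate = 0.1  # size of each downhill step

for _ in range(100):
    w -= learning_rate * slope(w)   # step against the slope

print(w, loss(w))  # w ends up very close to 3, loss very close to 0
```

With a learning rate that is too large the steps overshoot the bottom of the bowl, and with one that is too small the walk takes far longer to arrive.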
Computing the right slope for millions of weights is the job of backpropagation. It works backward from the output error, layer by layer, handing each weight its share of the blame using the chain rule, which is why Nielsen calls it "the workhorse of learning in neural networks." Introduced in the 1970s, it was made famous by Rumelhart, Hinton, and Williams in a 1986 Nature paper. This loop repeats over millions of examples.
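For a feel of the mechanics, here is backpropagation on about the smallest network possible: one input, one hidden sigmoid neuron, one output sigmoid neuron, and a quadratic loss on a single example. All of this is an illustrative assumption rather than Nielsen's code. The sketch checks one backpropagated slope against a slope measured by nudging the weight directly, then runs the full loop for a few hundred steps.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, params):
    w1, b1, w2, b2 = params
    h = sigmoid(w1 * x + b1)   # hidden neuron
    a = sigmoid(w2 * h + b2)   # output neuron
    return h, a

def loss(x, y, params):
    _, a = forward(x, params)
    return 0.5 * (a - y) ** 2

def backprop(x, y, params):
    """Return the slope of the loss for [w1, b1, w2, b2] via the chain rule."""
    w1, b1, w2, b2 = params
    h, a = forward(x, params)
    delta2 = (a - y) * a * (1.0 - a)       # error at the output neuron
    delta1 = delta2 * w2 * h * (1.0 - h)   # share of blame passed back
    return [delta1 * x, delta1, delta2 * h, delta2]

x, y = 1.0, 1.0
params = [0.5, 0.0, -0.3, 0.0]   # arbitrary starting weights and biases

# Sanity check: compare the backprop slope for w1 with a measured slope.
eps = 1e-6
up = [params[0] + eps] + params[1:]
down = [params[0] - eps] + params[1:]
numeric_grad_w1 = (loss(x, y, up) - loss(x, y, down)) / (2 * eps)
analytic_grad_w1 = backprop(x, y, params)[0]
print(analytic_grad_w1, numeric_grad_w1)   # the two should agree closely

# The learning loop: forward pass, loss, backprop, gradient descent step.
initial_loss = loss(x, y, params)
learning_rate = 0.5
for _ in range(200):
    grads = backprop(x, y, params)
    params = [p - learning_rate * g for p, g in zip(params, grads)]
final_loss = loss(x, y, params)
print(initial_loss, final_loss)            # the loss should have dropped
```

Measuring every slope by nudging would need a separate pair of forward passes for each weight, while backpropagation gets all of them from one forward pass and one backward pass, which is what makes millions of weights practical.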
What makes it "deep"
"Deep" simply means many hidden layers stacked together, with no official cutoff. Goodfellow, Bengio, and Courville define deep learning as building "a hierarchy of concepts," each one defined through simpler concepts, and they note "there is no consensus about how much depth a model requires to qualify as 'deep'."
Depth matters because each layer builds a more abstract description than the one before. In an image network the first layer finds raw edges, the next corners, the next object parts, and a later one whole objects. That staged feature-building is the real meaning of "deep." You do not strictly need it, since one hidden layer can in theory approximate any continuous function, proven for sigmoid networks by Cybenko in 1989. CS231n calls that "mathematically cute" but practically weak, which is why real systems still go deep.
Why depth and scale unlocked modern AI
The idea is old. Deep learning is roughly the third rebranding of a 1940s concept, after cybernetics in the 1940s to 60s and connectionism in the 1980s to 90s, with the current wave running from 2006 on, so in Goodfellow, Bengio, and Courville's telling the field "only appears to be new." What changed was scale. Since hidden units arrived, networks have doubled in size about every 2.4 years, and larger networks reach higher accuracy on harder tasks.
The turning point was AlexNet in 2012, when three older ingredients arrived together. The ImageNet dataset supplied enough labeled examples, GPUs supplied the parallel computing power, and the authors committed to real depth. Krizhevsky, Sutskever, and Hinton trained their 8-layer, 60-million-weight network on two Nvidia GTX 580 GPUs and cut the ImageNet top-5 error to 15.3%, more than 10.8 percentage points ahead of the runner-up. The depth was essential, and only the GPUs made it trainable.
A transformer is one particular deep-network architecture, and a large language model is a very deep, very large network trained with the exact loop above, forward pass to loss to backprop to gradient descent. What changed for language models is the shape of the layers and the number of weights, not the learning machinery, and both are covered in transformers-and-attention.md and what-is-an-llm.md.