Neural Networks and Deep Learning

In Short

A neural network is a stack of simple units, each one multiplying its inputs by weights, summing them, and making a small yes-or-no-ish decision. The network learns by repeatedly checking its guesses against known answers and nudging those internal numbers to be a little less wrong. Deep learning just means stacking many such layers, and it was that depth, paired with fast GPUs and large datasets from 2012 onward, that turned a 1940s idea into the engine behind modern AI.

01. What It Is

A neural network is a mathematical function built from layers of simple units. The classic teaching example is reading a handwritten digit. You feed in the pixels of a scanned number, and the network outputs its best guess at the digit, 0 through 9.

Each neuron has a tiny job. It multiplies every input by a weight, sums the results, adds one more number called a bias, and runs that total through an activation function to produce a single output. That is the whole operation of a neuron.
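
The arithmetic above fits in a few lines. This is a minimal sketch, not a reference implementation: the function names, the sample inputs, and the choice of sigmoid as the activation are all assumptions made for illustration.

```python
import math

def sigmoid(z):
    """Squash any real number into the range 0 to 1."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron(inputs, weights, bias, activation=sigmoid):
    """Multiply each input by its weight, sum, add the bias, then activate."""
    total = sum(x * w for x, w in zip(inputs, weights)) + bias
    return activation(total)

# Hypothetical values, chosen only to show the call.
output = neuron(inputs=[0.5, 0.8], weights=[0.4, -0.6], bias=0.1)
print(output)  # a single number between 0 and 1
```

Swapping in a different activation only changes the last step; the multiply, sum, and add-bias part stays the same.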

Neurons sit in three kinds of layers. The input layer holds the pixels, the output layer holds the ten possible answers, and everything between is a hidden layer, which, as Michael Nielsen puts it, "really means nothing more than 'not an input or an output'." The only parts a network learns are its weights and biases. A tiny teaching network holds around 26 of them, a modern one holds millions to billions, and everything the network "knows" lives in those numbers.
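
Counting the learnable numbers is simple arithmetic for a fully connected network. The sketch below assumes one particular layout, 3 inputs, 4 hidden units, and 2 outputs, which is one way to arrive at 26; the function name is hypothetical.

```python
def count_parameters(layer_sizes):
    """Count weights and biases in a fully connected network.

    layer_sizes lists the number of units in each layer, input layer first.
    """
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out  # one weight per connection between the layers
        total += n_out         # one bias per receiving neuron
    return total

# Assumed layout: 3 inputs, 4 hidden units, 2 outputs.
print(count_parameters([3, 4, 2]))  # 26
```

The input layer contributes no parameters of its own, since it only holds the data.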

02. Why It Matters

No person can write down every squiggle that counts as a 7. A neural network learns the pattern from labeled examples instead of from hand-written rules, starting with random weights and finding its own internal features. The 1986 paper that popularized modern training found that hidden units "come to represent important features of the task domain" on their own, a capability that set backpropagation apart from earlier, simpler methods. This architecture now sits under almost all of modern AI, including the large language models the rest of this site covers.

03. How It Works

The neuron up close

A weight encodes how much an input matters, so a large weight lets that input strongly sway the neuron. The bias encodes how easy the neuron is to switch on, what Nielsen calls "a measure of how easy it is to get the perceptron to fire." Picture a neuron as weighing evidence for a yes-or-no decision, leaning harder on the factors that count for more.

This unit began as the perceptron in the 1950s and 60s. Frank Rosenblatt built it on earlier work by Warren McCulloch and Walter Pitts. It took yes/no inputs and fired a 0 or a 1 depending on whether its weighted sum crossed a threshold.
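
A perceptron can be sketched as below. The weights and bias are hypothetical, and the threshold is folded into the bias (bias = negative threshold), which is a common rewriting rather than something the text above specifies.

```python
def perceptron(inputs, weights, bias):
    """Fire 1 if the weighted sum plus bias is positive, otherwise 0."""
    total = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1 if total > 0 else 0

# Hypothetical decision with three yes/no factors.
# The first factor carries far more weight than the other two.
weights = [6, 2, 2]
bias = -5  # equivalent to a threshold of 5

print(perceptron([1, 0, 0], weights, bias))  # 1: the heavy factor alone is enough
print(perceptron([0, 1, 1], weights, bias))  # 0: the two light factors are not
```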

The activation function, the bend that matters

A perceptron learns poorly because its output can jump from 0 to 1 on a tiny weight change, which blocks gradual improvement. The smoother sigmoid neuron replaced it because "small changes in their weights and bias cause only a small change in their output." Sigmoid squashes any number into the range 0 to 1.
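
The difference shows up when one weight is nudged slightly. In this sketch the input, bias, and weights are hypothetical, picked so the weighted sum sits right at the threshold.

```python
import math

def step(z):
    """Perceptron-style output: a hard jump from 0 to 1."""
    return 1 if z > 0 else 0

def sigmoid(z):
    """Smooth output between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

x = 1.0
bias = -0.5
w_before, w_after = 0.499, 0.501  # a tiny nudge to a single weight

step_change = step(w_after * x + bias) - step(w_before * x + bias)
sigmoid_change = sigmoid(w_after * x + bias) - sigmoid(w_before * x + bias)

print(step_change)     # the step output flips all the way from 0 to 1
print(sigmoid_change)  # the sigmoid output moves by a small fraction
```

The small, proportional response is what lets training make gradual progress.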

The activation used most today is ReLU, which outputs zero for negatives and passes positives through unchanged. CS231n's blunt advice is to "use the ReLU non-linearity ... Never use sigmoid," partly for speed, since the AlexNet authors reported that a small ReLU network reached the same training error about six times faster than an equivalent network using tanh.

That bend is what gives a network its power. Strip the non-linearity and a stack of layers collapses into one layer, no more capable than basic regression. CS231n puts it plainly, "the non-linearity is where we get the wiggle."
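
The collapse can be checked with one-number "layers." This is a sketch with hypothetical weights: two linear layers in a row give exactly the same outputs as a single linear layer, while a ReLU between them breaks that equivalence.

```python
import math

def relu(z):
    """Zero out negatives, pass positives through unchanged."""
    return max(0.0, z)

w1, b1 = 2.0, 1.0   # first layer (hypothetical values)
w2, b2 = -3.0, 0.5  # second layer (hypothetical values)

def two_linear_layers(x):
    return w2 * (w1 * x + b1) + b2

def two_layers_with_relu(x):
    return w2 * relu(w1 * x + b1) + b2

# Without an activation, the stack is one layer in disguise.
w_collapsed = w2 * w1
b_collapsed = w2 * b1 + b2

def one_linear_layer(x):
    return w_collapsed * x + b_collapsed

for x in [-2.0, 0.0, 1.5]:
    assert math.isclose(two_linear_layers(x), one_linear_layer(x))

print(two_linear_layers(-2.0))     # 9.5, same as the single collapsed layer
print(two_layers_with_relu(-2.0))  # 0.5, the ReLU bends the line here
```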

Layers and the forward pass

When the network reads a digit, the pixels enter the input layer and flow in one direction toward the output, with no loops back. Each layer's output is the next layer's input. A forward pass is just repeated rounds of multiply-and-add, each followed by the activation function. Early layers might catch strokes and loops, and later layers combine them into a verdict like "this looks like a 7."
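
A forward pass can be sketched as a loop over layers. The tiny network here (2 inputs, 2 hidden units, 1 output) and its weights are hypothetical, and ReLU is applied at every layer, including the output, purely to keep the example short.

```python
def relu(z):
    return max(0.0, z)

def layer_forward(inputs, weights, biases):
    """One layer: each row of weights belongs to one neuron."""
    return [
        relu(sum(x * w for x, w in zip(inputs, row)) + b)
        for row, b in zip(weights, biases)
    ]

def forward_pass(inputs, layers):
    """Feed each layer's output into the next layer, front to back."""
    activations = inputs
    for weights, biases in layers:
        activations = layer_forward(activations, weights, biases)
    return activations

# Hypothetical weights for a 2-2-1 network.
layers = [
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, -1.0]),  # hidden layer
    ([[2.0, 1.0]], [0.5]),                     # output layer
]

print(forward_pass([3.0, 1.0], layers))  # [5.5]
```

A digit reader would follow the same loop with far more inputs and ten output values instead of one.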

How a network learns

Learning runs in a loop of three beats: guess, score, adjust. The forward pass produces a guess. A loss function, also called a cost function, scores how wrong the guess is. Nielsen's quadratic cost nears zero only when the output is close to the desired answer, and training means driving it as low as it will go.
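
A quadratic cost can be sketched as follows, assuming the usual form: half the squared distance between output and target, averaged over the examples. The three-value outputs are hypothetical stand-ins for the ten digit scores.

```python
def quadratic_cost(outputs, targets):
    """Average over examples of half the squared output-to-target distance."""
    n = len(outputs)
    total = 0.0
    for output, target in zip(outputs, targets):
        total += sum((t - o) ** 2 for o, t in zip(output, target))
    return total / (2 * n)

# One example whose correct answer is the middle class.
target = [[0.0, 1.0, 0.0]]
close_guess = [[0.1, 0.8, 0.1]]
poor_guess = [[0.8, 0.1, 0.1]]

print(quadratic_cost(close_guess, target))  # small: the guess is nearly right
print(quadratic_cost(poor_guess, target))   # larger: the guess favors the wrong class
print(quadratic_cost(target, target))       # 0.0 for a perfect guess
```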

The network then adjusts its weights with gradient descent. It measures the slope of the loss where it currently sits, takes a small step in the downhill direction, and repeats. The size of each step is the learning rate.
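
The measure-step-repeat cycle looks like this on a one-weight toy problem. The loss function, its minimum at 3, and the learning rate of 0.1 are assumptions chosen so the slope can be written by hand.

```python
def loss(w):
    """A bowl-shaped toy loss with its lowest point at w = 3."""
    return (w - 3.0) ** 2

def slope(w):
    """Derivative of the toy loss with respect to w."""
    return 2.0 * (w - 3.0)

def gradient_descent(w, learning_rate, steps):
    for _ in range(steps):
        w -= learning_rate * slope(w)  # small step in the downhill direction
    return w

w_final = gradient_descent(w=0.0, learning_rate=0.1, steps=100)
print(w_final)        # very close to 3
print(loss(w_final))  # very close to 0
```

A real network does the same thing with one slope per weight instead of a single number.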

Computing the right slope for millions of weights is the job of backpropagation. It works backward from the output error, layer by layer, handing each weight its share of the blame using the chain rule, which is why Nielsen calls it "the workhorse of learning in neural networks." Introduced in the 1970s, it was made famous by Rumelhart, Hinton, and Williams in a 1986 Nature paper. This loop repeats over millions of examples.
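
The chain-rule bookkeeping can be shown on the smallest possible case. This sketch assumes one input, one hidden sigmoid neuron, one output sigmoid neuron, and a half-squared-error loss on a single example; all values are hypothetical. It compares the backpropagated slopes with slopes estimated by nudging each parameter.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, params):
    w1, b1, w2, b2 = params
    h = sigmoid(w1 * x + b1)  # hidden activation
    a = sigmoid(w2 * h + b2)  # output activation
    return h, a

def loss(x, y, params):
    _, a = forward(x, params)
    return 0.5 * (a - y) ** 2

def backprop(x, y, params):
    """Slopes of the loss for w1, b1, w2, b2, computed with the chain rule."""
    w1, b1, w2, b2 = params
    h, a = forward(x, params)
    # Output layer: error at the output times the sigmoid's own slope.
    delta2 = (a - y) * a * (1.0 - a)
    grad_w2, grad_b2 = delta2 * h, delta2
    # Hidden layer: pass the error back through w2, then through the sigmoid.
    delta1 = delta2 * w2 * h * (1.0 - h)
    grad_w1, grad_b1 = delta1 * x, delta1
    return [grad_w1, grad_b1, grad_w2, grad_b2]

def numerical_gradient(x, y, params, eps=1e-6):
    """Estimate each slope by nudging one parameter up and down."""
    grads = []
    for i in range(len(params)):
        up, down = list(params), list(params)
        up[i] += eps
        down[i] -= eps
        grads.append((loss(x, y, up) - loss(x, y, down)) / (2 * eps))
    return grads

params = [0.5, -0.3, 0.8, 0.1]  # hypothetical w1, b1, w2, b2
analytic = backprop(1.5, 1.0, params)
numeric = numerical_gradient(1.5, 1.0, params)
print(analytic)
print(numeric)  # should agree with the line above to many decimal places
```

The nudging approach needs two forward passes per parameter, while backpropagation gets every slope from one forward and one backward pass, which is what makes it practical at millions of weights.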

What makes it "deep"

"Deep" simply means many hidden layers stacked together, with no official cutoff. Goodfellow, Bengio, and Courville define deep learning as building "a hierarchy of concepts," each one defined through simpler concepts, and they note "there is no consensus about how much depth a model requires to qualify as 'deep'."

Depth matters because each layer builds a more abstract description than the one before. In an image network the first layer finds raw edges, the next corners, the next object parts, and a later one whole objects. That staged feature-building is the real meaning of "deep." You do not strictly need it, since one hidden layer can in theory approximate any continuous function, proven for sigmoid networks by Cybenko in 1989. CS231n calls that "mathematically cute" but practically weak, which is why real systems still go deep.

Why depth and scale unlocked modern AI

The idea is old. Deep learning is roughly the third rebranding of a 1940s concept, after cybernetics in the 1940s to 60s and connectionism in the 1980s to 90s, with the current wave running from 2006 on, so the field "only appears to be new." What changed was scale. Since hidden units arrived, networks have doubled in size about every 2.4 years, and larger networks reach higher accuracy on harder tasks.

The turning point was AlexNet in 2012, when three older ingredients arrived together. The ImageNet dataset supplied enough labeled examples, GPUs supplied the parallel computing power, and the authors committed to real depth. The 8-layer, 60-million-weight network built by Krizhevsky, Sutskever, and Hinton was trained on two Nvidia GTX 580 GPUs and cut the ImageNet top-5 error to 15.3%, more than 10.8 points ahead of the runner-up. The depth was essential, and only the GPUs made it trainable.

A transformer is one particular deep-network architecture, and a large language model is a very deep, very large network trained with the exact loop above, forward pass to loss to backprop to gradient descent. What changed for language models is the shape of the layers and the number of weights, not the learning machinery, and both are covered in transformers-and-attention.md and what-is-an-llm.md.

04. Key Terms

Term	Plain meaning
Neuron / unit	The basic building block. It multiplies each input by a weight, sums them, adds a bias, and runs the total through an activation function to make one output number.
Weight	How much one input matters to a neuron. A big weight means strong influence. Weights are what the network learns.
Bias	A per-neuron number that shifts how easily the neuron switches on, its default eagerness to fire before any input arrives.
Activation function	The small non-linear step each neuron applies to its sum. ReLU ("zero out negatives") and sigmoid ("squash to 0-to-1") are common. Without it, many layers behave like one.
Hidden layer	Any layer between input and output, "hidden" only because you never read its values directly. Stacking many of them is what makes a network "deep."
Loss function	The scorecard. One number measuring how far the network's answers are from the correct ones. Training makes it small.
Backpropagation	The method that works backward from the network's error to find how much each weight caused the mistake, so gradient descent can adjust it. The learning rate sets how large each adjustment is.

05. Examples

A deep network works like an assembly line for understanding. Each station adds a little structure to the previous one's output. Raw pixels become edges, edges become shapes, and shapes become a recognizable digit.

Gradient descent is like walking down a foggy hill. You cannot see the whole valley, so you feel for the downhill direction and step that way. The learning rate is your stride. Too long and you overshoot the bottom, too short and you crawl.
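
The effect of stride length can be seen on a toy hill. In this sketch the bowl-shaped loss with its bottom at 3, the 20-step budget, and the three learning rates are all hypothetical choices.

```python
def distance_from_bottom(learning_rate, steps=20, start=0.0, bottom=3.0):
    """Run gradient descent on (w - bottom)^2 and report how far off it ends."""
    w = start
    for _ in range(steps):
        slope = 2.0 * (w - bottom)
        w -= learning_rate * slope
    return abs(w - bottom)

too_long = distance_from_bottom(1.1)     # each step leaps past the bottom
too_short = distance_from_bottom(0.001)  # each step barely moves
moderate = distance_from_bottom(0.3)     # steady progress

print(too_long)   # farther away than the starting distance of 3
print(too_short)  # still close to the starting distance of 3
print(moderate)   # very close to 0
```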

06. Common Misconceptions

"A neural network works like a human brain."
It is loosely inspired by biological neurons, but the resemblance is shallow. Each artificial neuron is a one-line piece of arithmetic. CS231n calls the model "very coarse" and says neuroscientists groan at the analogy. A real brain has around 86 billion neurons and on the order of 100 trillion to a quadrillion connections, and it does not learn by the calculus a network uses.

"Someone programs the rules into the network."
No one writes the rules. The network starts from random weights and discovers its own features from the data, exactly as the 1986 backpropagation result showed. The core ideas reach back to the 1940s, and only the scale is new.

"'Deep' means the AI thinks or understands deeply."
Deep is structural, a word about the count of stacked layers and nothing more. The network does not contemplate. It passes numbers through many stages, and there is no agreed cutoff for how many layers count as "deep."

"More layers always means a better model."
Depth helps only with enough data and careful training. The Universal Approximation Theorem shows that even one hidden layer can, in theory, approximate any continuous function, and past a point, extra depth without enough data makes a model memorize noise.