Backpropagation Step by Step
When training a neural network, we aim to adjust its weights and biases so that its predictions improve. In this post, we discuss how backpropagation works and explain it in detail for three simple examples. The first two examples contain all the calculations; for the last one, we only illustrate the equations that need to be calculated. We will not go into the general formulation of the backpropagation algorithm, but we give some further reading at the end. At its most basic, a neural network takes input data and maps it to an output value. Since each neuron's output is composed of the outputs of activation functions of neurons in previous layers, the chain rule is essential to calculating their derivatives.
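As a refresher, for a composition $f(g(x))$ the chain rule reads

$$
\frac{d}{dx}\, f(g(x)) = f'(g(x)) \cdot g'(x),
$$

and applying it repeatedly is what lets us peel a deep network apart layer by layer.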
Example: Two Neurons in a Layer
Because the network is not yet well trained, the activations in that output layer are effectively random. That's no good; we want to change these activations so that they properly identify the digit 2. The way to read this is that the cost function is 32 times more sensitive to changes in that first weight than in the second. So if you were to wiggle the value of that weight a bit, it would cause a change to the cost function 32 times greater than the same wiggle to the second weight would.
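To make that concrete with purely illustrative numbers (not measurements from any particular network): if $\partial C/\partial w_1 = 3.2$ and $\partial C/\partial w_2 = 0.1$, then a small nudge to $w_1$ moves the cost $3.2 / 0.1 = 32$ times as much as the same nudge to $w_2$, so gradient descent will adjust $w_1$ far more aggressively.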
Defining a Neural Network
Whether you're looking at images, words, or raw numerical data, all the network sees is numbers, and it's simply finding patterns in those numbers. The input data is filtered through a matrix of weights, which are the parameters of the network and can number in the thousands, millions, or billions. Fine-tuning these weights to recognize the patterns is obviously not a task any human wants to, or can, do, and so a method to do this was devised, several times, but most notably in 1986 [1].
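To make the numbers-in, numbers-out picture concrete, here is a minimal sketch of a single fully connected layer; the shapes and the names `x`, `W`, and `b` are made up for illustration:

```python
import numpy as np

# Hypothetical input: four raw numbers (pixel values, word counts, ...).
x = np.array([0.5, -1.2, 3.0, 0.7])

# The layer's parameters: a weight matrix and a bias vector. Real networks
# stack many such layers, with thousands to billions of entries in total.
W = np.random.randn(3, 4) * 0.1   # 3 outputs, 4 inputs
b = np.zeros(3)

# All the network "sees": numbers in, numbers out.
z = W @ x + b                     # weighted sums
a = np.maximum(z, 0)              # ReLU activation
print(a)
```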
We've now completed a forward pass and backward pass for a single training example. However, our goal is to train the model to generalize well to new inputs. Doing so requires training on a large number of samples that reflect the diversity and range of inputs the model will be tasked with making predictions on post-training.
The cost for a single training example is the sum of the squares of the differences between the actual output and the desired output. We use gradient descent to walk downhill to a local minimum of the cost function. The derivative solutions can then be substituted into the matrix equation per figure 14. Now, if we look closely, you'll notice the repeated terms: figure 8 recalls the previous-layer equations alongside our current equation for w₁₁, shows which terms are repeated, and factors them out as deltas.
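Written out, that single-example cost is

$$
C = \sum_j \left(a_j - y_j\right)^2,
$$

where $a_j$ is the network's actual output at node $j$ and $y_j$ the desired output.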
- Calculating gradients for millions of examples for each iteration of weight updates becomes inefficient.
- In the backward pass, or backpropagation, the errors between the predicted and actual outputs are computed (see the short derivation after this list).
- One is that activations from the relu layers get transformed to become parameters of a linear layer of the backward network (see Equation 14.7).
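As referenced in the backward-pass bullet above, for the squared-error cost the output error is just the derivative of the cost with respect to each output activation:

$$
\delta_j = \frac{\partial C}{\partial a_j} = 2\,(a_j - y_j),
$$

which is simply the difference between predicted and actual output, up to a constant factor.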
Example of Backpropagation in Machine Learning
A simplified model is used to illustrate the concepts without overcomplicating the process. A 2-input, 2-output network with two hidden layers is used, as illustrated in figure 1. The output nodes are denoted e, indicating the error, though you may also see them commonly denoted C for the cost function. This would typically be a function like mean squared error (MSE) or binary cross-entropy.
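As a rough sketch of that architecture in code, with made-up initial values (the variable names are illustrative, not taken from figure 1 or any framework):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# 2 inputs -> two hidden layers of 2 neurons each -> 2 outputs.
W1, b1 = rng.normal(scale=0.5, size=(2, 2)), np.zeros(2)
W2, b2 = rng.normal(scale=0.5, size=(2, 2)), np.zeros(2)
W3, b3 = rng.normal(scale=0.5, size=(2, 2)), np.zeros(2)

def forward(x):
    h1 = sigmoid(W1 @ x + b1)        # first hidden layer
    h2 = sigmoid(W2 @ h1 + b2)       # second hidden layer
    return sigmoid(W3 @ h2 + b3)     # output layer

x = np.array([0.1, 0.9])             # example input
target = np.array([1.0, 0.0])        # desired output
y = forward(x)
e = np.sum((y - target) ** 2)        # the error at the output nodes
```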
Changing the Activations
The method takes a neural network's output error and propagates this error backwards through the network, determining which paths have the greatest influence on the output. Backprop is often presented as a method just for training neural networks, but it is actually a much more general tool than that: it is an efficient way to find partial derivatives in computation graphs.
This map will visually guide us through the derivation and deliver us to our final destination: the formulas of backpropagation. To be clear, we will still end up with many formulas that look intimidating on their own, but after viewing the process by which they evolve, each equation should make sense, and things become very systematic. With merge and branch, we can construct any DAG computation graph by simply inserting these layers wherever we want a layer to have multiple inputs or multiple outputs. Backpropagation is an algorithm that efficiently calculates the gradient of the loss with respect to each and every parameter in a computation graph. It relies on a special new operation, called backward, that, just like forward, can be defined for each layer and acts in isolation from the rest of the graph. But first, before we get to defining backward, we will build up some intuition about the key trick backpropagation will exploit.
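As a preview of how isolated such a backward operation can be, here is a minimal sketch for a ReLU layer; the class and method names are my own, not an established API:

```python
import numpy as np

class ReLU:
    """Each layer defines forward and backward; backward acts in isolation."""

    def forward(self, x):
        self.x = x                      # cache the input for the backward pass
        return np.maximum(x, 0)

    def backward(self, grad_out):
        # Purely local rule: let gradient through where the input was positive.
        return grad_out * (self.x > 0)

# Running a graph backward is just calling backward in reverse order:
layers = [ReLU(), ReLU()]
y = np.array([1.0, -2.0])
for layer in layers:
    y = layer.forward(y)
grad = np.ones_like(y)                  # pretend dLoss/dy is all ones
for layer in reversed(layers):
    grad = layer.backward(grad)
```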
Namely, if everything connected to that digit-2 neuron with a positive weight was brighter, and if everything connected with a negative weight was dimmer, that digit-2 neuron would be more active. Because this gets quite repetitive, and because I only have so much length I can cram into a GIF, the process is repeated (very) quickly in figure 9 for all remaining weights. This tracing out of the edges and nodes is done for each path from the error node to each weight in the final layer, running through it quickly in figure 4. The weight subscript indices may appear backwards, but they will make more sense when we build the matrices. Indexing in this manner allows the rows of the matrix to line up with the rows of the neural network, and the weight indices agree with the typical (row, column) matrix indexing.
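Concretely, under the (row, column) convention, $w_{ij}$ connects neuron $j$ in the previous layer to neuron $i$ in the current layer, so the weighted inputs of a two-neuron layer are one matrix-vector product:

$$
\begin{pmatrix} z_1 \\ z_2 \end{pmatrix}
=
\begin{pmatrix} w_{11} & w_{12} \\ w_{21} & w_{22} \end{pmatrix}
\begin{pmatrix} a_1 \\ a_2 \end{pmatrix}
+
\begin{pmatrix} b_1 \\ b_2 \end{pmatrix}.
$$

Row $i$ of the weight matrix lines up with neuron $i$ of the layer, which is why the subscripts look backwards when read edge by edge.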
Because the process of backpropagation is so fundamental to how neural networks are trained, a helpful explanation of the process requires a working understanding of how neural networks make predictions. Everything we just stepped through only records how a single training example wishes to nudge each of the many, many weights and biases. To zoom out a bit, you also go through this same backpropagation routine for every other training example, recording how each of them would like to change the weights and biases. The collection of these averaged nudges to each weight and bias is, loosely speaking, the negative gradient of the cost function! Underlying all of this is the chain rule, a calculus principle dating back to the 17th century, which lets us compute the rate at which each neuron contributes to overall loss, and with it the impact of changes to any variable, that is, to any weight or bias, within the equations those neurons represent.
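For a single weight $w$ feeding a neuron with weighted input $z$, activation $a$, and cost $C$, the chain rule factors the sensitivity into three local pieces:

$$
\frac{\partial C}{\partial w} = \frac{\partial z}{\partial w} \cdot \frac{\partial a}{\partial z} \cdot \frac{\partial C}{\partial a}.
$$

Each factor is cheap to evaluate on its own, and backpropagation is essentially the bookkeeping that reuses these factors layer by layer.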
Such a computation graph could represent an MLP, for example, which we will see in the next section.
I say “loosely speaking” only because I have yet to get quantitatively precise about these nudges. But if you understood each change I referenced above, why some are proportionally bigger than others, and how they all need to be added together, then you understand the mechanics of what backpropagation is actually doing. Connections are strengthened between neurons that should be activated at the same time. Remember, when we talk about gradient descent, we don't just care about whether each component should be nudged up or down; the sizes of these nudges should also be proportional to how far off each output value is from the target.
The Full Algorithm: Forward, Then Backward
Figure 2 indicates the notation for nodes and weights in the example network.
(Figure: Illustration of backpropagation in a neural network consisting of a single neuron.)
Unless mentioned otherwise, we use the following data, activation function, and loss throughout the examples of this post.
Calculating Gradients
- In stochastic gradient descent (SGD), each update step within an epoch uses a single training example (see the sketch below).
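Here is a minimal, self-contained sketch of SGD on a toy one-parameter model; the data and learning rate are made up for illustration:

```python
# A tiny linear model y_hat = w * x, trained with SGD on squared error.
data = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9)]   # (input, target) pairs
w, lr = 0.0, 0.05                             # initial weight, learning rate

for epoch in range(50):
    for x, y in data:                  # SGD: one update per single example
        y_hat = w * x                  # forward pass
        grad = 2 * (y_hat - y) * x     # d/dw of (y_hat - y)**2
        w -= lr * grad                 # step against the gradient

print(w)   # approaches ~2.0, the slope underlying the data
```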
There is a general term for this setup, where one neural net outputs values that parameterize another neural net: this is called a hypernetwork [1]. The forward network is a hypernetwork that parameterizes the backward network. Our goal is to iteratively update the weights until we have reached a minimum of the loss. The object of gradient descent algorithms is to find the specific parameter adjustments that will move us down the gradient most efficiently. Moving down the gradient of the loss function, that is, descending it, will decrease the loss. Since the gradient we calculated during backpropagation contains the partial derivatives for every model parameter, we know which direction to “step” each of our parameters in to reduce loss.
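In symbols, each gradient descent step updates every parameter $\theta$ against its own partial derivative, scaled by a learning rate $\eta$:

$$
\theta \leftarrow \theta - \eta \, \frac{\partial L}{\partial \theta}.
$$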
Notice that all these operations are simple expressions, mainly involving matrix multiplies. Forward and backward for a linear layer are also very easy to write in code, using any library that provides matrix multiplication (matmul) as a primitive. In the remaining sections, we will still focus only on the case of backpropagation for the loss at a single datapoint. As you read on, keep in mind that doing the same for batches simply requires applying Equation 14.3. The gradient of a sum of terms is the sum of the gradients of each term.
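A sketch of both directions for one datapoint, as in the text (the function names are mine; the batched case per Equation 14.3 just sums these per-example gradients):

```python
import numpy as np

def linear_forward(W, b, x):
    # y = W x + b: a single matrix multiply plus a bias.
    return W @ x + b

def linear_backward(W, x, grad_y):
    # grad_y holds dLoss/dy for one datapoint. Every result is again just
    # a matmul or an outer product, hence "very easy to write in code".
    grad_W = np.outer(grad_y, x)   # dLoss/dW
    grad_b = grad_y                # dLoss/db
    grad_x = W.T @ grad_y          # dLoss/dx, handed to the previous layer
    return grad_W, grad_b, grad_x
```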
Each training example has its own desire for how the weights and biases should be adjusted, and with what relative strengths. By averaging together the desires of all training examples, we get the final result for how a given weight or bias should be changed in a single gradient descent step. The output of Lc's activation function depends on the contributions that it receives from neurons in the penultimate layer, which we'll call layer L-1. One way to change Lc's output is to change the weights between the neurons in L-1 and Lc; to do so, we'll need to know how any change in those weights will change Lc's own output. By calculating the partial derivative of Lc's output with respect to each L-1 weight, we can see how increasing or decreasing any of them will bring the output of Lc closer to (or further away from) 1. In a well-trained network, the model will consistently output a high probability value for the correct classification and low probability values for the other, incorrect classifications.
Essentially, the output of a neuron is sent back as input to itself as well as going forward to the neurons in the next layer. Average loss is computed as the mean of the loss over all the data points. The closer to zero the average loss is, the smaller the error in predictions. The choice of loss function depends on the specific task, the nature of the inputs and outputs, and many other factors, which will be discussed in more detail in Introduction to Deep Learning.
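In symbols, for $N$ data points with per-example loss $\ell_i$, the average loss is

$$
L = \frac{1}{N} \sum_{i=1}^{N} \ell_i,
$$

and training aims to drive this quantity toward zero.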