Backpropagation in Neural Networks
This is a note I made for myself while trying to understand how to train a neural network. Knowledge of calculus and artificial neural networks isn't strictly required, but some familiarity with these topics helps. It's okay if you don't know any of this (or have forgotten it); I will introduce the concepts necessary to understand backpropagation.
A bit of Calculus
What is a derivative?
The derivative of $y$ with respect to $x$, written $\frac{dy}{dx}$, represents the rate of change in $y$ for a small change in $x$. In simpler terms, it's the ratio of the impact on $y$ as $x$ increases by a small value, $\Delta x$.
For example, let $y = x^2$. When $x = 3$, the value of $y$ is $9$.
If we add to $x$ a really small number, $\Delta x$, say $0.001$, how much would $y$ change?
The difference of $(x + \Delta x)^2$ and $x^2$ is $2x\Delta x + \Delta x^2 \approx 2x\Delta x$. When we increased $x$ by $0.001$, $y$ increased by about $0.006$; the ratio of impact is essentially $6$, because $\Delta y$ is twice the value of $x$ times $\Delta x$, so $\frac{dy}{dx} = 2x$.
This could also easily be shown using the definition of the derivative:
$$\frac{dy}{dx} = \lim_{h \to 0} \frac{(x+h)^2 - x^2}{h} = \lim_{h \to 0} (2x + h) = 2x$$
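The ratio-of-impact idea is easy to check numerically. The sketch below assumes the example function is $f(x) = x^2$ and compares a finite-difference ratio against the analytic derivative $2x$:

```python
# Numerically check the derivative of f(x) = x^2 (assumed example)
# against the analytic result dy/dx = 2x.

def f(x):
    return x ** 2

x = 3.0
dx = 0.001
ratio = (f(x + dx) - f(x)) / dx  # finite-difference approximation of dy/dx

print(ratio)  # ~6.001, close to the analytic derivative 2*x = 6
```

Shrinking `dx` further pushes the ratio even closer to the exact derivative, which is precisely what the limit definition expresses.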
Chain Rule
Let's raise the difficulty level a bit.
Let $y = f(u)$ and $u = g(x)$. If we have the values of $\frac{dy}{du}$ and $\frac{du}{dx}$, what is $\frac{dy}{dx}$ now?
This can be seen as:
- A small change in $x$ corresponds to a ratio of impact on $u$ of $\frac{du}{dx}$.
- A small change in $u$ corresponds to a ratio of impact on $y$ of $\frac{dy}{du}$.
So, what is the ratio of impact on $y$ for a small change in $x$?
It's just $\frac{dy}{du} \cdot \frac{du}{dx}$. This is basically the chain rule of calculus.
While learning, I came across an intuitive example that makes it easier to understand:
"If a car travels twice as fast as a bicycle and the bicycle is four times as fast as a walking man, then the car travels $2 \times 4 = 8$ times as fast as the man."
Formally, when $y = f(u)$ and $u = g(x)$:
$$\frac{dy}{dx} = \frac{dy}{du} \cdot \frac{du}{dx}$$
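The chain rule can also be verified numerically. This sketch uses assumed example functions $y = f(u) = u^2$ and $u = g(x) = 3x$ (not from the text above), and checks that differentiating the composition directly agrees with multiplying the two ratios of impact:

```python
# Numeric sanity check of the chain rule with assumed example functions
# y = f(u) = u^2 and u = g(x) = 3x.

def g(x):
    return 3 * x    # du/dx = 3

def f(u):
    return u ** 2   # dy/du = 2u

x = 2.0
dx = 1e-6

# Direct finite difference of the composition y = f(g(x))
direct = (f(g(x + dx)) - f(g(x))) / dx

# Chain rule: dy/dx = dy/du * du/dx = (2 * u) * 3, evaluated at u = g(x)
chain = (2 * g(x)) * 3

print(direct, chain)  # both close to 36
```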
Chain Rule with One Independent Variable
Let $z = f(x, y)$, $x = g(t)$, and $y = h(t)$. Assuming we have $\frac{\partial z}{\partial x}$, $\frac{\partial z}{\partial y}$, $\frac{dx}{dt}$, and $\frac{dy}{dt}$, what is $\frac{dz}{dt}$?
We can find $\frac{dz}{dt}$ using the equation as follows:
$$\frac{dz}{dt} = \frac{\partial z}{\partial x} \cdot \frac{dx}{dt} + \frac{\partial z}{\partial y} \cdot \frac{dy}{dt}$$
To understand why this holds true, let's examine the impact ratio of $t$ on both $x$ and $y$ separately.
When we increase $t$ by $\Delta t$,
- $x$ increases by $\frac{dx}{dt} \Delta t$, which in turn increases $z$ by $\frac{\partial z}{\partial x} \frac{dx}{dt} \Delta t$
- $y$ increases by $\frac{dy}{dt} \Delta t$, which in turn increases $z$ by $\frac{\partial z}{\partial y} \frac{dy}{dt} \Delta t$
- $z$ increases by $\left( \frac{\partial z}{\partial x} \frac{dx}{dt} + \frac{\partial z}{\partial y} \frac{dy}{dt} \right) \Delta t$, which is the same as $\frac{dz}{dt} \Delta t$
Hence, the equation used above is proven to be correct.
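This version of the chain rule can be checked numerically too. The sketch below assumes the example $z = x \cdot y$ with $x = t^2$ and $y = t + 1$ (chosen here for illustration, not taken from the text):

```python
# Numeric check of the chain rule with one independent variable,
# using assumed examples z = x*y, x = t^2, y = t + 1.

def z_of_t(t):
    x, y = t ** 2, t + 1
    return x * y

t = 2.0
dt = 1e-6

# Direct finite difference of z along t
direct = (z_of_t(t + dt) - z_of_t(t)) / dt

# Chain rule: dz/dt = (dz/dx)(dx/dt) + (dz/dy)(dy/dt)
x, y = t ** 2, t + 1
chain = y * (2 * t) + x * 1  # dz/dx = y, dz/dy = x, dx/dt = 2t, dy/dt = 1

print(direct, chain)  # both close to 16
```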
A bit of Neural Network
Imagine we have a neural network with the following architecture, where each neuron has a linear activation function:
Notation:
- $x$ is the input of the neural network,
- $w^{(l)}_{ij}$ is the $i$-th weight of the $j$-th neuron in the $l$-th layer,
- $b^{(l)}_{j}$ is the bias of the $j$-th neuron in the $l$-th layer,
- and $a^{(l)}_{j}$ is the activation value of the $j$-th neuron in the $l$-th layer.
All the parameters are assigned random values at initialization.
For simplicity, we can denote all the weights and biases in the neural network with $W$ and $b$, respectively.
The neural network can essentially be expressed as a function with input $x$ and parameters $W$, $b$:
$$\hat{y} = f(x; W, b)$$
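A forward pass through such a network can be sketched in a few lines. The layer sizes and weight values below are my own assumptions (the article's figure isn't reproduced here); every neuron is linear, i.e. a weighted sum plus a bias:

```python
# A minimal sketch of a forward pass with linear activations.
# Architecture (assumed): 2 inputs -> 2 hidden neurons -> 1 output.

def forward(x, weights, biases):
    """Compute activations layer by layer: a_j = sum_i(w_ij * a_prev_i) + b_j."""
    a = x
    for W, b in zip(weights, biases):
        a = [sum(w_ij * a_i for w_ij, a_i in zip(row, a)) + b_j
             for row, b_j in zip(W, b)]
    return a

weights = [
    [[0.5, -0.2], [0.1, 0.4]],  # layer 1: two neurons, two weights each
    [[1.0, -1.0]],              # layer 2: one output neuron
]
biases = [[0.1, 0.0], [0.2]]

y_hat = forward([1.0, 2.0], weights, biases)
print(y_hat)  # [-0.5]
```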
Now that we have a way to represent the neural network, we would also need a way to tell how well our neural network performed. Let's define the loss function as:
$$L(\hat{y}, y) = (\hat{y} - y)^2$$
Once we have a loss function, we can also define our cost function $C$:
$$C(W, b) = \frac{1}{n} \sum_{i=1}^{n} L(\hat{y}^{(i)}, y^{(i)})$$
This is simply the mean squared error of our neural network with parameters $W$ and $b$.
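In code, the squared-error loss and its mean over a batch look like this (the prediction and target values are made-up numbers for illustration):

```python
# Squared-error loss for one example, and the cost as its mean over a batch.

def loss(y_hat_i, y_i):
    return (y_hat_i - y_i) ** 2  # squared error for a single example

def cost(y_hat, y):
    n = len(y)
    return sum(loss(p, t) for p, t in zip(y_hat, y)) / n  # mean over examples

print(cost([0.5, 1.5, 2.0], [1.0, 1.0, 2.0]))  # (0.25 + 0.25 + 0) / 3
```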
To train the parameters of the neural network towards a local minimum of the cost function $C$, we can perform gradient descent at each training step:
$$W \leftarrow W - \alpha \frac{\partial C}{\partial W}, \qquad b \leftarrow b - \alpha \frac{\partial C}{\partial b}$$
Here, $\alpha$ denotes the learning rate of the algorithm.
This essentially moves each parameter (along each dimension) toward a local minimum.
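A toy gradient descent loop makes the update rule concrete. The cost below, $C(w) = (w - 3)^2$, is an assumed one-parameter stand-in for the network's cost; its derivative is $2(w - 3)$:

```python
# Toy gradient descent on the assumed cost C(w) = (w - 3)^2,
# whose minimum is at w = 3 and whose derivative is dC/dw = 2*(w - 3).

alpha = 0.1  # learning rate
w = 0.0      # initial parameter value

for step in range(100):
    grad = 2 * (w - 3)    # dC/dw at the current w
    w = w - alpha * grad  # gradient descent update

print(w)  # converges toward the minimum at w = 3
```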
Below is a two-dimensional visualization of the gradient descent step on a contour plot.
Depending on our randomly initialized weights and biases, we could end up at different local minima. For instance, in the picture above, starting at $a$ would lead us to $c$. However, starting from $b$ would lead us to $d$ instead.
Hopefully, everything is making sense at this point. This is simply a straightforward explanation of how a neural network is trained. It's pretty much all you need to get started with neural networks.
Backpropagation
Have you ever wondered how we could obtain the value of each derivative used in the gradient descent step?
Namely, $\frac{\partial C}{\partial W}$ and $\frac{\partial C}{\partial b}$.
One widely used algorithm for computing these derivatives is Backpropagation. As the name implies, we propagate from the back to the front.
To demonstrate backpropagation, we'll utilize the same neural network architecture introduced at the beginning.
Therefore, at each training step, we need to find the partial derivative of $L$ with respect to every weight $w^{(l)}_{ij}$ and every bias $b^{(l)}_{j}$.
Computation Graph
The neural network is equivalent to composing the layer equations $a^{(l)}_{j} = \sum_{i} w^{(l)}_{ij} a^{(l-1)}_{i} + b^{(l)}_{j}$ all the way from the input $x$ to the output $\hat{y}$.
And the loss function is:
$$L = (\hat{y} - y)^2 = (f(x; W, b) - y)^2$$
How do we find the derivative of each parameter through this equation? Moreover, how do we represent this equation in our machine? The answer to this lies in the Computation Graph.
Why do we find the partial derivative of the loss function instead of the cost? We will use a more intuitive approach to find the partial derivative of the cost function. For each parameter $p$, the partial derivative of the cost function with respect to $p$ is simply the average of the partial derivatives of the loss function with respect to $p$ over all examples. Formally, $\frac{\partial C}{\partial p} = \frac{1}{n} \sum_{i=1}^{n} \frac{\partial L^{(i)}}{\partial p}$, where $n$ is the number of training examples.
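The averaging step itself is trivial in code. The per-example gradient values below are made-up numbers purely for illustration:

```python
# The gradient of the cost is the average of the per-example loss gradients.
# The values here are made-up illustrations of dL_i/dp for four examples.

per_example_grads = [0.2, -0.4, 0.8, 0.4]
dC_dp = sum(per_example_grads) / len(per_example_grads)

print(dC_dp)  # 0.25
```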
The computation graph allows us to efficiently represent the equations of the neural network. It breaks down the steps of a large equation into nodes and edges. Each step represented in the computation graph is an atomic operation that lets us use chain rule to find the derivative easily.
Let's look at a really trivial example. Consider the equation $e = wx + b$, with parameters $w$ and $b$, being the equation we're trying to minimize. We can represent it in a computation graph as follows: first a node $u = w \cdot x$, then a node $e = u + b$.
This is the computation graph of the given equation. Essentially, we've deconstructed the equation to resemble how a computer performs computations using binary operators.
With our computation graph, we can start minimizing $e$ by finding $\frac{\partial e}{\partial w}$ and $\frac{\partial e}{\partial b}$.
We will be using backpropagation, starting from the last node of the graph.
- We will start with $\frac{\partial e}{\partial e}$. What is $\frac{\partial e}{\partial e}$? It's simply $1$, because $\frac{\Delta e}{\Delta e}$ equals $1$. If we increase $e$ by a small amount, $\Delta e$, $e$ will also increase by $\Delta e$.
- Next, we will find the derivative of $e$ with respect to $u$, $\frac{\partial e}{\partial u}$. We know that $\frac{\partial e}{\partial u} = 1$, because if we increase $u$ by $\Delta u$, $e = u + b$ also increases by $\Delta u$. With this, we can use the chain rule to find $\frac{\partial e}{\partial u} = \frac{\partial e}{\partial e} \cdot \frac{\partial e}{\partial u} = 1 \cdot 1 = 1$.
- Similarly to the previous step, for $b$, we find $\frac{\partial e}{\partial b} = 1$ too.
- Moving on to $w$: to find $\frac{\partial u}{\partial w}$, when we increase $w$ by $\Delta w$, how much does $u = w \cdot x$ increase? The answer is $x \Delta w$; hence, $\frac{\partial u}{\partial w}$ is $x$. Using the chain rule again, $\frac{\partial e}{\partial w} = \frac{\partial e}{\partial u} \cdot \frac{\partial u}{\partial w} = x$.
- Finally, we can also find $\frac{\partial e}{\partial x} = \frac{\partial e}{\partial u} \cdot \frac{\partial u}{\partial x} = w$. But this doesn't matter, as $x$ is an input rather than a parameter.
This is how we can find $\frac{\partial e}{\partial w}$ and $\frac{\partial e}{\partial b}$ using the computation graph. With those values, we can proceed to perform gradient descent on the parameters.
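The steps above can be written out directly as a forward pass followed by a backward pass. The graph here is my reconstruction of the trivial example, $e = wx + b$ with intermediate node $u = w \cdot x$, and the input values are arbitrary:

```python
# Backpropagation by hand on the tiny graph u = w*x, e = u + b.

w, b, x = 2.0, 1.0, 3.0

# Forward pass: evaluate each node of the computation graph
u = w * x  # multiplication node
e = u + b  # addition node

# Backward pass: walk the graph from the output back to the leaves
de_de = 1.0          # d(e)/d(e) = 1
de_du = de_de * 1.0  # addition passes the gradient through unchanged
de_db = de_de * 1.0  # same for the other addend, b
de_dw = de_du * x    # d(u)/d(w) = x, so chain rule gives x
de_dx = de_du * w    # d(u)/d(x) = w (unused: x is an input, not a parameter)

print(de_dw, de_db)  # 3.0 1.0
```

Note that each local derivative only depends on the node's own operation and the values saved during the forward pass, which is exactly why the graph representation makes backpropagation mechanical.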
Returning to our example,
To minimize the loss function, let's represent it as a computation graph:
For simplicity, we'll denote each node that isn't a leaf node with a different letter.
Then we find the derivative of each node.
Note: In the 19th step, we used the chain rule with one independent variable to find that node's derivative.
And this is how we can find the derivative of each parameter in a neural network using backpropagation! I hope you have enjoyed reading it!