
# Backpropagation in Neural Networks

This is a note I made for myself while trying to understand how to train a neural network. Knowledge of calculus and artificial neural networks isn't strictly required, though some familiarity with these topics helps. It's okay if you don't know any of this (or have forgotten it); I will introduce the concepts necessary to understand backpropagation.

## A bit of Calculus

### What is a derivative?

The derivative of $y$ with respect to $x$, $\frac{dy}{dx}$, represents the rate of change in $y$ for a small change in $x$. In simpler terms, it's the ratio of the resulting change in $y$ to a small change $\epsilon$ in $x$.

For example, let $y = f(x) = 2x$. When $x=1$, the value of $y$ is

$f(x = 1) = 2(1) = 2$

If we increase $x$ by a really small number, $\epsilon$, say $0.001$, how much would $y$ change?

$f(1 + 0.001) = 2.002$

The difference between $f(1 + 0.001)$ and $f(1)$ is $0.002$. When we increase $x$ by $0.001$, $y$ increases by $0.002$; the ratio of impact is essentially $2$, because $0.002$ is twice the value of $0.001$, so $\frac{dy}{dx} = 2$.

This could also easily be shown using the definition of derivative:

$\frac{dy}{dx} = \lim_{h \rightarrow 0} \frac{f(x+h) - f(x)}{h}$
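This limit can be checked numerically: pick a small $h$ and compute the difference quotient directly. A minimal sketch in Python (the helper `numerical_derivative` is my own illustration, not a standard library function):

```python
def numerical_derivative(f, x, h=1e-6):
    # Approximate dy/dx with the difference quotient (f(x+h) - f(x)) / h
    return (f(x + h) - f(x)) / h

f = lambda x: 2 * x
print(numerical_derivative(f, 1.0))  # ≈ 2.0, up to floating-point error
```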

### Chain Rule

Let's raise the difficulty level a bit.

Let $y = f(g(x))$, and suppose we know that $\frac{dg}{dx} = 3$ and $\frac{dy}{dg} = 2$. What is $\frac{dy}{dx}$ now?

This can be seen as:

- A small change on $x$ corresponds to a ratio of impact on $g$ of $3$.
- A small change on $g$ corresponds to a ratio of impact on $y$ of $2$.

So, what is the ratio of impact for a small change in $x$ to $y$?

It's just $3 \cdot 2 = 6$. This is basically the chain rule of calculus.

While learning, I came across an intuitive example that makes this easier to understand:

"If a car travels twice as fast as a bicycle and the bicycle is four times as fast as a walking man, then the car travels $2 × 4 = 8$ times as fast as the man."

Formally, when $y=f(g(x))$:

$\frac{dy}{dx} = \frac{dg}{dx} \cdot \frac{dy}{dg}$
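We can sanity-check the chain rule with the same difference-quotient trick. Here $g$ and $f$ are made-up functions chosen so that $\frac{dg}{dx} = 3$ and $\frac{dy}{dg} = 2$, matching the example above:

```python
def numerical_derivative(f, x, h=1e-6):
    # Approximate a derivative with the difference quotient
    return (f(x + h) - f(x)) / h

g = lambda x: 3 * x          # dg/dx = 3
f = lambda u: 2 * u          # dy/dg = 2
y = lambda x: f(g(x))        # chain rule predicts dy/dx = 3 * 2 = 6

print(numerical_derivative(y, 1.0))  # ≈ 6.0
```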

### Chain Rule with One Independent Variable

Let $y = f(g(x), h(x))$, assuming we have $\frac{dg}{dx}, \frac{\partial f}{\partial g}, \frac{dh}{dx}, \frac{\partial f}{\partial h}$, what is $\frac{dy}{dx}$?

We can find $\frac{dy}{dx}$ using the equation as follows:

$\frac{dy}{dx} = \frac{dg}{dx}\cdot \frac{\partial f}{\partial g} + \frac{dh}{dx} \cdot \frac{\partial f}{\partial h}$

To understand why this holds true, let's examine the impact ratio of $x$ on both $g$ and $h$ separately.

When we increase $x$ by $\epsilon$,

- $g$ increases by $\frac{dg}{dx} \cdot \epsilon$, which in turn increases $y$ by $\left(\frac{dg}{dx} \cdot \epsilon\right)\cdot \frac{\partial f}{\partial g}$
- $h$ increases by $\frac{dh}{dx} \cdot \epsilon$, which in turn increases $y$ by $\left(\frac{dh}{dx} \cdot \epsilon\right)\cdot \frac{\partial f}{\partial h}$
- $y$ increases by $\left(\frac{dg}{dx} \cdot \epsilon\right)\cdot \frac{\partial f}{\partial g} + \left(\frac{dh}{dx} \cdot \epsilon\right)\cdot \frac{\partial f}{\partial h}$, which is the same as $\epsilon \left(\frac{dg}{dx}\cdot \frac{\partial f}{\partial g} + \frac{dh}{dx} \cdot \frac{\partial f}{\partial h}\right)$

Hence, the equation above holds.
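The two-path sum can also be verified numerically. In this sketch, $g$, $h$, and $f$ are arbitrary examples I picked so that every derivative in the formula is easy to read off:

```python
def numerical_derivative(f, x, eps=1e-6):
    # Approximate a derivative with the difference quotient
    return (f(x + eps) - f(x)) / eps

g = lambda x: 3 * x               # dg/dx = 3
h = lambda x: x ** 2              # dh/dx = 2x
f = lambda u, v: 2 * u + 5 * v    # df/dg = 2, df/dh = 5
y = lambda x: f(g(x), h(x))

x = 1.0
chain_rule = 3 * 2 + (2 * x) * 5  # dg/dx * df/dg + dh/dx * df/dh = 16
print(chain_rule, numerical_derivative(y, x))  # 16.0 vs ≈ 16.0
```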

## A bit of Neural Network

Imagine we have a neural network with the following architecture, where each neuron has a linear activation function:

Notation:

- $x$ is the input of the neural network,
- $w^{[i]}_{j, k}$ is the $k$-th weight of the $j$-th neuron in the $i$-th layer,
- $b^{[i]}_j$ is the bias of the $j$-th neuron in the $i$-th layer,
- and $a^{[i]}_j$ is the activation value of the $j$-th neuron in the $i$-th layer.

All the parameters are assigned random values at initialization.

For simplicity, we can denote all the weights and biases in the neural network with $\textbf{W}$ and ${\textbf{B}}$, respectively.

The neural network can essentially be expressed as a function $f$ with input $x$ and parameters $\textbf{W}$, $\textbf{B}$:

$f_{\textbf{W}, \textbf{B}}(x)$

Now that we have a way to represent the neural network, we also need a way to measure how well it performs. Let's define the loss function $L$ as:

$L(f_{\textbf{W}, \textbf{B}}(x), y) = (y - f_{\textbf{W}, \textbf{B}}(x))^2$

Once we have a loss function, we can also define our cost function $J$:

$J(\textbf{W}, \textbf{B}) = \frac{1}{m} \sum_{i=1}^m L(f_{\textbf{W}, \textbf{B}}(x^i), y^i)$

This is simply the mean squared error of our neural network with parameters $\textbf{W}$ and $\textbf{B}$.
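As a sketch, the loss and cost can be written out in a few lines of Python (the names are my own; `predictions` stands in for the values $f_{\textbf{W}, \textbf{B}}(x^i)$):

```python
def loss(y_pred, y_true):
    # Squared error for a single example: (y - f(x))^2
    return (y_true - y_pred) ** 2

def cost(predictions, targets):
    # Mean squared error over all m examples
    m = len(predictions)
    return sum(loss(p, t) for p, t in zip(predictions, targets)) / m

print(cost([2.0, 3.0], [1.0, 3.0]))  # → 0.5
```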

To train the parameters of the neural network toward a local minimum of the cost function $J$, we perform a gradient descent step at each training iteration:

$w^{[i]}_{j,k} = w^{[i]}_{j,k} - \alpha \frac{\partial J}{\partial w^{[i]}_{j,k}}$

$b^{[i]}_j = b^{[i]}_j - \alpha \frac{\partial J}{\partial b^{[i]}_j}$

Here, $\alpha$ denotes the learning rate of the algorithm.

This moves each parameter (along each dimension) downhill toward a local minimum.
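To see the update rule in action without a full network, here is gradient descent on a one-parameter cost I made up, $J(w) = (w - 3)^2$, whose derivative $\frac{\partial J}{\partial w} = 2(w - 3)$ we can write by hand:

```python
w = 0.0        # the initial parameter (fixed here instead of random, for reproducibility)
alpha = 0.1    # learning rate

for _ in range(100):
    grad = 2 * (w - 3)    # dJ/dw for J(w) = (w - 3)^2
    w = w - alpha * grad  # the gradient descent step

print(w)  # converges toward the minimum at w = 3
```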

Below is a two-dimensional visualization of the gradient descent step on a contour plot.

Depending on our randomly initialized weights and biases, we could end up at different local minima. For instance, in the picture above, starting at `a` would lead us to `c`. However, starting from `b` would lead us to `d` instead.

Hopefully, everything is making sense at this point. This is simply a straightforward explanation of how a neural network is trained. It's pretty much all you need to get started with neural networks.

## Backpropagation

Have you ever wondered how we could obtain the value of each derivative used in the gradient descent step?

$w^{[i]}_{j,k} = w^{[i]}_{j,k} - \alpha \frac{\partial J}{\partial w^{[i]}_{j,k}}$

$b^{[i]}_j = b^{[i]}_j - \alpha \frac{\partial J}{\partial b^{[i]}_j}$

Namely, $\frac{\partial J}{\partial w^{[i]}_{j,k}}$ and $\frac{\partial J}{\partial b^{[i]}_j}$, which in turn come from the per-example derivatives $\frac{\partial L}{\partial w^{[i]}_{j,k}}$ and $\frac{\partial L}{\partial b^{[i]}_j}$.

One widely used algorithm for computing these derivatives is **Backpropagation**. As the name implies, we propagate derivatives from the back (the output) to the front (the inputs).

To demonstrate backpropagation, we'll utilize the same neural network architecture introduced at the beginning.

Therefore, at each training step, we need to find the partial derivative of $L$ with respect to:

$w^{[1]}_{1,1}, b^{[1]}_1, w^{[2]}_{1,1}, b^{[2]}_1, w^{[2]}_{2,1}, b^{[2]}_2, w^{[3]}_{1,1}, w^{[3]}_{1,2}, b^{[3]}_1$

### Computation Graph

The neural network, $f_{\textbf{W},\textbf{B}}(x)$, is equivalent to:

And the loss function is:

$L(f_{\textbf{W}, \textbf{B}}(x), y) = (y - f_{\textbf{W}, \textbf{B}}(x))^2$

How do we find the derivative of each parameter through this equation? Moreover, how do we represent this equation in our machine? The answer to this lies in the **Computation Graph**.

Why do we find the partial derivative of the loss function instead of the cost? For each parameter $\theta$, the partial derivative of the cost function $J$ with respect to $\theta$ is simply the average, over all training examples, of the partial derivative of the loss function $L$ with respect to $\theta$. Formally, $\frac{\partial J}{\partial \theta} = \frac{1}{m} \sum_{i=1}^{m} \frac{\partial L(f_{\textbf{W}, \textbf{B}}(x^i), y^i)}{\partial \theta}$, where $m$ is the number of training examples. So once we can differentiate the loss for a single example, the cost follows immediately.

The computation graph allows us to efficiently represent the equations of the neural network. It breaks down the steps of a large equation into nodes and edges. Each step represented in the computation graph is an atomic operation, which lets us use the chain rule to find the derivative easily.

Let's look at a really trivial example. Consider the equation $y = mx + c$ with parameters $m$ and $c$ being the equation we're trying to minimize. We can represent it in a computation graph as follows:

This is the computation graph of the given equation. Essentially, we've deconstructed the equation to resemble how a computer performs computations using binary operators.

With our computation graph, we can start minimizing $y$ by finding $\frac{\partial y}{\partial m}$ and $\frac{\partial y}{\partial c}$.

We will be using backpropagation, starting from the last node of the graph.

- We will start with $mx + c$. What is $\frac{\partial y}{\partial (mx + c)}$ ? It's simply $1$ because $mx + c$ equals $y$. If we increase $mx + c$ with a small amount, $\epsilon$, $y$ will also increase by $\epsilon$.
- Next, we will find the derivative of $y$ with respect to $c$, $\frac{\partial y}{\partial c}$. We know that $\frac{\partial (mx + c)}{\partial c} = 1$, because if we increase $c$ by $\epsilon$, $mx + c$ also increases by $\epsilon$. With this, we can use the chain rule to find $\frac{\partial y}{\partial c}$: $\frac{\partial y}{\partial c} = \frac{\partial (mx + c)}{\partial c} \cdot \frac{\partial y}{\partial (mx + c)} = 1 \cdot 1 = 1$.
- Similarly to the previous step, for $mx$, we find $\frac{\partial y}{\partial (mx)} = 1$ too.
- Moving on to $m$: to find $\frac{\partial y}{\partial m}$, when we increase $m$ by $\epsilon$, how much does $mx$ increase? The answer is $\epsilon x$, hence $\frac{\partial (mx)}{\partial m} = x$. Using the chain rule again, $\frac{\partial y}{\partial m} = \frac{\partial (mx)}{\partial m} \cdot \frac{\partial y}{\partial (mx)} = x \cdot 1 = x$.
- Finally, we can also find $\frac{\partial y}{\partial x} = m$. But this value isn't needed, as $x$ is an input rather than a parameter.

This is how we can find $\frac{\partial y}{\partial m}$ and $\frac{\partial y}{\partial c}$ using the computation graph. With those values, we can proceed to perform gradient descent on the parameters.
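The steps above translate almost line-for-line into code. A sketch of the forward and backward passes for $y = mx + c$, with made-up values for $m$, $x$, and $c$:

```python
m, x, c = 2.0, 5.0, 1.0

# Forward pass: one atomic operation per node of the graph
mx = m * x       # multiplication node
y = mx + c       # addition node

# Backward pass: start from the last node, where dy/dy = 1
dy_dy = 1.0
dy_dmx = 1.0 * dy_dy   # d(mx + c)/d(mx) = 1, then chain rule
dy_dc = 1.0 * dy_dy    # d(mx + c)/dc = 1, then chain rule
dy_dm = x * dy_dmx     # d(mx)/dm = x, then chain rule
dy_dx = m * dy_dmx     # d(mx)/dx = m (unused: x is an input, not a parameter)

print(dy_dm, dy_dc)  # → 5.0 1.0
```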

Returning to our example,

To minimize the loss function, let's represent the loss function as a computation graph:

$L(f_{\textbf{W}, \textbf{B}}(x), y) = (y - f_{\textbf{W}, \textbf{B}}(x))^2$

For simplicity, we'll denote each node that isn't a leaf node with a different letter.

Then we find the derivative of each node.

Note: In the 19th step, we use the chain rule with one independent variable to find $\frac{\partial L}{\partial l}$.
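To close the loop, here is the same node-by-node backpropagation applied to the squared-error loss of a single linear neuron, $a = wx + b$ — a smaller network than the one above, with made-up parameter values, and sanity-checked against a difference quotient at the end:

```python
def forward(w, b, x, y):
    # Forward pass, one atomic operation per node
    a = w * x + b    # the neuron's (linear) activation
    d = y - a        # difference node
    L = d ** 2       # loss node: L = (y - a)^2
    return a, d, L

w, b, x, y = 0.5, 0.1, 2.0, 1.5
a, d, L = forward(w, b, x, y)

# Backward pass, from the loss toward the parameters
dL_dd = 2 * d            # dL/dd for L = d^2
dL_da = -1.0 * dL_dd     # d(y - a)/da = -1, then chain rule
dL_dw = x * dL_da        # da/dw = x, then chain rule
dL_db = 1.0 * dL_da      # da/db = 1, then chain rule

# Check dL/dw against the difference quotient
h = 1e-6
dL_dw_numeric = (forward(w + h, b, x, y)[2] - L) / h
print(dL_dw, dL_dw_numeric)  # both ≈ -1.6
```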

And this is how we can find the derivative of each parameter in a neural network using backpropagation! I hope you have enjoyed reading it!