Backpropagation in Neural Networks

This is a note I made for myself while trying to understand how to train a neural network. Although knowledge of calculus and artificial neural networks isn't strictly required, some familiarity with these topics is advisable. It's okay if you don't know any of this (or have forgotten); I will introduce the concepts necessary to understand backpropagation.

A bit of Calculus

What is a derivative?

The derivative of $y$ with respect to $x$, $\frac{dy}{dx}$, represents the rate of change in $y$ for a small change in $x$. In simpler terms, it's the ratio of the impact on $y$ as $x$ increases by a small value, $\epsilon$.

For example, let $y = f(x) = 2x$. When $x = 1$, the value of $y$ is

$$f(x = 1) = 2(1) = 2$$

If we increase $x$ by a really small number, $\epsilon$, say $0.001$, how much would $y$ change?

$$f(1 + 0.001) = 2.002$$

The difference between $f(x + 0.001)$ and $f(x)$ is $0.002$. When we increased $x$ by $0.001$, $y$ increased by $0.002$; the ratio of impact is essentially $2$, because $0.002$ is twice $0.001$, so $\frac{dy}{dx} = 2$.

This can also easily be shown using the definition of the derivative:

$$\frac{dy}{dx} = \lim_{h \rightarrow 0} \frac{f(x+h) - f(x)}{h}$$
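As a quick sanity check, we can approximate this limit numerically with a small `h`, using the $f(x) = 2x$ from the example above:

```python
def f(x):
    return 2 * x

# Approximate dy/dx at x = 1 with a small step h,
# mirroring the limit definition above
h = 1e-6
derivative = (f(1 + h) - f(1)) / h  # very close to 2
```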

Chain Rule

Let's raise the difficulty level a bit.

Let $y = f(g(x))$, and suppose we have $\frac{dg}{dx} = 3$ and $\frac{dy}{dg} = 2$. What is $\frac{dy}{dx}$ now?

This can be seen as:

  • A small change in $x$ corresponds to a ratio of impact on $g$ of $3$.
  • A small change in $g$ corresponds to a ratio of impact on $y$ of $2$.

So, what is the ratio of impact on $y$ for a small change in $x$?

It's just $3 \cdot 2 = 6$. This is basically the chain rule of calculus.

While learning, I came across an intuitive example that makes it easier to understand:

"If a car travels twice as fast as a bicycle and the bicycle is four times as fast as a walking man, then the car travels $2 \times 4 = 8$ times as fast as the man."

Formally, when $y = f(g(x))$,

$$\frac{dy}{dx} = \frac{dg}{dx} \cdot \frac{dy}{dg}$$
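The formula is easy to check numerically. Below is a small sketch with hypothetical functions chosen so that $\frac{dg}{dx} = 3$ and $\frac{dy}{dg} = 2$, as in the example above:

```python
def g(x):
    return 3 * x      # dg/dx = 3

def f(u):
    return 2 * u      # dy/dg = 2

def y(x):
    return f(g(x))    # y = f(g(x)) = 6x, so dy/dx should be 3 * 2 = 6

# Finite-difference check of dy/dx at x = 1
h = 1e-6
dy_dx = (y(1 + h) - y(1)) / h  # very close to 6
```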

Chain Rule with One Independent Variable

Let $y = f(g(x), h(x))$. Assuming we have $\frac{dg}{dx}$, $\frac{\partial f}{\partial g}$, $\frac{dh}{dx}$, and $\frac{\partial f}{\partial h}$, what is $\frac{dy}{dx}$?

We can find $\frac{dy}{dx}$ using the following equation:

$$\frac{dy}{dx} = \frac{dg}{dx} \cdot \frac{\partial f}{\partial g} + \frac{dh}{dx} \cdot \frac{\partial f}{\partial h}$$

To understand why this holds true, let's examine the impact ratio of $x$ on both $g$ and $h$ separately.

When we increase $x$ by $\epsilon$,

  • $g$ increases by $\frac{dg}{dx} \cdot \epsilon$, which in turn increases $y$ by $\left(\frac{dg}{dx} \cdot \epsilon\right) \cdot \frac{\partial f}{\partial g}$
  • $h$ increases by $\frac{dh}{dx} \cdot \epsilon$, which in turn increases $y$ by $\left(\frac{dh}{dx} \cdot \epsilon\right) \cdot \frac{\partial f}{\partial h}$
  • $y$ increases by $\left(\frac{dg}{dx} \cdot \epsilon\right) \cdot \frac{\partial f}{\partial g} + \left(\frac{dh}{dx} \cdot \epsilon\right) \cdot \frac{\partial f}{\partial h}$, which is the same as $\epsilon \left(\frac{dg}{dx} \cdot \frac{\partial f}{\partial g} + \frac{dh}{dx} \cdot \frac{\partial f}{\partial h}\right)$

Hence, the equation above holds.
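Here is a numerical sketch of the same argument, with hypothetical choices $g(x) = x^2$ and $h(x) = 3x$, combined by $f(g, h) = g \cdot h$:

```python
def g(x):
    return x ** 2      # dg/dx = 2x

def h(x):
    return 3 * x       # dh/dx = 3

def f(u, v):
    return u * v       # df/dg = h(x), df/dh = g(x)

x = 2.0

# Chain rule with one independent variable
dg_dx, dh_dx = 2 * x, 3.0
df_dg, df_dh = h(x), g(x)
dy_dx = dg_dx * df_dg + dh_dx * df_dh  # 4*6 + 3*4 = 36

# Finite-difference check: y = f(g(x), h(x)) = 3x^3, so dy/dx = 9x^2 = 36 at x = 2
eps = 1e-6
numeric = (f(g(x + eps), h(x + eps)) - f(g(x), h(x))) / eps
```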

A bit of Neural Networks

Imagine we have a neural network with the following architecture, where each neuron has a linear activation function:

[Figure: the example neural network architecture]

Notation:

  • $x$ is the input of the neural network,
  • $w^{[i]}_{j,k}$ is the $k$-th weight of the $j$-th neuron in the $i$-th layer,
  • $b^{[i]}_j$ is the bias of the $j$-th neuron in the $i$-th layer,
  • and $a^{[i]}_j$ is the activation value of the $j$-th neuron in the $i$-th layer.

All the parameters are assigned random values at initialization.

For simplicity, we can denote all the weights and biases in the neural network with $\textbf{W}$ and $\textbf{B}$, respectively.

The neural network can essentially be expressed as a function $f$ with input $x$ and parameters $\textbf{W}$, $\textbf{B}$:

$$f_{\textbf{W}, \textbf{B}}(x)$$

Now that we have a way to represent the neural network, we would also need a way to tell how well our neural network performed. Let's define the loss function $L$ as:

$$L(f_{\textbf{W}, \textbf{B}}(x), y) = (y - f_{\textbf{W}, \textbf{B}}(x))^2$$

Once we have a loss function, we can also define our cost function $J$:

$$J(\textbf{W}, \textbf{B}) = \frac{1}{m} \sum_{i=1}^m L(f_{\textbf{W}, \textbf{B}}(x^i), y^i)$$

This is simply the mean squared error of our neural network with parameters $\textbf{W}$ and $\textbf{B}$.

To train the parameters of the neural network towards a local minimum of the cost function $J$, we can perform gradient descent at each training step:

$$w^{[i]}_{j,k} = w^{[i]}_{j,k} - \alpha \frac{\partial J}{\partial w^{[i]}_{j,k}}$$

$$b^{[i]}_j = b^{[i]}_j - \alpha \frac{\partial J}{\partial b^{[i]}_j}$$

Here, $\alpha$ denotes the learning rate of the algorithm.

This essentially moves each parameter (along each dimension) down toward a local minimum.
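As a minimal illustration, here is gradient descent on a one-parameter cost $J(w) = (w - 3)^2$, whose minimum is at $w = 3$; the starting point and learning rate below are arbitrary choices:

```python
w = 0.0        # arbitrary initialization
alpha = 0.1    # learning rate

for _ in range(100):
    dJ_dw = 2 * (w - 3)      # derivative of (w - 3)^2
    w = w - alpha * dJ_dw    # gradient descent update

# w has now converged very close to the minimum at 3
```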

Below is a two-dimensional visualization of the gradient descent step on a contour plot.

[Figure: gradient descent steps on a contour plot]

Depending on our randomly initialized weights and biases, we could end up at different local minima. For instance, in the picture above, starting at a would lead us to c. However, starting from b would lead us to d instead.

Hopefully, everything is making sense at this point. This is simply a straightforward explanation of how a neural network is trained. It's pretty much all you need to get started with neural networks.

Backpropagation

Have you ever wondered how we could obtain the value of each derivative used in the gradient descent step?

$$w^{[i]}_{j,k} = w^{[i]}_{j,k} - \alpha \frac{\partial J}{\partial w^{[i]}_{j,k}}$$

$$b^{[i]}_j = b^{[i]}_j - \alpha \frac{\partial J}{\partial b^{[i]}_j}$$

Namely, $\frac{\partial L}{\partial w^{[i]}_{j,k}}$ and $\frac{\partial L}{\partial b^{[i]}_j}$.

One widely used algorithm for computing these derivatives today is backpropagation. As the name implies, we propagate from the back to the front.

To demonstrate backpropagation, we'll utilize the same neural network architecture introduced at the beginning.

Therefore, at each training step, we need to find the partial derivative of $L$ with respect to:

$$w^{[1]}_{1,1}, b^{[1]}_1, w^{[2]}_{1,1}, b^{[2]}_1, w^{[2]}_{2,1}, b^{[2]}_2, w^{[3]}_{1,1}, w^{[3]}_{1,2}, b^{[3]}_1$$

Computation Graph

The neural network, $f_{\textbf{W},\textbf{B}}(x)$, is equivalent to:

$$\begin{align*} f_{\textbf{W},\textbf{B}}(x) =\; & w^{[3]}_{1,1} \left( w^{[2]}_{1,1} \left( w^{[1]}_{1,1} x + b^{[1]}_1 \right) + b^{[2]}_1 \right) + \\ & w^{[3]}_{1,2} \left( w^{[2]}_{2,1} \left( w^{[1]}_{1,1} x + b^{[1]}_1 \right) + b^{[2]}_2 \right) + \\ & b^{[3]}_1 \end{align*}$$
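This nested expression can also be computed layer by layer, which is how we'll think about it when backpropagating. Below is a sketch of the forward pass; the dictionary keys are hypothetical stand-ins for the $w^{[i]}_{j,k}$ and $b^{[i]}_j$ notation:

```python
def forward(x, p):
    """Forward pass of the 1-2-1 linear network, layer by layer."""
    a1 = p["w1_11"] * x + p["b1_1"]        # layer 1 (single neuron)
    a2_1 = p["w2_11"] * a1 + p["b2_1"]     # layer 2, neuron 1
    a2_2 = p["w2_21"] * a1 + p["b2_2"]     # layer 2, neuron 2
    # layer 3 (output neuron)
    return p["w3_11"] * a2_1 + p["w3_12"] * a2_2 + p["b3_1"]
```

For instance, with every weight set to 1 and every bias set to 0, `forward(1.0, p)` evaluates the nested expression above to 2.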

And the loss function is:

$$L(f_{\textbf{W}, \textbf{B}}(x), y) = (y - f_{\textbf{W}, \textbf{B}}(x))^2$$

How do we find the derivative of each parameter through this equation? Moreover, how do we represent this equation in our machine? The answer to this lies in the Computation Graph.

Why do we find the partial derivatives of the loss function instead of the cost? Because there is a more intuitive way to get the partial derivatives of the cost function. For each parameter $i$, the partial derivative of the cost function $J$ with respect to $i$ is simply the average of the partial derivatives of the loss function $L$ with respect to $i$ over all examples. Formally, $\frac{\partial J}{\partial i} = \frac{1}{m} \sum_{k=0}^{m-1} \frac{\partial L}{\partial i}$, where $m$ is the number of training examples.
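In code, this averaging step is just a mean; the per-example gradient values below are made up for illustration:

```python
# Hypothetical per-example values of dL/di for m = 4 training examples
per_example_grads = [0.5, -1.0, 2.0, 0.5]

# dJ/di is simply their mean
dJ_di = sum(per_example_grads) / len(per_example_grads)  # 0.5
```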

The computation graph allows us to represent the equations of the neural network efficiently. It breaks the steps of a large equation down into nodes and edges. Each step represented in the computation graph is an atomic operation, which lets us use the chain rule to find each derivative easily.

Let's look at a really trivial example. Consider the equation $y = mx + c$, with parameters $m$ and $c$, as the equation we're trying to minimize. We can represent it in a computation graph as follows:

[Figure: computation graph of $y = mx + c$]

This is the computation graph of the given equation. Essentially, we've deconstructed the equation to resemble how a computer performs computations using binary operators.

With our computation graph, we can start minimizing $y$ by finding $\frac{\partial y}{\partial m}$ and $\frac{\partial y}{\partial c}$.

We will be using backpropagation, starting from the last node of the graph.

  1. We will start with $mx + c$. What is $\frac{\partial y}{\partial (mx + c)}$? It's simply $1$, because $mx + c$ equals $y$. If we increase $mx + c$ by a small amount, $\epsilon$, $y$ will also increase by $\epsilon$.
  2. Next, we will find the derivative of $y$ with respect to $c$, $\frac{\partial y}{\partial c}$. We know that $\frac{\partial (mx + c)}{\partial c} = 1$, because if we increase $c$ by $\epsilon$, $mx + c$ also increases by $\epsilon$. With this, we can use the chain rule to find $\frac{\partial y}{\partial c}$: $\frac{\partial y}{\partial c} = \frac{\partial (mx + c)}{\partial c} \cdot \frac{\partial y}{\partial (mx + c)} = 1 \cdot 1 = 1$.
  3. Similarly to the previous step, for $mx$, we find $\frac{\partial y}{\partial (mx)} = 1$ too.
  4. Moving on to $m$: to find $\frac{\partial y}{\partial m}$, when we increase $m$ by $\epsilon$, how much does $mx$ increase? The answer is $x \cdot \epsilon$; hence, $\frac{\partial (mx)}{\partial m} = x$. Using the chain rule again, $\frac{\partial y}{\partial m} = \frac{\partial (mx)}{\partial m} \cdot \frac{\partial y}{\partial (mx)} = x \cdot 1 = x$.
  5. Finally, we can also find $\frac{\partial y}{\partial x} = m$. But this doesn't matter, as $x$ is an input rather than a parameter.

This is how we can find $\frac{\partial y}{\partial m}$ and $\frac{\partial y}{\partial c}$ using the computation graph. With those values, we can proceed to perform gradient descent on the parameters.
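The five steps above can be sketched directly in code; the values chosen for `m`, `x`, and `c` are arbitrary:

```python
m, x, c = 2.0, 5.0, 1.0

# Forward pass: store each node of the computation graph
mx = m * x
y = mx + c

# Backward pass, following the numbered steps above
dy_dmxc = 1.0             # step 1: dy/d(mx + c) = 1
dy_dc = 1.0 * dy_dmxc     # step 2: d(mx + c)/dc = 1, chained
dy_dmx = 1.0 * dy_dmxc    # step 3: d(mx + c)/d(mx) = 1, chained
dy_dm = x * dy_dmx        # step 4: d(mx)/dm = x, chained
dy_dx = m * dy_dmx        # step 5: d(mx)/dx = m, chained
```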

Returning to our neural network example, to minimize the loss function, let's represent it as a computation graph:

$$L(f_{\textbf{W}, \textbf{B}}(x), y) = (y - f_{\textbf{W}, \textbf{B}}(x))^2$$

For simplicity, we'll denote each node that isn't a leaf node with a different letter.

[Figure: computation graph of the loss function]

Then we find the derivative of each node.

[Figure: the computation graph with the derivative at each node]

Note: In the 19-th step, we used the chain rule with one independent variable to find $\frac{\partial L}{\partial l}$.

And this is how we can find the derivative of each parameter in a neural network using backpropagation! I hope you have enjoyed reading it!
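To tie everything together, here is a sketch of one full backpropagation pass through the example network's loss, with hypothetical variable names mirroring the notation. Each gradient is built back-to-front with the chain rule, including the one-independent-variable form for $a^{[1]}_1$, which feeds both layer-2 neurons:

```python
# Arbitrary parameter values and one training example (x, y)
x, y = 1.0, 0.0
w1_11, b1_1 = 1.0, 0.0
w2_11, b2_1 = 1.0, 0.0
w2_21, b2_2 = 1.0, 0.0
w3_11, w3_12, b3_1 = 1.0, 1.0, 0.0

# Forward pass, storing every node of the computation graph
a1 = w1_11 * x + b1_1
a2_1 = w2_11 * a1 + b2_1
a2_2 = w2_21 * a1 + b2_2
f = w3_11 * a2_1 + w3_12 * a2_2 + b3_1
L = (y - f) ** 2

# Backward pass, from the loss toward the input
dL_df = -2 * (y - f)                     # d(y - f)^2/df
dL_dw3_11 = dL_df * a2_1                 # layer 3 parameters
dL_dw3_12 = dL_df * a2_2
dL_db3_1 = dL_df
dL_da2_1 = dL_df * w3_11                 # into layer 2
dL_da2_2 = dL_df * w3_12
dL_dw2_11, dL_db2_1 = dL_da2_1 * a1, dL_da2_1
dL_dw2_21, dL_db2_2 = dL_da2_2 * a1, dL_da2_2
# a1 feeds both layer-2 neurons: chain rule with one independent variable
dL_da1 = dL_da2_1 * w2_11 + dL_da2_2 * w2_21
dL_dw1_11, dL_db1_1 = dL_da1 * x, dL_da1
```

With these gradients in hand, one gradient descent step is just `w1_11 -= alpha * dL_dw1_11` and so on for every parameter.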