# Backpropagation calculus | Deep learning, chapter 4

The hard assumption here is that you’ve watched part 3, which gives an intuitive walkthrough of the backpropagation algorithm. Here, we get a bit more formal and dive into the relevant calculus. It’s normal for this to be a little confusing, so the mantra to regularly pause and ponder certainly applies as much here as anywhere else. Our main goal is to show how people in machine learning commonly think about the chain rule from calculus in the context of networks, which has a different feel from how most introductory calculus courses approach the subject. For those of you uncomfortable with the relevant calculus, I do have a whole series on the topic.

Let’s start off with an extremely simple network, one where each layer has a single neuron in it. So this particular network is determined by 3 weights and 3 biases, and our goal is to understand how sensitive the cost function is to these variables. That way we know which adjustments to these terms are going to cause the most efficient decrease to the cost function. To start, we’re going to focus just on the connection between the last two neurons.

Let’s label the activation of that last neuron a with a superscript L, indicating which layer it’s in, so the activation of the previous neuron is a^(L-1). These are not exponents; they’re just a way of indexing what we’re talking about, since I want to save subscripts for different indices later on. Let’s say that the value we want this last activation to be for a given training example is y. For example, y might be 0 or 1. So the cost of this simple network for a single training example is (a^(L) – y)^2. We’ll denote the cost of this one training example as C_0.

As a reminder, this last activation is determined by a weight, which I’m going to call w^(L), times the previous neuron’s activation, plus some bias, which I’ll call b^(L); then you pump that through some special nonlinear function like a sigmoid or a ReLU. It’s actually going to make things easier for us if we give a special name to this weighted sum, like z, with the same superscript as the relevant activations.

So there are a lot of terms. A way you might conceptualize this is that the weight, the previous activation, and the bias all together are used to compute z, which in turn lets us compute a, which finally, along with the constant y, lets us compute the cost. And of course, a^(L-1) is influenced by its own weight and bias, and so on, but we’re not going to focus on that right now. All of these are just numbers, right? And it can be nice to think of each one as having its own little number line.

Our first goal is to understand how sensitive the cost function is to small changes in our weight w^(L). Or, phrased differently, what’s the derivative of C with respect to w^(L)? When you see this “∂w” term, think of it as meaning “some tiny nudge to w”, like a change by 0.01. And think of this “∂C” term as meaning “whatever the resulting nudge to the cost is”. What we want is their ratio. Conceptually, this tiny nudge to w^(L) causes some nudge to z^(L), which in turn causes some change to a^(L), which directly influences the cost. So we break this up by first looking at the ratio of a tiny change in z^(L) to the tiny change in w^(L); that is, the derivative of z^(L) with respect to w^(L). Likewise, you then consider the ratio of a change in a^(L) to the tiny change in z^(L) that caused it, as well as the ratio between the final nudge to C and this intermediate nudge to a^(L).
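To make the “tiny nudge” picture concrete, here is a minimal numerical sketch of this one-neuron-per-layer chain. The sigmoid choice and the specific values for w, b, a_prev, and y are my own, not from the video; the printed ratio is the finite-difference version of the derivative we’re after.

```python
import math

# A sketch of the single chain described above, with made-up values.
def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

w, b = 0.7, -0.3     # w^(L) and b^(L)
a_prev = 0.6         # a^(L-1), held fixed for now
y = 1.0              # desired output for this training example

def cost(w):
    z = w * a_prev + b       # z^(L) = w^(L) a^(L-1) + b^(L)
    a = sigmoid(z)           # a^(L) = sigma(z^(L))
    return (a - y) ** 2      # C_0 = (a^(L) - y)^2

# "Some tiny nudge to w" and "the resulting nudge to the cost":
# their ratio approximates the derivative dC_0/dw^(L).
dw = 0.01
print((cost(w + dw) - cost(w)) / dw)
```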
This right here is the chain rule, where multiplying together these three ratios gives us the sensitivity of C to small changes in w^(L). There are kind of a lot of symbols on screen right now, so take a moment to make sure it’s clear what they all are, because now we’re going to compute the relevant derivatives.

The derivative of C with respect to a^(L) works out to be 2(a^(L) – y). Notice, this means its size is proportional to the difference between the network’s output and the thing we want it to be, so if that output is very different, even slight changes stand to have a big impact on the cost function. The derivative of a^(L) with respect to z^(L) is just the derivative of our sigmoid function, or whatever nonlinearity you choose to use. And the derivative of z^(L) with respect to w^(L), in this case, comes out to be just a^(L-1).

Now I don’t know about you, but I think it’s easy to get stuck head-down in these formulas without taking a moment to sit back and remind yourself what they all actually mean. In the case of this last derivative, the amount that a small nudge to this weight influences the last layer depends on how strong the previous neuron is. Remember, this is where that “neurons that fire together wire together” idea comes in.

And all of this is the derivative with respect to w^(L) only of the cost for a specific training example. Since the full cost function involves averaging together all those costs across many training examples, its derivative requires averaging this expression over all training examples. And of course that is just one component of the gradient vector, which itself is built up from the partial derivatives of the cost function with respect to all those weights and biases. But even though it was just one of the many partial derivatives we need, it’s more than 50% of the work.

The sensitivity to the bias, for example, is almost identical. We just need to change out this ∂z/∂w term for a ∂z/∂b, and if you look at the relevant formula, that derivative comes out to be 1. Also, and this is where the idea of propagating backwards comes in, you can see how sensitive this cost function is to the activation of the previous layer; namely, this initial derivative in the chain rule expansion, the sensitivity of z to the previous activation, comes out to be the weight w^(L). And even though we won’t be able to directly influence that activation, it’s helpful to keep track of, because now we can just keep iterating this chain rule idea backwards to see how sensitive the cost function is to previous weights and previous biases.

You might think this is an overly simple example, since all layers have just 1 neuron, and that things are going to get exponentially more complicated in a real network. But honestly, not that much changes when we give the layers multiple neurons. Really, it’s just a few more indices to keep track of. Rather than the activation of a given layer simply being a^(L), it’s also going to have a subscript indicating which neuron of that layer it is. Let’s go ahead and use the letter k to index the layer (L-1), and j to index the layer (L).

For the cost, again we look at what the desired output is. But this time we add up the squares of the differences between these last-layer activations and the desired outputs; that is, you take a sum over j of (a_j^(L) – y_j)^2. Since there are a lot more weights, each one has to have a couple more indices to keep track of where it is.
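As a sanity check, here is a continuation of the earlier sketch (same assumed sigmoid and made-up values) that writes out those three ratios analytically and multiplies them together; the weight result should land close to the finite-difference nudge printed above, and the bias and backwards-step sensitivities follow the same pattern.

```python
import math

# Same assumed setup as before: sigmoid nonlinearity, made-up values.
def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

w, b, a_prev, y = 0.7, -0.3, 0.6, 1.0

z = w * a_prev + b
a = sigmoid(z)

dC_da = 2 * (a - y)                    # dC_0/da^(L)     = 2(a^(L) - y)
da_dz = sigmoid(z) * (1 - sigmoid(z))  # da^(L)/dz^(L)   = sigma'(z^(L))
dz_dw = a_prev                         # dz^(L)/dw^(L)   = a^(L-1)
dz_db = 1.0                            # dz^(L)/db^(L)   = 1
dz_da_prev = w                         # dz^(L)/da^(L-1) = w^(L)

print("dC/dw      =", dC_da * da_dz * dz_dw)       # sensitivity to the weight
print("dC/db      =", dC_da * da_dz * dz_db)       # sensitivity to the bias
print("dC/da_prev =", dC_da * da_dz * dz_da_prev)  # the backwards step
```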
So let’s call the weight of the edge connecting this k-th neuron to the j-th neuron w_{jk}^(L). Those indices might feel a little backwards at first, but it lines up with how you’d index the weight matrix that I talked about in the Part 1 video. Just as before, it’s still nice to give a name to the relevant weighted sum, like z, so that the activation of the last layer is just your special function, like the sigmoid, applied to z. You can kind of see what I mean, right? These are all essentially the same equations we had before in the one-neuron-per-layer case; it just looks a little more complicated.

And indeed, the chain-rule derivative expression describing how sensitive the cost is to a specific weight looks essentially the same. I’ll leave it to you to pause and think about each of these terms if you want.

What does change here, though, is the derivative of the cost with respect to one of the activations in the layer (L-1). In this case, the difference is that the neuron influences the cost function through multiple paths. That is, on the one hand, it influences a_0^(L), which plays a role in the cost function, but it also has an influence on a_1^(L), which also plays a role in the cost function. And you have to add those up.

And that… well, that is pretty much it. Once you know how sensitive the cost function is to the activations in this second-to-last layer, you can just repeat the process for all the weights and biases feeding into that layer. So pat yourself on the back! If all of this makes sense, you have now looked deep into the heart of backpropagation, the workhorse behind how neural networks learn. These chain rule expressions give you the derivatives that determine each component in the gradient that helps minimize the cost of the network by repeatedly stepping downhill. Hhhhpf. If you sit back and think about all that, that’s a lot of layers of complexity to wrap your mind around, so don’t worry if it takes time for your mind to digest it all.
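To see that “multiple paths” sum concretely, here is a sketch with explicit j and k loops mirroring the index notation. The toy sizes (2 neurons in each of the last two layers) and the random values are my own; the inner loop over j is exactly the adding-up of paths described above.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

a_prev = np.array([0.3, 0.9])    # a_k^(L-1)
W = rng.standard_normal((2, 2))  # W[j, k] = w_jk^(L)
b = rng.standard_normal(2)       # b_j^(L)
y = np.array([1.0, 0.0])         # desired outputs y_j

z = W @ a_prev + b               # z_j^(L) = sum_k w_jk^(L) a_k^(L-1) + b_j^(L)
a = sigmoid(z)                   # a_j^(L), with C_0 = sum_j (a_j^(L) - y_j)^2

# dC/da_k^(L-1): each previous-layer neuron influences the cost through
# every neuron j of layer L, so the contributions get added up.
dC_da_prev = np.zeros(2)
for k in range(2):
    for j in range(2):
        dC_da_prev[k] += 2 * (a[j] - y[j]) * sigmoid(z[j]) * (1 - sigmoid(z[j])) * W[j, k]

print(dC_da_prev)
```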

## Comments (100)
3Blue1Brown says:

Two things worth adding here:
1) In other resources and in implementations, you'd typically see these formulas in some more compact vectorized form, which carries with it the extra mental burden to parse the Hadamard product and to think through why the transpose of the weight matrix is used, but the underlying substance is all the same.

2) Backpropagation is really one instance of a more general technique called "reverse mode differentiation" to compute derivatives of functions represented in some kind of directed graph form.
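For reference, here is a minimal sketch of what that vectorized form can look like, assuming the sigmoid activations and squared-error cost from the video; the layer sizes, seed, and variable names are made up. The `*` applies the Hadamard product, and the transposed weight matrix `W.T` shows up when propagating the error backwards.

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1 - s)

sizes = [4, 3, 2]  # toy network: 4 inputs, one hidden layer, 2 outputs
Ws = [rng.standard_normal((m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
bs = [rng.standard_normal(m) for m in sizes[1:]]

x = rng.standard_normal(sizes[0])
y = np.array([1.0, 0.0])

# Forward pass, keeping every z and a for the backward pass.
zs, activations = [], [x]
for W, b in zip(Ws, bs):
    zs.append(W @ activations[-1] + b)
    activations.append(sigmoid(zs[-1]))

# Backward pass: '*' is the Hadamard product, W.T the transposed weights.
delta = 2 * (activations[-1] - y) * sigmoid_prime(zs[-1])  # output-layer error
grads_W, grads_b = [], []
for l in range(len(Ws) - 1, -1, -1):
    grads_W.insert(0, np.outer(delta, activations[l]))      # dC/dW^(l)
    grads_b.insert(0, delta)                                # dC/db^(l)
    if l > 0:
        delta = (Ws[l].T @ delta) * sigmoid_prime(zs[l-1])  # propagate back

print([g.shape for g in grads_W])
```

Unrolled, these are exactly the per-index chain-rule expressions from the video, just computed for a whole layer at once.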

Srijan says:

So, I get that the desired output (y) for the neurons in the output layer can either be a 0 or a 1. But what is the desired output (y) when calculating the gradients for the second to last layer of neurons? What activation do we actually desire for the layer behind the output layer?

Ivens da Costa says:

Simply fantastic!!!!

Samarth Singla says:

The amount of help you are providing is nothing short of amazing.

Guilherme Viveiros says:

This 4-video series was very helpful and your explanations are awesome!
Thank you!

Prashamsh Takkalapally says:

One of the best lectures I have ever heard. Great explanation of NN, cost functions, activation functions etc. Now I understand NN far far better…(P.S. I saw previous videos Part 1, 2,3 as well)

arli says:

is the cost function here the loss function or the averaged loss function?

Zetapology says:

Without this I never would've been able to make my first neural network, even though all it did was learn how to respond to Rock Paper Scissors when it already knows which one you're going to put (basically overfitting is the goal)

mukund holo says:

this is math at the best and art too at the highest. the grace of the animation, the subtle music, the perfectly paced narration and the wonderful colour scheme! math and art or let's say math is art!

Subramanya M says:

Thanks for the wonderful lectures. Expecting more lectures in this field..

Hasindu Piyumantha says:

Great video series… Thanks a lot for explaining something very complex so nicely…

brunon554 says:

Wow, so clear, thanks 😀
I'm not sure why the derivative of z(L) with respect to w(L) is a(L-1) though :/
Could anyone explain it to me? 🙂

Michael Corley says:

It's great that you started it backwards but when I started to program it I realized something. What is a^(L-1) when computing the first layer? Is it simply the value from the inputs?

z^(L) = w^(L) a^(L-1) + b^(L)

a^(L) = f(z^(L))

Access2Music says:

Thank you so so so much for making this series. Within an hour, I feel that I have learned a good deal about Neural Networks. You are amazing!

Billy Kotsos says:

This video is art

Samia Zaman says:

Thank you. Love it.

남재현 says:

1:55 BL?

duncpol says:

Is there any learning material available on the internet for a simple neural net which goes step by step, computing actual values (cost functions, derivations) for all weights, biases and iterations? That would be very practical.

Sarim Mehdi says:

you are not a good teacher. You failed to extrapolate from the simple example to the general one. You should give an example with a 2 layer network with the cost function and activation function and then show, explicitly by writing down, how the derivatives go all the way back.

Tn Inventor says:

now i know the basic i will let it sink then watch how to code it using python

Luis Ka says:

4:10 haha i laughed for no reason at the thoo

Jagdeepak Rawat says:

Hey, can you make a video for RNNs? It would be of great help.

Sergey Piterman says:

This is SO MUCH computation. But amazing explanation, can't wait to implement.

Grunion Shaftoe says:

This video doesn't actually suggest how one chooses a value to add to the weights, and the propagation seems to move forward to the first layer only – how are alterations added to the second and third layers?

Dream Worker 65524 says:

You easily won one more subscriber.

Alex Berk says:

Maybe one day i will be smart enough to understand it all

Ricardo Solano says:

Amazing stuff. Just… GREAT. I Cannot thank you enough!

Achraf Saad says:

At like 8:25, why is C0 a sum going from j = 0 to nL−1, with nL−1 being the number of neurons of the last hidden layer, and then j being used in the output's subscript (a_j)? Is that a mistake or am I missing something?

Jeff Hubbard says:

So with stochastic gradient descent, would you only change some of the weight values each iteration in the training phase?

Chris H says:

Wow, this took a long time to get my head around fully, but I was finally able to understand it enough to implement my own version of backpropagation from scratch thanks to this video! Neural networks are something I've wanted to get into for a while and I'm really grateful for these wonderful in-depth explanations!

Sreesha Srinivasan Kuruvadi says:

This is epic !

Sanjay Thorat says:

@3Blue1Brown, @9:25, I think wl+1 should be just wl. Please confirm

Jonathan Kane says:

Are there anymore coming in the series? I found this very helpful.

Zachary Thatcher says:

That unexplained little formula addition at the end (9:30), showing the partial derivative of the cost function with respect to the current node and layer when the current layer is not the output layer, really messed with me. In typical notation, that's a lowercase delta. Correct?

Gustavo Rocha says:

You are genius.

Juuso Korhonen says:

I have a problem with my neural net, which I built from scratch following these videos. Most of the time my net gives pretty sure answers like [0.999987, 0.000323] (until now I've only tested with self-created data, like input [1,2,3,4] should give [1,0] as an output), but sometimes, with different initializations of weights and biases, the training ends with the feedforward giving some strange answers like [0.004, 0.000003]. There's still a clear distinction between the probability of the right answer and the wrong answer, but it is nowhere near the optimal sure answer [1,0]. What is going on here? Is it that my gradient descent finds a local minimum which gives as an output the said [0.004, 0.000003] and gets stuck there? Is this a common problem with neural nets? Should I try to find initial configurations for which the local minimum gives the least error?

Gaurang Mohta says:

This is the best explanation of the chain rule I've ever heard!!

aam dae says:

Hope to see one more episode soon on the same thing in matrix notation, which would make it easier to relate to an actual implementation

Mrrajender2801 says:

Many guys claim to know. Some guys actually know. But only one guy actually knows and can explain to his grandma as well with very beautiful animations. You are that ONE !!!

Louis Emery says:

I wish I saw this video much earlier since I'm good at chain rule and also optimization problems. I attended a lecture on neural networks in 1984. I didn't really understand how one can determine weights without fitting. Looked to me like back propagation was a swindle until I saw this video. Now I can show my friends using a couple of lines on the whiteboard.

David Okao says:

This is so beautiful

Danil Kutny says:

I have been trying to understand how to host a 'hello_world' python server for about a week and I still don't understand. I have watched your 4 videos a few times and made my own neural network that can understand the world. Man, I wish people who teach were at least 10% as good at teaching as you

Virajkumar Patel says:

Thank you for such a great video!

basir sedighi says:

i watched, i learned, i became a patreon *meme of Fry saying "have my money"*

Nepali Podcaster says:

Just about an hour ago, I was totally alien to AI guys especially when they said "machine learns". Hats off to your selflessness making even a medico able to understand how a machine actually learns, relatively easier when I compared it to natural neural networks in our nervous system. Mark my words, with raging AI in healthcare, your videos will be a connecting link for someone away from AI, to know about how AI works.

Noah K says:

Hey for all of you getting discouraged because you don’t understand this – that was me last year. I went and taught myself derivatives and came back to try again and suddenly I understand everything. It’s such an amazing feeling to see that kind of work pay off. Don’t give up kiddos

VinTechTalk says:

going through your video is like meditation… a blissful experience.. thank you so much!

Marvin J says:

brilliantly clear! love it! It really helps!

Arnold Marsh says:

Helped a lot when I was doing my ML homework. thanks <3

shismohammad mulla says:

Need more videos on AI 😃

Erika Gutierrez says:

best series!! Thanks for this material!

Ahmed Elsayes says:

No words can express my admiration with your work

Pranav Chaturvedi says:

Congratulations, fellow learner, on making it this far. You are/are going to be a good Machine Learning Engineer. (I am just telling myself that.)

Gellért Tóth says:

I understand what happens on the first level, but then what? We have the derivative for the a(l-1)-s and we recalculate another cost func, but now with (dC/da(l-1))^2 instead of (a(l)-y)^2?

Ophir Gal says:

I would be super interested in a video of you explaining how you make your videos 🙂

Chris Hansen says:

Commenting to help you with the youtube algorithm because these videos are great

JAM JAM says:

I love this channel. Thank you

Hago ASL says:

this was really good

Karthik Bodapati says:

Awesome! This channel is unbelievably good. Thanks a lot man

kim khanh Le says:

Woww, now I understand

UUMatter_010 says:

I'm currently in 11th grade learning calculus, and this video gave me an awesome "WTF, it makes sense" moment; I had to think about it for some time tho. Awesome video!

Analytical1 says:

Can someone explain to me how to continue updating the weights? To update the weights before the output layer you do ErrorFunc' * ActivationFunc' * a(L-1) = nudges to W(L)

From the video:
∂C/∂a(L) * ∂a(L)/∂z(L) * ∂z(L)/∂w(L)

How do you extend the chain? Do you multiply that chain with W(L-1), OR W(L-1) and a(L-2) as well since the previous update ended in multiplying by a(L-1).

So:
ErrorFunc' * ActivationFunc' * a(L-1) * W(L-1) * a(L-2) = nudges to W(L-1)
In notation:
∂C/∂a(L) * ∂a(L)/∂z(L) * ∂z(L)/∂w(L) *  ∂a(L-2)/∂z(L-1) * ∂z/∂w(L-1)

Giorgio Acquati says:

This is easily my favorite youtube channel! Why not continue the series on something like convolutional neural networks?

Omar Alsabbagh says:

Amazing Explanation.

Jar汁機 says:

Number of times I have watched this video before understanding

cheeseman says:

I just learned chain rule in high school last week and I'm glad it has a real life application.

Omar CHIDA says:

My brain neurons stopped firing after watching this series xD.
Just a joke, you did a great job explaining it and I understand it very well. Thanks for making these videos.

Alex Talbot says:

Completely understood the logic, but today I find out just how bad my maths skills are.

agpxnet says:

Great explanation, compliments.

Sriprad Potukuchi says:

I finally worked my way through the series! It really helps to understand the underlying mechanics of neural nets rather than just copy-pasting formulas into your code. I am making my own neural net with this knowledge. Wish me luck!
Thank you, Grant, for these amazing videos.

Justin Reusnow says:

Fantastic series. Please consider making one on convolutional neural networks as well! Your style of delivery would make it more perceptible than most sources.

Logic Facts says:

born-to-be-useless moron explains simple things as complicated and messy as possible. this is a rule valid for the internet in general

CiniCraft says:

MY_BRAIN.EXE HAS STOPPED RESPONDING, CHECK THE ERROR LOG FOR MORE DETAILS

Aron Highgrove says:

Too many colors, too much movement. You need to cut it down to the essential ideas, and leave further "enrichments" to other videos or additional notes.

Qutubkhan Vajihi says:

This is unbelievably good! Thank you so much.

OrangeXenon54 says:

This saved my life and made so much sense! Why can't more teachers be ACTUAL teachers like you instead of just assuming you know everything?!

Thush Ish says:

Sorry just 1 quick question:

If many instances of the training data are used to calculate the cost function, how do we know the value of the activation of the previous layer?

I'm approaching this as a way to change the weight matrix from the cost and the output, so if that is incorrect, please let me know.
Thank you for your support!

Jus Lee says:

best explanation ever!

Tinil0 says:

That feeling when you never took past college algebra and yet spend hours watching 3Blue1Brown <.<

Ben Yonas says:

My only question is what do you do with the partial derivatives. Where do you go from there?

Aditya Mishra says:

Clicky stuff!!!

Pedro Rodrigues says:

Thank you!

Justin Moore says:

The most simple answer is to squish the bug. Eh from dust to dust. Anyone have any good memes?

qwert 9203 says:

9:28 sums up the whole thing

William Romero Avilá says:

Wombat-_l

Kevin Oduor says:

Backpropagation is hard for me to understand. Why not just discard the "wrong" neurons?

Prottoy Nahian says:

how does one minimize a function if the weights and biases keep changing?

Salamanka says:

Shit, this episode screws me

Salamanka says:

I watched this video again, and I do not agree with you. I think the formula at 03:45 was meaningless, so the whole video is completely meaningless after that.

Kreavita says:

i made a neural network of this kind in c++ from scratch and let it train on the mnist dataset with mini batches (100 samples), but it has been training for 3 hours already and hasn't even made it halfway through the dataset. is this normal or is there likely an error in the code? im running it on one thread and my cpu isn't the best (haswell i5)

Muhammed Aydoğan says:

thatssss pretty cool and helped me a lot. Thankssss. (Im parsssselmouth)

the impractical transhumanist says:

looks like the A.I. researchers are wide of the mark on how intelligence works.

Alex Jones says:

Beautiful! Thank you so much

Archit Jain says:

Solve a numerical

Zauber Flecks says:

You are the best lecturer that ever existed

Hans Dieter says:

I would definitely hire you as a teacher. The content is easy to grasp, although I am not a native English speaker.

aulas4you says:

great content! Very good class!

A Random JEE Aspirant says:

What did it cost?

just a function.

Manoj Pawar SJ says:

When will you make a video about CNNs and RNNs?