Forward Propagation, Backward Propagation, and Computational Graphs - English Version

Jun '20

Andreas_Terzis

small typo: “requried” -> “required”

Thanks again for the great resource you’ve put together!

Jun '20

StevenJokes

n×m
How to compute the memory footprint? What is in the memory?
Computational Graph of Backpropagation? Try to track all latest two times’ calculate.
Now, I can’t. How to parallelly train?
Good for updating quickly.Easy to fall into a local extremum.

Jul '20

Why does the backpropagation (I mean the way of calculating gradients) have to go in the opposite direction of forward propagation? Can’t we calculate gradients in the forward propagation, which cost less memory?

3 replies

Jul '20 ▶ RaphaelCS

StevenJokes

The purpose of backward propagation is to get near to, even same as the labels we gave.
We only calulate gradients when we know the distance of prediction and reality.
This is the backward propagation what really does.

Jul '20 ▶ RaphaelCS

goldpiggy

Hi @RaphaelCS, great question! Please find more details at https://mxnet.apache.org/api/architecture/note_memory to see how we manage memory efficiently.

3 replies

Jul '20 ▶ goldpiggy

RaphaelCS

Thanks for the reference !

Aug '20

skywalker_H

I want to ensure some kind of achievement about back propagation which is equivalent to the normal way from my perspective. I think we could do merely the partial derivation and value calculation during the forwarding pass and multiplication during the back propagation. Is it an effective, or to say another equivalent way to achieve BP? Or they are different from some aspects?

1 reply

Aug '20 ▶ skywalker_H

StevenJokes

I think we could do merely the partial derivation and value calculation during the forwarding pass and multiplication during the back propagation.

What is different with BP?

1 reply

Aug '20 ▶ StevenJokes

skywalker_H

BP do the partial derivation during back propogation step rather than the forwarding step? I just want to ensure wether these two methods have the same effect. Thanks for your reply.

1 reply

Aug '20 ▶ skywalker_H

StevenJokes

The most important thing, in my opinion, is to find out which elements we need to calculate the partial derivation. Back propogation is only the way to find it. Am I right?

1 reply

Aug '20

skywalker_H

I agree with you and now I think two ways mentioned above are actually the same .

Aug '20 ▶ RaphaelCS

Anand_Ram

For BP to calculate the gradients with respect to the current loss, we need to actually do the forward pass and get the loss with the current parameters, right ? How can BP be done in the forward pass if we don’t even have the loss for the current parameters.

1 reply

Aug '20 ▶ Anand_Ram

RaphaelCS

In my opinion, we can calculate the partial derivative of h1 w.r.t. x and the partial derivative of h2 w.r.t.h1, then we can obtain the partial derivative of h2 w.r.t. x by multiplying the previous two partial derivatives. I think these calculations can all be done in forwarding steps. (x refers to input, h1 and h2 refer to two hidden layers)
If I’m not mistaken, we can keep calculating in this manner until the loss, thus all in forward stage.

BTW, perhaps this method requires much more memory to store the intermediate partial derivatives.

1 reply

Aug '20

goldpiggy

Great discussion @Anand_Ram and @RaphaelCS ! As you point out the memory may explode if we save all the immediate gradient. Besides, some of the saved gradients may not require for calculation. As a result, we usually do a full pass of forward propagation and then backpropagate.

Jan '21 ▶ goldpiggy

six

This link was very interesting in explaining some of mxnet’s efficiency results.

Jan '21

six

4.7.2.2

dJ/db_2 =
    prod(dJ/dL, dL/do, do/db_2)
    
    prod(1, 2(o - y), 1)

    2(o - y)

dJ/db_1 = 
    prod(dJ/dL, dL/do, do/dh, dh/dz, dz/db_1)    

    prod(1, 2(o - y), w_2, phi'(z), 1)

    prod(2(o - y), w_2, phi'(z))

Do these bias terms look ok?

1 reply

Feb '21 ▶ six

JH.Lam

seems correct in case of softmax regr.
but where does the coefficient ‘2’ above come from? @six

1 reply

Feb '21 ▶ goldpiggy

JH.Lam

all figures are missed at that link, is it the problem w.r.t. my network env?

Feb '21 ▶ JH.Lam

six

After looking at: https://math.stackexchange.com/questions/2792390/derivative-of-euclidean-norm-l2-norm

I realize:

d/do l2_norm(o - y)

d/do ||o - y||^2

d/do 2(y - o)

Here’s a good intuitive explanation as well from the top answer at the above link (x is o here):
“points in the direction of the vector away from y towards x: this makes sense, as the gradient of ∥y−x∥2 is the direction of steepest increase of ∥y−x∥2, which is to move x in the direction directly away from y.”

1 reply

Feb '21 ▶ six

JH.Lam

but i wonder know that why you add l2 on (o-y)? i think (o-y) is the derivative of ∂l/∂o while L2 is used in the bias whic like ||W||^2,so i think it’s a constant item .thanks

May '21

oat

I can understand equation 4.7.12, as shown below, and the order of the elements in the matrix multiplication makes sense, i.e. the derivative of J w.r.t. h (gradient of h) is the product of the transpose of the derivative of o w.r.t. h (which is W2 in this case) and the derivative of J w.r.t. o (i.e. the gradient of o).
我可以理解等式 4.7.12，也就是说，损失J对h的偏导数，也即h的梯度，等于h对于o的偏导数的转置与损失对于o的偏导数，也即o的梯度的乘积。

This is consistent with what Roger Grosse explained in his course that backprop is like passing error message, and the derivative of the loss w.r.t. a parent node/variable = SUM of the products between the transpose of the derivative of each of its child nodes/variables w.r.t. the parent (i.e. the transpose of the Jacobian matrix) and the derivative of the loss w.r.t. each of its children.
这与Roger Grosse的课程里对反向传播的解释是一致的，也就是说，反向传播是在传递误差信息，而损失对于当前父变量的导数，也即该父变量的梯度，等于所有其子变量对父变量的导数的转置，也即雅可比矩阵的转置与损失对于相应子变量的导数，也即子变量的梯度的乘积之总和。

However, this conceptual logic is not followed in equations 4.7.11 and 4.7.14 in which the “transpose” part is put after the “partial derivative part”, rather than in front of it. This could lead to problem in dimension mismatching for matrix multiplication in code implementation.
但似乎这一概念逻辑在等式 4.7.11和4.7.14中并没有被遵从，其中的雅可比矩阵转置的部分被放在了损失对子变量的导数的后面。这会导致在矩阵乘积运算中维度不匹配的问题。

Could the authors kindly explain if this is a typo, or the order in matrix multiplication as shown in the equations doesn’t matter because it is only “illustrative”? Thanks.
所以，可否请作者们解释一下，这是否是文字排版错误，抑或这个“顺序”在这些作为概念演示的等式里并不重要？谢谢。

5 replies

Jul '21

fanbyprinciple

Exercises and my silly answers

Assume that the inputs X to some scalar function f are n × m matrices. What is the dimensionality of the gradient of f with respect to X?

Dimensionality reamains the same I tried be look at X.grad values.

Add a bias to the hidden layer of the model described in this section (you do not need to

include bias in the regularization term).

Draw the corresponding computational graph.
Derive the forward and backward propagation equations.

given in the pic. There is almost no change.

Compute the memory footprint for training and prediction in the model described in this

section.

If we are to assume that X belongs to R^d then , you would need to store dl/do, dh/dz, X, lambda, W(1), W(2), h, X.grad so considering X size is d then 4d + 4. Based on gradients calculated above. Rest all can be derived.

Assume that you want to compute second derivatives. What happens to the computational

graph? How long do you expect the calculation to take?

dont know how to come up with a solution to this

Assume that the computational graph is too large for your GPU.
1. Can you partition it over more than one GPU?
- maybe computational grapha can be devided but there needs to be tmiing pipeline else one part would wait for the other. Maybe different batchwise the multiple GPU can be recruited.
1. What are the advantages and disadvantages over training on a smaller minibatch?
- smaller minibatch takes more time and less memory and vice versa. Its a time space tradeoff.

Jul '21 ▶ oat

fanbyprinciple

I cannot answer your question but I am in awe how rigorously you have asked your question!

Feb '22 ▶ oat

rafaelloo

Feb '22 ▶ oat

rafaelloo

Feb '22 ▶ oat

rafaelloo

I wrote it as gently as I could. Hope it helps you with your struggle with mathematical derivation.

May '22

Konstatin_Nedovodeev

As far as I can see the notation - do / dW(2) in (4.7.11) is a vector-by-matrix derivative. Where can I read about how do you interpret such a notation? The wikipedia article (" Matrix calculus") says “These are not as widely considered and a notation is not widely agreed upon.” in a paragraph “Other matrix derivatives”. Tried other sources, haven’t found anything.

Jul '22

n * m;
2.1 insert b^(1)->(+) between (x)->z; insert b^(2)->(+) between (x)->o;
2.2
forward: pass;
backward:
partial J/ partial W^(1) and p J / pl W^(2) remain the same;
p J / p b^(2) = p J / p o, p J / p b^(1) = p J / p z;
Forward: only need preserve the current variables;
backward: preserve almost all intermediate variables;
more complex computational graph? more time?
5.1 yes; communicate with mpi or other things like;
5.2 buy space with time.

22 Jun

TheVikingLord

If folks still confused take a look at

Andreas_Terzis

StevenJokes

RaphaelCS

StevenJokes

goldpiggy

RaphaelCS

skywalker_H

StevenJokes

skywalker_H

StevenJokes

skywalker_H

Anand_Ram

RaphaelCS

goldpiggy

six

six

JH.Lam

JH.Lam

six

JH.Lam

oat

fanbyprinciple

Exercises and my silly answers

fanbyprinciple

rafaelloo

rafaelloo

rafaelloo

Konstatin_Nedovodeev

justinchuby

pandalabme

pwnPH0fun

zhang2023-byte

TheVikingLord