# Forward Propagation, Backward Propagation, and Computational Graphs

But I wonder why you add the L2 term to (o − y)? I think (o − y) is the derivative ∂l/∂o, while the L2 penalty acts on the weights, like ||W||^2, so I thought its contribution would be a constant term. Thanks.

I can understand equation 4.7.12, as shown below, and the order of the elements in the matrix multiplication makes sense, i.e. the derivative of J w.r.t. h (gradient of h) is the product of the transpose of the derivative of o w.r.t. h (which is W2 in this case) and the derivative of J w.r.t. o (i.e. the gradient of o).

This is consistent with what Roger Grosse explained in his course: backprop is like passing error messages, and the derivative of the loss w.r.t. a parent node/variable is the SUM, over its child nodes/variables, of the product of the transpose of each child's derivative w.r.t. the parent (i.e., the transpose of the Jacobian matrix) and the derivative of the loss w.r.t. that child.

However, this conceptual logic is not followed in equations 4.7.11 and 4.7.14, in which the "transpose" part is placed after the "partial derivative" part rather than in front of it. This could lead to dimension mismatches in the matrix multiplication when implemented in code.

Could the authors kindly clarify whether this is a typo, or whether the order of the matrix multiplication shown in these equations doesn't matter because it is only "illustrative"? Thanks.
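A quick shape check makes the point concrete. The sketch below uses illustrative sizes of my own choosing (h in R^5, o in R^3), not values from the book:

```python
import numpy as np

# Illustrative sizes (my assumption): hidden h in R^5, output o in R^3
h_dim, q_dim = 5, 3
W2 = np.random.randn(q_dim, h_dim)   # W(2) maps h to o, shape (3, 5)
dJ_do = np.random.randn(q_dim)       # gradient of the loss w.r.t. o, shape (3,)

# Order as in eq. 4.7.12: transpose of the Jacobian, then the upstream gradient
dJ_dh = W2.T @ dJ_do                 # shape (5,), matching h
```

With these shapes, reversing the order (`W2 @ dJ_do`) raises a shape error, which is exactly the dimension-mismatch concern raised above.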


### Exercises and my silly answers

1. Assume that the inputs X to some scalar function f are n × m matrices. What is the dimensionality of the gradient of f with respect to X?
• The dimensionality remains the same (n × m); I verified this by looking at the X.grad values.
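To make the shape claim concrete, here is a minimal sketch using a function whose gradient is known analytically (the function f and the sizes are my own illustrative choices):

```python
import numpy as np

# f(X) = sum of X_ij^2 is a scalar function of an n x m matrix X
n, m = 4, 3
X = np.random.randn(n, m)
f = (X ** 2).sum()   # a scalar
grad = 2 * X         # analytic gradient df/dX: one entry per entry of X
```

The gradient has one entry per entry of X, so it is again an n × m matrix.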
2. Add a bias to the hidden layer of the model described in this section (you do not need to include the bias in the regularization term).

   1. Draw the corresponding computational graph.

   2. Derive the forward and backward propagation equations.

   • Given in the picture; there is almost no change.
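For what it's worth, here is a sketch of the modified passes in code. The layer sizes, the ReLU activation, and the stand-in upstream gradient are all my own illustrative assumptions:

```python
import numpy as np

d, h_dim, q = 4, 5, 3                 # illustrative layer sizes
x = np.random.randn(d)
W1, b1 = np.random.randn(h_dim, d), np.random.randn(h_dim)
W2 = np.random.randn(q, h_dim)

# Forward: the only change is the bias added to the hidden pre-activation
z = W1 @ x + b1
h = np.maximum(z, 0)                  # phi = ReLU
o = W2 @ h

# Backward: since dz/db1 is the identity, dJ/db1 equals dJ/dz
dJ_do = np.random.randn(q)            # stand-in upstream gradient
dJ_dh = W2.T @ dJ_do
dJ_dz = dJ_dh * (z > 0)               # elementwise through ReLU
dJ_db1 = dJ_dz                        # no lambda term: b1 is not regularized
```

The only changes are the `+ b1` in the forward pass and the extra gradient `dJ_db1`, which picks up no λ term because b1 is excluded from the regularizer.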

3. Compute the memory footprint for training and prediction in the model described in this section.

   • If we assume that X belongs to R^d, then you would need to store ∂l/∂o, ∂h/∂z, X, λ, W(1), W(2), h, and X.grad, so with X of size d that is roughly 4d + 4, based on the gradients calculated above. Everything else can be derived.
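A rough float count, with illustrative layer sizes of my own choosing (biases ignored, matching the section's model):

```python
# Illustrative sizes (my assumption), e.g. for flattened 28x28 inputs
d, h, q = 784, 256, 10

params = d * h + h * q                # W(1) and W(2)
acts = d + h + h + q                  # x, z, h, o cached for the backward pass
grads = params                        # one gradient per parameter

train_floats = params + acts + grads  # training keeps parameters, activations, gradients
predict_floats = params + q           # prediction needs the weights plus the output
                                      # (intermediates can be discarded as soon as used)
```

The main asymmetry is that training must cache the intermediate activations for the backward pass and hold one gradient per parameter, while prediction can discard intermediates immediately.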
4. Assume that you want to compute second derivatives. What happens to the computational graph? How long do you expect the calculation to take?

   • I don't know how to come up with a solution to this.
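One way to build intuition is a finite-difference sketch (this is not how autograd computes second derivatives; frameworks instead append the backward pass's own operations to the graph and differentiate it again, roughly doubling the graph per derivative order, but the cost scaling is similar):

```python
import numpy as np

def f(w):
    # A simple scalar function with a known Hessian (my own example)
    return (w ** 2).sum() + (w ** 3).sum()

def grad(w, eps=1e-6):
    # Central-difference gradient: one pair of f-evaluations per entry of w
    g = np.zeros_like(w)
    for i in range(w.size):
        e = np.zeros_like(w)
        e[i] = eps
        g[i] = (f(w + e) - f(w - e)) / (2 * eps)
    return g

def hessian(w, eps=1e-4):
    # Differentiating the gradient itself: n extra gradient evaluations,
    # so second derivatives cost roughly n times a first-order pass
    n = w.size
    H = np.zeros((n, n))
    for i in range(n):
        e = np.zeros_like(w)
        e[i] = eps
        H[:, i] = (grad(w + e) - grad(w - e)) / (2 * eps)
    return H

w = np.array([1.0, 2.0])
H = hessian(w)   # analytically diag(2 + 6*w) = [[8, 0], [0, 14]]
```

Here the Hessian needs n gradient evaluations on top of the gradient itself, mirroring the extra cost of running backprop through the enlarged graph.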
5. Assume that the computational graph is too large for your GPU.

   1. Can you partition it over more than one GPU?

   • Maybe the computational graph can be divided, but there needs to be a timing pipeline, otherwise one part would wait for the other. Alternatively, multiple GPUs could be recruited batch-wise.

   2. What are the advantages and disadvantages over training on a smaller minibatch?

   • A smaller minibatch takes more time and less memory, and vice versa. It's a time-space tradeoff.