Numerical Stability and Initialization

mli · May 31, 2020, 2:50am

http://d2l.ai/chapter_multilayer-perceptrons/numerical-stability-and-init.html

Gavin · August 31, 2020, 6:17am

Dear all, may I know why we want to keep the variance fixed? In other words, how does keeping the variance fixed help to solve the vanishing and exploding gradients issue? Thanks.

goldpiggy · September 1, 2020, 12:34am

Hi @Gavin, great question! If weights are too small or too large, their gradients will be problematic, as we elaborate here. Let me know if it is not clear enough.
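To make the point about weight scale concrete, here is a minimal NumPy sketch (an illustration, not code from the book). It assumes a stack of 50 purely linear layers of width 100 with Gaussian weights and no biases; the helper name `final_activation_std` and all sizes are hypothetical choices. In such a linear stack the backward pass multiplies by the transposed matrices, so gradients should scale in much the same way as the activations here.

```python
import numpy as np

def final_activation_std(weight_std, n_layers=50, width=100, seed=0):
    """Push a random input through n_layers linear layers; return the output std."""
    rng = np.random.default_rng(seed)
    h = rng.normal(size=width)
    for _ in range(n_layers):
        # Fresh Gaussian weight matrix with the requested standard deviation.
        W = rng.normal(scale=weight_std, size=(width, width))
        h = W @ h
    return h.std()

width = 100
# Per-layer variance gain is width * weight_std**2: 0.25, 4.0 and 1.0 below.
too_small = final_activation_std(0.5 / np.sqrt(width))
too_large = final_activation_std(2.0 / np.sqrt(width))
balanced = final_activation_std(1.0 / np.sqrt(width))

print(f"too small: {too_small:.3e}")
print(f"too large: {too_large:.3e}")
print(f"balanced:  {balanced:.3e}")
```

Only the setting whose per-layer variance gain is 1 should keep the signal at a usable magnitude, which is the motivation for keeping the variance fixed.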

bkoch · April 7, 2021, 8:29am

Hi, awesome and detailed explanation of the numerical stability concept! I have one question though: isn’t Xavier initialization outdated, since it was derived with the tanh activation function in mind? Isn’t He initialization better suited to the ReLU activation function mentioned here? Thanks in advance.
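A quick experiment can help with the Xavier versus He question. The sketch below is only an illustration under stated assumptions: Gaussian weights, 30 square ReLU layers of width 256, no biases. It uses a standard deviation of sqrt(2 / (n_in + n_out)) for Xavier and sqrt(2 / n_in) for He; the helper name `relu_rms_after` is hypothetical.

```python
import numpy as np

def relu_rms_after(n_layers, weight_std, width=256, seed=0):
    """Root-mean-square activation after n_layers of ReLU(W h)."""
    rng = np.random.default_rng(seed)
    h = rng.normal(size=width)
    for _ in range(n_layers):
        W = rng.normal(scale=weight_std, size=(width, width))
        h = np.maximum(W @ h, 0.0)  # ReLU
    return np.sqrt(np.mean(h ** 2))

width = 256
xavier_std = np.sqrt(2.0 / (width + width))  # Xavier with n_in = n_out = width
he_std = np.sqrt(2.0 / width)                # He, scaled by fan-in

xavier_rms = relu_rms_after(30, xavier_std, width)
he_rms = relu_rms_after(30, he_std, width)

print(f"Xavier, 30 ReLU layers: {xavier_rms:.3e}")
print(f"He,     30 ReLU layers: {he_rms:.3e}")
```

Under these assumptions the Xavier-scaled signal should shrink by roughly a factor of sqrt(2) per ReLU layer, because ReLU zeroes about half of the pre-activations, while the He-scaled signal should stay near its starting magnitude.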

fanbyprinciple · July 31, 2021, 1:42am

Exercises and my silly answers

Can you design other cases where a neural network might exhibit symmetry requiring breaking besides the permutation symmetry in an MLP’s layers?

loss and regularisation

Can we initialize all weight parameters in linear regression or in softmax regression to the same value?

We can try it, but it may lead to a symmetry problem.

Look up analytic bounds on the eigenvalues of the product of two matrices. What does this tell you about ensuring that gradients are well conditioned?

For a symmetric matrix, the largest eigenvalue in absolute value equals the spectral norm of the matrix. Say your two matrices are $A$ and $B$.

$$\|AB\| \le \|A\| \, \|B\| = |\lambda_{1,A}| \, |\lambda_{1,B}|$$

where $\lambda_{1,A}$ is the largest eigenvalue of $A$ in absolute value and $\lambda_{1,B}$ is the largest eigenvalue of $B$ in absolute value. So the largest eigenvalue of the product is upper-bounded by the product of the largest eigenvalues of the two matrices. For a proof of what I just asserted, see: Norm of a symmetric matrix equals spectral radius

In terms of the smallest, it looks like the product of the smallest two eigenvalues also gives you a lower bound on the smallest eigenvalue of the product. For a complete reference on how the eigenvalues are related, see: https://mathoverflow.net/questions/106191/eigenvalues-of-product-of-two-symmetric-matrices

Condition number (l) = largest eigenvalue magnitude divided by smallest eigenvalue magnitude.

If l >> 1, we say the matrix is ill-conditioned.

If l ≈ 1, it is well-conditioned.

If we know that some terms diverge, can we fix this after the fact? Look at the paper on layerwise adaptive rate scaling for inspiration :cite:You.Gitman.Ginsburg.2017.

LARS uses a separate learning rate for each layer.
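The norm bound and the condition number from the eigenvalue exercise above can be checked numerically. This is a small sketch assuming random symmetric 5×5 matrices; `random_symmetric` is a hypothetical helper, and the condition number is taken in the 2-norm (largest over smallest singular value), which for a symmetric matrix matches the ratio of eigenvalue magnitudes.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_symmetric(n):
    """Return a random symmetric n x n matrix."""
    M = rng.normal(size=(n, n))
    return (M + M.T) / 2

A = random_symmetric(5)
B = random_symmetric(5)

# Spectral norm (ord=2) is the largest singular value.
spec_A = np.linalg.norm(A, 2)
spec_B = np.linalg.norm(B, 2)
spec_AB = np.linalg.norm(A @ B, 2)

# For a symmetric matrix this equals the largest eigenvalue magnitude.
max_abs_eig_A = np.max(np.abs(np.linalg.eigvalsh(A)))

# 2-norm condition number: largest / smallest singular value.
cond_A = np.linalg.cond(A)

print(spec_AB <= spec_A * spec_B)   # submultiplicativity of the norm
print(spec_A, max_abs_eig_A)        # these two should agree
print(cond_A)
```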

Yiyun_Lu · August 17, 2021, 9:19am

Why is this conclusion true? Formula 4.8.4 indicates that the variance of the outputs depends on the number of inputs.

Dan · September 2, 2021, 3:23am

In the book, it says ‘the sigmoid’s gradient vanishes both when its inputs are large and when they are small … Consequently, ReLUs, which are more stable (but less neurally plausible), have emerged as the default choice for practitioners.’
It seems the sigmoid can cause lots of problems, so we should avoid using it. My question is: under what conditions is the sigmoid a better choice? Any hints would be appreciated. Thanks.

Dan · September 2, 2021, 4:04pm

May I ask a question about the symmetry problem? I’m wondering why minibatch stochastic gradient descent would not break this symmetry but dropout regularization would. In my opinion, dropout would not work if the output weights have the same value for every unit of the hidden layer. Am I correct?

Danail_Stoyanov · October 19, 2021, 1:37pm

From what I can gather, the reason dropout would work is that at different steps of the optimization you have different weights that are active (according to the mask). Although during a single step the active weights are modified in the same way, by the time we draw the next mask we have already introduced differences between the weights on the previous step. These differences add up each time we change the mask, breaking the symmetry.
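The argument above can be tried on a toy problem. The following is a sketch under my own assumptions, not the book's code: a one-hidden-layer ReLU network with a scalar output, squared loss, every weight initialized to 0.1, a synthetic target, and hand-written gradients. The names `train` and `spread` are hypothetical. With `drop_prob=0.0` it is plain minibatch SGD; with `drop_prob=0.5` it applies inverted dropout to the hidden layer.

```python
import numpy as np

def train(drop_prob, steps=50, lr=0.05, hidden=8, seed=0):
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(256, 4))
    y = np.sin(X.sum(axis=1))          # arbitrary synthetic target
    W1 = np.full((hidden, 4), 0.1)     # every hidden unit starts identical
    w2 = np.full(hidden, 0.1)
    for _ in range(steps):
        idx = rng.choice(len(X), size=32, replace=False)  # random minibatch
        xb, yb = X[idx], y[idx]
        Z = xb @ W1.T
        H = np.maximum(Z, 0.0)
        # Inverted dropout: zero units with probability drop_prob, rescale the rest.
        mask = (rng.random(H.shape) >= drop_prob) / (1.0 - drop_prob)
        Hd = H * mask
        err = Hd @ w2 - yb
        # Gradients of 0.5 * mean(err**2).
        grad_w2 = Hd.T @ err / len(xb)
        dZ = np.outer(err, w2) * mask * (Z > 0)
        grad_W1 = dZ.T @ xb / len(xb)
        W1 -= lr * grad_W1
        w2 -= lr * grad_w2
    return W1, w2

def spread(W1, w2):
    """Largest difference between any two hidden units' parameters."""
    return max((W1.max(axis=0) - W1.min(axis=0)).max(), w2.max() - w2.min())

sgd_spread = spread(*train(drop_prob=0.0))
dropout_spread = spread(*train(drop_prob=0.5))
print(f"minibatch SGD only: {sgd_spread:.3e}")
print(f"with dropout:       {dropout_spread:.3e}")
```

Minibatch sampling changes which examples are seen, but every hidden unit still receives the same gradient, so the units should stay identical up to rounding. The dropout mask differs per unit, so the units should drift apart.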

redwangwangwang · November 30, 2021, 12:23pm

This conclusion is true if and only if Xavier initialization is used.

Zekai_Wang · September 23, 2022, 2:17am

Hi,

First, this is a great section introducing initialization in DL. However, I found a mistake in the derivation for formula 5.4.4. The result is correct, but there is a mistake in the intermediate part. Specifically, in the variance, $\mathrm{Var}(o_i) = \sum E[w^2 x^2]$ is not equal to $\sum E[w^2] E[x^2]$. It should be

$$\mathrm{Var}(o_i) = \sum E[w^2 x^2] = \sum \left(\mathrm{Var}(wx) + (E[wx])^2\right) = \sum \mathrm{Var}(wx) = \sum \mathrm{Var}(x)\,\mathrm{Var}(w) = n \sigma^2 \gamma^2,$$

where the second equality uses $\mathrm{Var}(xy) = E[x^2 y^2] - (E[xy])^2$ and the third uses $E[wx] = 0$.

Zekai_Wang · September 24, 2022, 3:50pm

Oh, never mind. $\mathrm{Var}(x)\,\mathrm{Var}(w) = \left[E[x^2] - (E[x])^2\right]\left[E[w^2] - (E[w])^2\right] = E[x^2]\,E[w^2]$ because $E[x] = E[w] = 0$. The formula is correct. Ignore what I said lol. Thank you again for the awesome section!
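The result $n \sigma^2 \gamma^2$ can also be sanity-checked by simulation. This is a minimal Monte Carlo sketch assuming independent zero-mean Gaussian weights (variance $\sigma^2$) and inputs (variance $\gamma^2$); the specific values of `n_in`, `sigma`, `gamma`, and `trials` are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, sigma, gamma = 20, 0.3, 1.5
trials = 100_000

# Each row is one independent draw of the weights and inputs of a single output unit.
w = rng.normal(scale=sigma, size=(trials, n_in))
x = rng.normal(scale=gamma, size=(trials, n_in))
o = (w * x).sum(axis=1)            # o_i = sum_j w_ij * x_j

empirical_var = o.var()
predicted_var = n_in * sigma**2 * gamma**2

print(f"empirical: {empirical_var:.4f}")
print(f"predicted: {predicted_var:.4f}")
```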

yzzzzz · February 18, 2023, 7:52am

var(||X_in||) = 1 in this case, thus it ensures the gradient is neither increasing nor shrinking (from the expectation perspective, in reality probably still not good enough)

eroell · July 11, 2023, 9:09am

Hi, thanks for this awesome section.

Regarding Equation 5.4.4: I agree with the result, but maybe I am overlooking something in the following step:

$$E[o_i^2] - (E[o_i])^2 = \sum_{j=1}^{n_\mathrm{in}} E[w_{ij}^2 x_j^2] - 0$$

I rather think this (not equivalent) equality is true:

$$E[o_i^2] - (E[o_i])^2 = E\Big[\Big(\sum_{j=1}^{n_\mathrm{in}} w_{ij} x_j\Big)^2\Big] - 0$$

However, from here I don’t immediately see how to achieve the result. An alternative derivation here achieves that without this concern of mine.

Any clarification much appreciated
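One way to connect the two expressions, sketched here under the assumption that the weights $w_{ij}$ have zero mean and are independent of each other and of the inputs, is to expand the square and note that the cross terms vanish:

$$
\begin{aligned}
E\Big[\Big(\sum_{j=1}^{n_\mathrm{in}} w_{ij} x_j\Big)^2\Big]
&= \sum_{j=1}^{n_\mathrm{in}} \sum_{k=1}^{n_\mathrm{in}} E[w_{ij} w_{ik} x_j x_k] \\
&= \sum_{j=1}^{n_\mathrm{in}} E[w_{ij}^2 x_j^2] + \sum_{j \neq k} E[w_{ij}]\, E[w_{ik}]\, E[x_j x_k] \\
&= \sum_{j=1}^{n_\mathrm{in}} E[w_{ij}^2 x_j^2].
\end{aligned}
$$

The factorization in the second line uses independence, and the last step uses $E[w_{ij}] = 0$. If those assumptions hold, the two expressions agree even though they are not equal term by term in general.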

dingcurie · August 18, 2023, 12:57am

Thus we are susceptible to the same problems of numerical underflow that often crop up when multiplying together too many probabilities.

Why are we only concerned about underflow here? Couldn’t overflow also be an issue to the same extent?

dingcurie · August 18, 2023, 1:04am

Unfortunately, our problem above is more serious: initially the matrices M^(l) may have a wide variety of eigenvalues. They might be small or large, and their product might be very large or very small.

I suppose we should be concerned here with singular values, as for a general matrix (not necessarily square), it is the singular values that characterize the stretching/shrinking effect a matrix has on a vector. Am I missing something?
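A tiny example shows how far eigenvalues and singular values can differ. This is an illustrative sketch with a hand-picked 2×2 matrix and a random 3×5 matrix; none of it comes from the book.

```python
import numpy as np

# A nilpotent matrix: both eigenvalues are 0, yet it can stretch a vector 100x.
M = np.array([[0.0, 100.0],
              [0.0, 0.0]])

eigenvalues = np.linalg.eigvals(M)
singular_values = np.linalg.svd(M, compute_uv=False)

v = np.array([0.0, 1.0])
stretched = np.linalg.norm(M @ v)   # length of M v for a unit vector v

print("eigenvalues:    ", eigenvalues)
print("singular values:", singular_values)
print("|M v|:          ", stretched)

# A non-square matrix has no eigenvalues at all, but it still has singular values.
R = np.random.default_rng(0).normal(size=(3, 5))
rect_singular_values = np.linalg.svd(R, compute_uv=False)
print("3x5 singular values:", rect_singular_values)
```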

pandalabme · August 22, 2023, 1:07am

My solutions to the exs: 5.4

gaokai320 · June 15, 2024, 6:44am

What is the shape of $\frac{\partial \mathbf{o}}{\partial \mathbf{W}^{l}}$ in Equation 5.4.2? It also seems that the shape of the product of the $L-1$ matrices and the shape of the gradient vector $\mathbf{v}^{(l)}$ do not match.

cmou · October 11, 2024, 8:39am

It seems this formula was carried over unchanged from the old version of the book (4.8.2). I assume it should apply successive derivatives with respect to the layers’ weights, instead of derivatives with respect to their intermediate outputs h.

Nicolas_Victorion · October 12, 2024, 7:21am

In my opinion, $\partial_{W^{(l)}}o$ and $\partial_{W^{(l)}}h^{(l)}$ are both third-order tensors.
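The shapes can be checked on a single linear layer. This sketch assumes $h = W x$ with $W$ of shape (3, 4), which is a simplification of the book's setting; the variable names are hypothetical. The Jacobian of $h$ with respect to $W$ comes out as a third-order tensor, and contracting it with an upstream gradient vector gives something with the same shape as $W$.

```python
import numpy as np

rng = np.random.default_rng(0)
n_out, n_in = 3, 4
W = rng.normal(size=(n_out, n_in))
x = rng.normal(size=n_in)

# Analytic Jacobian of h = W x with respect to W: dh_i / dW_jk = delta_ij * x_k.
jacobian = np.einsum('ij,k->ijk', np.eye(n_out), x)
print(jacobian.shape)   # (n_out, n_out, n_in): a third-order tensor

# Finite-difference check of the same Jacobian.
eps = 1e-6
numeric = np.zeros((n_out, n_out, n_in))
for j in range(n_out):
    for k in range(n_in):
        W_shift = W.copy()
        W_shift[j, k] += eps
        numeric[:, j, k] = (W_shift @ x - W @ x) / eps

# Contracting an upstream gradient v (one entry per output) with the Jacobian
# gives a gradient shaped like W, equal to the outer product of v and x.
v = rng.normal(size=n_out)
grad_W = np.einsum('i,ijk->jk', v, jacobian)
print(grad_W.shape)
```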