Numerical Stability and Initialization

Dear all, may I ask why we want to keep the variance fixed? In other words, how does keeping the variance fixed help with the vanishing and exploding gradients issue? Thanks.

Hi @Gavin, great question! If the weights are too small or too large, their gradients will be problematic, as we elaborate here. Let me know if it is not clear enough.
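To make the point about weight scale concrete, here is a minimal NumPy sketch (my own illustration, not code from the book). It pushes a random vector through a stack of purely linear layers and reports the final root-mean-square value. The function name `forward_scale`, the depth, and the width are all assumptions chosen for the demo. The backward pass multiplies by the transposes of the same matrices, so gradients should scale in a similar way.

```python
import numpy as np

def forward_scale(weight_std, num_layers=50, width=256, seed=0):
    """Push a random vector through `num_layers` linear layers; return its final RMS."""
    rng = np.random.default_rng(seed)
    h = rng.normal(size=width)
    for _ in range(num_layers):
        # Each layer has i.i.d. zero-mean Gaussian weights with the given std.
        W = rng.normal(scale=weight_std, size=(width, width))
        h = W @ h
    return float(np.sqrt(np.mean(h ** 2)))

small = forward_scale(0.01)                   # per-layer factor < 1: signal vanishes
large = forward_scale(1.0)                    # per-layer factor > 1: signal explodes
balanced = forward_scale(1 / np.sqrt(256))    # n_in * sigma^2 = 1: scale roughly kept
print(small, large, balanced)
```

With `n_in * sigma^2 = 1` the variance is kept roughly constant from layer to layer, which is why fixing the variance keeps the product of many layers from shrinking or growing geometrically.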


Hi, awesome and detailed explanation of the numerical stability concept! I have one question though: isn’t Xavier initialization outdated, since the tanh activation function was used when it was created? Isn’t He initialization better suited to the ReLU activation function mentioned here? Thanks in advance.
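One way to look at this question empirically is the sketch below (an illustration under my own assumptions, not an official answer). It compares the Xavier standard deviation `sqrt(2 / (n_in + n_out))` with the He standard deviation `sqrt(2 / n_in)` in a deep stack of ReLU layers and tracks the mean squared activation. The helper names `relu_signal`, `xavier_std`, and `he_std` are hypothetical.

```python
import numpy as np

def relu_signal(std_fn, num_layers=30, width=512, seed=0):
    """Return the mean squared activation after `num_layers` ReLU layers."""
    rng = np.random.default_rng(seed)
    h = rng.normal(size=width)
    for _ in range(num_layers):
        W = rng.normal(scale=std_fn(width, width), size=(width, width))
        h = np.maximum(W @ h, 0.0)  # ReLU zeroes roughly half of the units
    return float(np.mean(h ** 2))

def xavier_std(n_in, n_out):
    return np.sqrt(2.0 / (n_in + n_out))

def he_std(n_in, n_out):
    return np.sqrt(2.0 / n_in)

xavier_out = relu_signal(xavier_std)
he_out = relu_signal(he_std)
print(xavier_out, he_out)
```

In this setup the Xavier-scaled signal roughly halves at every layer because ReLU discards half of the pre-activation mass, while the He scale compensates with the extra factor of 2.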

Exercises and my silly answers

  1. Can you design other cases where a neural network might exhibit symmetry requiring breaking besides the permutation symmetry in an MLP’s layers?
  • loss and regularisation
  1. Can we initialize all weight parameters in linear regression or in softmax regression to the same value?
  • We can try it, but it may lead to a symmetry condition.
  1. Look up analytic bounds on the eigenvalues of the product of two matrices. What does this tell you about ensuring that gradients are well conditioned?
  • The largest eigenvalue (in absolute value) of a symmetric matrix is equal to its matrix norm. Say your two matrices are A and B. Then, because the matrix norm is submultiplicative,

||AB|| <= ||A|| ||B|| = λ1,A λ1,B

where λ1,A is the largest eigenvalue of A and λ1,B is the largest eigenvalue of B. So the largest eigenvalue of the product is upper-bounded by the product of the largest eigenvalues of the two matrices. For a proof of what I just asserted, see: Norm of a symmetric matrix equals spectral radius

In terms of the smallest, it looks like the product of the smallest two eigenvalues also gives you a lower bound on the smallest eigenvalue of the product. For a complete reference on how the eigenvalues are related, see:

  • Condition number (l) = largest eigenvalue divided by smallest eigenvalue.

If l >> 1, we say the matrix is ill-conditioned.

If l ≈ 1, it is well-conditioned.
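The eigenvalue bounds and the condition number in this answer can be checked numerically. The sketch below is my own and assumes symmetric positive-definite matrices (built by the hypothetical helper `random_spd`), since that is the case where eigenvalues, singular values, and the 2-norm line up cleanly.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_spd(n):
    """Random symmetric positive-definite matrix (an assumption made for this check)."""
    M = rng.normal(size=(n, n))
    return M @ M.T + 0.1 * np.eye(n)

A, B = random_spd(5), random_spd(5)
lam_A = np.linalg.eigvalsh(A)            # eigenvalues in ascending order
lam_B = np.linalg.eigvalsh(B)
# AB is similar to an SPD matrix, so its eigenvalues are real and positive.
lam_AB = np.linalg.eigvals(A @ B).real

upper = lam_A[-1] * lam_B[-1]            # product of the largest eigenvalues
lower = lam_A[0] * lam_B[0]              # product of the smallest eigenvalues
cond_A = np.linalg.cond(A)               # 2-norm condition number
print(lower, lam_AB.min(), lam_AB.max(), upper, cond_A)
```

For general (non-symmetric or rectangular) matrices the same check would have to use singular values instead of eigenvalues.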

  1. If we know that some terms diverge, can we fix this after the fact? Look at the paper on layerwise adaptive rate scaling for inspiration (You, Gitman, and Ginsburg, 2017).
  • LARS uses a separate learning rate for each layer.
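A rough sketch of the layer-wise scaling idea, as I understand it from the LARS paper: each layer gets a local rate proportional to the ratio of its weight norm to its gradient norm. Momentum is left out to keep it short, and the names `lars_local_lr`, `lars_sgd_step`, `trust_coef`, and `eps` are my own, so treat this as an illustration rather than the reference algorithm.

```python
import numpy as np

def lars_local_lr(w, grad, trust_coef=0.001, weight_decay=0.0, eps=1e-12):
    """Layer-wise rate: trust_coef * ||w|| / (||grad|| + weight_decay * ||w||)."""
    w_norm = np.linalg.norm(w)
    g_norm = np.linalg.norm(grad)
    # eps only guards against division by zero; it is an addition for this sketch.
    return trust_coef * w_norm / (g_norm + weight_decay * w_norm + eps)

def lars_sgd_step(params, grads, global_lr=1.0, trust_coef=0.001):
    """One plain SGD step (no momentum, no weight decay) with a per-layer rate."""
    return [w - global_lr * lars_local_lr(w, g, trust_coef) * g
            for w, g in zip(params, grads)]

rng = np.random.default_rng(0)
params = [rng.normal(size=(8, 8)), rng.normal(size=(8, 8))]
# One layer with tiny gradients and one with huge gradients.
grads = [1e-6 * rng.normal(size=(8, 8)), 1e6 * rng.normal(size=(8, 8))]

new_params = lars_sgd_step(params, grads)
rel_updates = [np.linalg.norm(n - w) / np.linalg.norm(w)
               for n, w in zip(new_params, params)]
print(rel_updates)
```

Both layers move by the same relative amount even though their gradients differ by twelve orders of magnitude, which is how this kind of scaling can tame a diverging layer after the fact.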

Why is this conclusion true? Formula 4.8.4 indicates that the variance of the outputs depends on the number of inputs.

In the book, it says ‘the sigmoid’s gradient vanishes both when its inputs are large and when they are small … Consequently, ReLUs, which are more stable (but less neurally plausible), have emerged as the default choice for practitioners.’
It seems that sigmoid can cause a lot of problems, so we should avoid using it. My question is: under what conditions is sigmoid a better choice? Any hints would be appreciated. Thanks.

May I ask a question about the symmetry problem? I’m wondering why minibatch stochastic gradient descent would not break this symmetry but dropout regularization would. In my opinion, dropout would not work if the output weights have the same value for every unit of the hidden layer. Am I correct?

From what I can gather, the reason dropout would work is that at different steps of the optimization different weights are active (according to the mask). Although during a single step the active weights are modified in the same way, by the time we draw the next mask we have already introduced differences between the weights in the previous step. These differences add up each time we change the mask, breaking the symmetry.
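The argument above can be tried out on a toy problem. This is a small NumPy sketch under my own assumptions (a one-hidden-layer tanh MLP, squared loss, every weight initialized to 0.5, and a per-example dropout mask). The names `train` and `spread` are hypothetical.

```python
import numpy as np

def train(use_dropout, steps=50, hidden=4, p_drop=0.5, lr=0.1, seed=0):
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(64, 3))
    y = np.sin(X.sum(axis=1, keepdims=True))
    W1 = np.full((3, hidden), 0.5)   # every hidden unit starts identical
    W2 = np.full((hidden, 1), 0.5)
    for _ in range(steps):
        idx = rng.choice(len(X), size=16, replace=False)   # minibatch
        xb, yb = X[idx], y[idx]
        h = np.tanh(xb @ W1)
        if use_dropout:
            # Inverted dropout: a fresh random mask on every step.
            mask = (rng.random(h.shape) > p_drop) / (1 - p_drop)
        else:
            mask = np.ones_like(h)
        hd = h * mask
        err = hd @ W2 - yb                        # gradient of 0.5 * squared error
        gW2 = hd.T @ err / len(xb)
        gh = (err @ W2.T) * mask * (1 - h ** 2)   # backprop through dropout and tanh
        gW1 = xb.T @ gh / len(xb)
        W1 -= lr * gW1
        W2 -= lr * gW2
    return W1, W2

def spread(W1):
    """Largest difference between hidden units' incoming weights."""
    return float((W1.max(axis=1) - W1.min(axis=1)).max())

spread_sgd = spread(train(use_dropout=False)[0])
spread_dropout = spread(train(use_dropout=True)[0])
print(spread_sgd, spread_dropout)
```

Minibatch sampling changes which examples are seen, but every hidden unit still receives the same gradient, so the units stay (numerically) identical. The dropout mask differs per unit, so the gradients differ and the units drift apart.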

This conclusion is true if and only if Xavier initialization is used.


First, this is a great section introducing initialization in DL. However, I found a mistake in the derivation of formula 5.4.4. The result is correct, but there is a mistake in an intermediate step. Specifically, in the variance, Var(o_i) = \sum E(w^2 x^2) is not equal to \sum E(w^2) E(x^2). It should be

Var(o_i) = \sum E(w^2 x^2)
= \sum (Var(wx) + (E(wx))^2)   (due to Var(xy) = E(x^2 y^2) - (E(xy))^2)
= \sum Var(wx)   (due to E(wx) = 0)
= \sum Var(x) Var(w)
= n \sigma^2 \gamma^2.

Oh, never mind. Var(x)Var(w) = [E(x^2) - E(x)^2][E(w^2) - E(w)^2] = E(x^2)E(w^2) because E(x) = E(w) = 0. The formula is correct. Ignore what I said, lol. Thank you again for the awesome section!
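For anyone who wants to sanity-check the result n \sigma^2 \gamma^2 numerically, here is a small Monte Carlo sketch. It assumes zero-mean Gaussian weights and inputs that are independent of each other; the specific values of `n_in`, `sigma`, `gamma`, and `trials` are arbitrary choices of mine.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, sigma, gamma, trials = 100, 0.1, 2.0, 20_000

w = rng.normal(scale=sigma, size=(trials, n_in))   # E[w] = 0, Var[w] = sigma^2
x = rng.normal(scale=gamma, size=(trials, n_in))   # E[x] = 0, Var[x] = gamma^2
o = (w * x).sum(axis=1)                            # one output o_i per trial

empirical = float(o.var())
predicted = n_in * sigma ** 2 * gamma ** 2
print(empirical, predicted)
```

The empirical variance should land close to the predicted value, up to sampling noise.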

var(||X_in||) = 1 in this case, so it ensures that the gradient neither grows nor shrinks (in expectation; in reality this is probably still not good enough).

Hi, thanks for this awesome section.

Regarding Equation 5.4.4: I agree with the result, but maybe I am overlooking something in the following step:

E[o_i^2] - (E[o_i])^2 = \sum_{j=1}^{n_{in}} E[w_{ij}^2 x_j^2] - 0

I rather think that this (not equivalent) equality is the true one:
E[o_i^2] - (E[o_i])^2 = E[(\sum_{j=1}^{n_{in}} w_{ij} x_j)^2] - 0

However, from here I don’t immediately see how to achieve the result. An alternative derivation here achieves that without this concern of mine.

Any clarification much appreciated :slight_smile:
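One possible way to bridge the two expressions, assuming (as the section does) that the weights have zero mean and are independent of each other and of the inputs: expand the square and note that every cross term with j ≠ k factorizes and vanishes.

```latex
\begin{aligned}
E[o_i^2]
  &= E\Big[\Big(\sum_{j=1}^{n_{in}} w_{ij} x_j\Big)^2\Big]
   = \sum_{j=1}^{n_{in}} \sum_{k=1}^{n_{in}} E[w_{ij} w_{ik} x_j x_k] \\
  &= \sum_{j=1}^{n_{in}} E[w_{ij}^2 x_j^2]
   + \sum_{j \neq k} E[w_{ij}]\, E[w_{ik}]\, E[x_j x_k] \\
  &= \sum_{j=1}^{n_{in}} E[w_{ij}^2 x_j^2]
  \qquad \text{since } E[w_{ij}] = 0 .
\end{aligned}
```

So the two right-hand sides agree under these assumptions, even though they are not equal for arbitrary random variables.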

Thus we are susceptible to the same problems of numerical underflow that often crop up when multiplying together too many probabilities.

Why are we only concerned about underflow here? Couldn’t overflow also be an issue to the same extent?
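Both failure modes are easy to reproduce. The sketch below is my own illustration with arbitrary numbers: it multiplies 100 factors in float32 and compares that with summing logarithms.

```python
import numpy as np

n = 100
small = np.full(n, 0.01, dtype=np.float32)
large = np.full(n, 100.0, dtype=np.float32)

# Silence the floating-point warnings that the products below may raise.
with np.errstate(over="ignore", under="ignore"):
    prod_small = np.prod(small)   # true value 1e-200 is below the float32 range
    prod_large = np.prod(large)   # true value 1e+200 is above the float32 range

# Working in log space keeps both quantities representable.
log_small = np.sum(np.log(small))
log_large = np.sum(np.log(large))
print(prod_small, prod_large, log_small, log_large)
```

So overflow can indeed bite in the same way when the factors are larger than 1; with probabilities the factors are at most 1, which may be why the text singles out underflow.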

Unfortunately, our problem above is more serious: initially the matrices M^(l) may have a wide variety of eigenvalues. They might be small or large, and their product might be very large or very small.

I suppose we should be concerned here with singular values, as for a general matrix (not necessarily square), it is the singular values that characterize the stretching/shrinking effect a matrix has on a vector. Am I missing something?
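A tiny example of the distinction raised here (my own illustration): a matrix whose eigenvalues are all zero can still stretch a vector by a factor of 100, and the singular values capture that.

```python
import numpy as np

M = np.array([[0.0, 100.0],
              [0.0,   0.0]])

eigvals = np.linalg.eigvals(M)                  # both eigenvalues are 0
singvals = np.linalg.svd(M, compute_uv=False)   # singular values, largest first

v = np.array([0.0, 1.0])
stretch = np.linalg.norm(M @ v) / np.linalg.norm(v)   # how much M stretches v

# A rectangular matrix has no eigenvalues at all, but it still has singular values.
R = np.ones((3, 2))
singvals_R = np.linalg.svd(R, compute_uv=False)
print(eigvals, singvals, stretch, singvals_R)
```

For symmetric matrices the singular values are the absolute values of the eigenvalues, so the two views coincide in that special case.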

My solutions to the exercises: 5.4