Numerical Stability and Initialization

Dear all, may I know why we want to keep the variance fixed? In other words, how does keeping the variance fixed help to solve the vanishing and exploding gradients issue? Thanks.

Hi @Gavin, great question! If weights are too small or too large, their gradients will be problematic, as we elaborate here. Let me know if it is not clear enough.
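
In case a quick experiment helps, here is a minimal sketch (not code from the book) of what happens to the scale of the signal in a deep stack of purely linear layers. The function name `activation_std_after` and the chosen width, depth, and standard deviations are my own assumptions for illustration. Because the layers are linear, the backward pass multiplies by the same kind of matrices, so the gradients should shrink or blow up in a similar way.

```python
import numpy as np

def activation_std_after(depth, width, weight_std, seed=0):
    """Push a random batch through `depth` linear layers; return the output std."""
    rng = np.random.default_rng(seed)
    h = rng.standard_normal((256, width))  # batch of 256 inputs with unit variance
    for _ in range(depth):
        # Zero-mean Gaussian weights with the requested standard deviation
        W = rng.standard_normal((width, width)) * weight_std
        h = h @ W
    return float(h.std())

width, depth = 128, 30  # illustrative values, not taken from the chapter

# Each layer multiplies the variance by roughly width * weight_std**2.
too_small = activation_std_after(depth, width, 0.01)                 # factor << 1
too_large = activation_std_after(depth, width, 1.0)                  # factor >> 1
balanced = activation_std_after(depth, width, np.sqrt(1.0 / width))  # factor ~ 1

print(f"std=0.01      -> output std {too_small:.3e}")
print(f"std=1.0       -> output std {too_large:.3e}")
print(f"std=1/sqrt(n) -> output std {balanced:.3e}")
```

Only the third choice keeps the per-layer variance factor close to 1, which is the sense in which we want the variance to stay fixed from layer to layer.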

Hi, awesome and detailed explanation of the numerical stability concept! I have one question though: isn’t the Xavier initialization outdated, since the tanh activation function was used during its creation? Isn’t the He initialization better suited to the ReLU activation function mentioned? Thanks in advance.
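
To make the question concrete, here is a rough sketch of the comparison I have in mind, assuming square layers (fan-in equal to fan-out) and Gaussian weights. The helper name `relu_rms_after` and the sizes are hypothetical, chosen only for this post. Xavier uses a weight variance of 2 / (fan_in + fan_out), while He uses 2 / fan_in.

```python
import numpy as np

def relu_rms_after(depth, width, weight_var, seed=0):
    """Run a random batch through `depth` ReLU layers; return the RMS activation."""
    rng = np.random.default_rng(seed)
    h = rng.standard_normal((256, width))
    for _ in range(depth):
        W = rng.standard_normal((width, width)) * np.sqrt(weight_var)
        h = np.maximum(h @ W, 0.0)  # ReLU zeroes the negative half of the signal
    return float(np.sqrt(np.mean(h ** 2)))

width, depth = 128, 30  # illustrative values

xavier_var = 2.0 / (width + width)  # Xavier: 2 / (fan_in + fan_out)
he_var = 2.0 / width                # He: 2 / fan_in

xavier_rms = relu_rms_after(depth, width, xavier_var)
he_rms = relu_rms_after(depth, width, he_var)

print(f"Xavier + ReLU: RMS activation {xavier_rms:.3e}")
print(f"He + ReLU:     RMS activation {he_rms:.3e}")
```

In this toy setting the Xavier variance roughly halves the second moment of the activations at every ReLU layer, whereas the He variance keeps it about constant, which is what prompts my question.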