Dear all, may I know why we want to keep the variance fixed? In other words, how does keeping the variance fixed help solve the vanishing and exploding gradients issue? Thanks.
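To make the question concrete, here is a small sketch of my own (the names and numbers are mine, not from the post): a deep stack of purely linear layers, comparing an initialization whose variance is matched to the fan-in against one that is only a factor of 2 larger.

```python
import numpy as np

# Hypothetical demo: push a signal through 50 linear layers and watch
# the output scale under two weight initializations.
rng = np.random.default_rng(0)
fan_in = 256
x = rng.standard_normal(fan_in)

def final_std(weight_std, depth=50):
    h = x.copy()
    for _ in range(depth):
        W = rng.standard_normal((fan_in, fan_in)) * weight_std
        h = W @ h
    return h.std()

# Var(w) = 1/fan_in keeps each layer's output variance equal to its
# input variance, so the scale stays roughly constant with depth.
stable = final_std(np.sqrt(1.0 / fan_in))

# Var(w) = 2/fan_in doubles the variance at every layer; over 50 layers
# that compounds to a factor of about 2**50, i.e. the signal explodes.
exploding = final_std(np.sqrt(2.0 / fan_in))

print(stable, exploding)
```

If I understand the argument correctly, gradients flow through the same chain of matrices in reverse, so the same multiplicative compounding applies to them, which is why a per-layer variance of exactly 1/fan-in (or its Xavier/He variants) matters.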
Hi, awesome and detailed explanation of the numerical stability concept! I have one question though: isn't Xavier initialization outdated, since it was derived with the tanh activation function in mind? Isn't He initialization better suited for the ReLU activation mentioned here? Thanks in advance.
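To illustrate what I mean, here is a quick sketch I put together (the helper names are my own): it measures the second moment of a ReLU layer's output under Xavier scaling, std = sqrt(2/(fan_in+fan_out)), versus He scaling, std = sqrt(2/fan_in).

```python
import numpy as np

rng = np.random.default_rng(0)
fan_in = fan_out = 512

xavier_std = np.sqrt(2.0 / (fan_in + fan_out))  # Glorot & Bengio
he_std = np.sqrt(2.0 / fan_in)                  # He et al., for ReLU

x = rng.standard_normal((10_000, fan_in))  # unit-variance inputs

def relu_second_moment(std):
    W = rng.standard_normal((fan_in, fan_out)) * std
    h = np.maximum(x @ W, 0.0)  # ReLU zeroes half the pre-activations
    return (h ** 2).mean()

# ReLU halves E[h^2]; He's extra factor of 2 compensates for this and
# keeps the second moment near 1, while under Xavier it shrinks each
# layer, which would compound in a deep ReLU network.
xavier_m2 = relu_second_moment(xavier_std)
he_m2 = relu_second_moment(he_std)
print(xavier_m2, he_m2)
```

So my understanding is that Xavier's derivation assumes a roughly linear (tanh-like near zero) activation, and He initialization is the ReLU-adapted version of the same variance-preservation idea. Is that right?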