For tied parameters (link), why is the gradient the sum of the gradients of the two layers? I was thinking it would be the product of the gradients of the two layers. Reasoning:
y = f(f(x))
dy/dx = f'(f(x)) * f'(x), where x is a vector denoting the shared parameters.
(Cross-posting from the D2L PyTorch forum, since the question does not really have anything to do with PyTorch.)
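Hi @ganeshk, fantastic question! Even though it is not intuitively obvious, the gradients are combined with a “sum” rather than a “product”: each place where the shared parameter is used contributes its own gradient term, and by the multivariable chain rule those terms are added. This matches how we learn a convolution kernel, whose weights are reused at every position of the input. Check this tutorial for more details.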
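To make the "sum, not product" point concrete, here is a minimal sketch in plain Python. It assumes a toy scalar layer f(x; w) = tanh(w * x) applied twice with the same weight w; the layer, the helper names, and the numbers are illustrative assumptions, not code from D2L. The product from the question does show up, but only along the path through the layers (inside the inner term); across the two uses of w, the terms are added.

```python
import math

# Hypothetical toy layer: a scalar tanh unit with weight w.
def f(x, w):
    return math.tanh(w * x)

def df_dx(x, w):
    # Partial derivative of f with respect to its input x.
    return w * (1.0 - math.tanh(w * x) ** 2)

def df_dw(x, w):
    # Partial derivative of f with respect to its weight w.
    return x * (1.0 - math.tanh(w * x) ** 2)

def tied_output(x, w):
    # Two layers sharing the same weight: y = f(f(x; w); w).
    return f(f(x, w), w)

def per_layer_grads(x, w):
    # Treat the two uses of w as separate weights w1 (inner) and w2 (outer),
    # differentiate with respect to each, then evaluate at w1 = w2 = w.
    h = f(x, w)
    grad_inner = df_dx(h, w) * df_dw(x, w)  # dy/dw1: chain rule through the outer layer
    grad_outer = df_dw(h, w)                # dy/dw2: direct dependence of the outer layer
    return grad_inner, grad_outer

def numerical_grad(x, w, eps=1e-6):
    # Central finite difference of the tied function with respect to w.
    return (tied_output(x, w + eps) - tied_output(x, w - eps)) / (2 * eps)

x, w = 0.7, 1.3
g_inner, g_outer = per_layer_grads(x, w)
g_sum = g_inner + g_outer
g_prod = g_inner * g_outer
g_num = numerical_grad(x, w)

print("sum of per-layer grads:    ", g_sum)
print("product of per-layer grads:", g_prod)
print("finite-difference gradient:", g_num)
```

If the sketch is right, the sum should agree with the finite-difference gradient to within numerical error, while the product should not.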
This is helpful. Thanks. I suppose a product would be more likely to lead to problems like vanishing gradients. The sum should be more robust to that.
Great, @ganeshk! As you may understand now, theoretical intuition needs to be backed up by practical experiments. Good luck!