http://d2l.ai/chapter_deep-learning-computation/parameters.html

For tied parameters (link), why is the gradient the sum of the gradients of the two layers? I was thinking it would be the product of the gradients of the two layers. Reasoning:

y = f(f(x))

dy/dx = fâ€™(f(x))*fâ€™(x) where x is a vector denoting the shared parameters.

(Cross posting from the D2L pytorch forum, since it does not really have anything to do with pytorch).

Hi @ganeshk, fantastic question! Even though it is not intuitively obvious, we design the operator by using â€śsumâ€ť rather than â€śproductâ€ť. That aligns with the idea how we learn a convolution kernel. Check this tutorial for more details.

This is helpful. Thanks. I suppose having a product is more likely to lead to problems like vanishing gradients. The sum should be more stable to that.

Great @ganeshk ! As you may understand now, theoretical intuition needs more practical experiments . Good luck!