For tied parameters (link), why is the gradient the sum of the gradients of the two layers? I was thinking it would be the product of the gradients of the two layers. Reasoning:

y = f(f(x))

dy/dx = fâ(f(x))*fâ(x) where x is a vector denoting the shared parameters.

(Cross posting from the D2L pytorch forum, since it does not really have anything to do with pytorch).

Hi @ganeshk, fantastic question! Even though it is not intuitively obvious, we design the operator by using âsumâ rather than âproductâ. That aligns with the idea how we learn a convolution kernel. Check this tutorial for more details.

This is helpful. Thanks. I suppose having a product is more likely to lead to problems like vanishing gradients. The sum should be more stable to that.

Great @ganeshk ! As you may understand now, theoretical intuition needs more practical experiments . Good luck!

Hi, is it possible to CustomInit weight parameter? Letâs say we have Con2D layer with size of 3x3 and in_channel = 1 = out_channel. I want to initialize weight to be np.arange(9). How to do it? Thank you.

Hi! Consider using `mxnet.init.Constant`

initializer:

```
>>> from mxnet import gluon, init, np, npx
>>> npx.set_np()
>>> net = gluon.nn.Sequential()
>>> net.add(gluon.nn.Dense(1))
>>> net
Sequential(
(0): Dense(-1 -> 1, linear)
)
>>> weights = np.ones(9)
>>> net[0].initialize(init=init.Constant(weights))
>>> output = net(np.random.uniform(size=(16, weights.shape[0])))
>>> net
Sequential(
(0): Dense(9 -> 1, linear)
)
```

Also, remember that you can use `load_parameters`

method to load the previously saved weights.

I have a doubt in the **custom initialization section** .

How is the above initialization is implemented in the code given in the book:

```
def my_init(m):
if type(m) == nn.Linear:
print(
"Init",
*[(name, param.shape) for name, param in
m.named_parameters()][0])
nn.init.uniform_(m.weight, -10, 10)
m.weight.data *= m.weight.data.abs() >= 5
net.apply(my_init)
net[0].weight[:2]
```

where is the probability applied in this code?

Hi, I am having problem understading the weight dimensions.

In Dense(8), shouldnât the weight matrix have 4 rows and 8 columns and in Dense(1), 8 rows and 1 column? As shape of weight matrix is always num_inputs(rows) and num_outputs(columns)

Please clarify

The probabilities are correct.

Recall that uniform distribution has a linear CDF: