https://d2l.ai/chapter_attention-mechanisms/attention-scoring-functions.html

For the batched dot product I don’t understand why we divide by sqrt(d) instead of d. In my eyes d perfectly takes out the problem of higher dimensions blowing up the dot product and the square root adds needless complexity. However, the first intuition is to use d, so there must be a reason for the sqrt. What am I missing here?

Hi @Van_Tran, great question! Please refer to the original paper section 3.2.1:

The reason we don’t use d is similar, a large d may lead to extremely small gradients.

It is really useful. Will there be tensorflow implementation for the remaining topics?.

Is this correct in PyTorch MLPAttention?

```
query, key = self.W_k(query), self.W_q(key)
```

I think it should be

```
query, key = self.W_q(query), self.W_k(key)
```

Thank you so much for the effort!