Attention Scoring Functions

For the batched dot product, I don’t understand why we divide by sqrt(d) instead of d. In my eyes, dividing by d perfectly removes the problem of higher dimensions blowing up the dot product, and the square root adds needless complexity. However, since the first intuition is to use d, there must be a reason for the sqrt. What am I missing here?

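Hi @Van_Tran, great question! Please refer to Section 3.2.1 of the original paper. It suspects that for large d the dot products grow large in magnitude, pushing the softmax into regions where it has extremely small gradients, and it scales the dot products by 1/sqrt(d) to counteract this.

The reason we don’t divide by d is similar: with a large d, that may also lead to extremely small gradients.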
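To make the scaling argument concrete, here is a small sketch (not from the book's code). It assumes the query and key components are independent draws from a standard normal distribution, and the helper name dot_product_std is made up for this post. Under that assumption the raw dot product has standard deviation about sqrt(d), so dividing by sqrt(d) brings it back to roughly 1, while dividing by d shrinks it to roughly 1/sqrt(d).

```python
import math
import random
import statistics


def dot_product_std(d, scale, n_samples=1000, seed=0):
    """Empirical std of (q . k) / scale.

    Assumption for this sketch: q and k have i.i.d. N(0, 1) components.
    """
    rng = random.Random(seed)
    scores = []
    for _ in range(n_samples):
        q = [rng.gauss(0.0, 1.0) for _ in range(d)]
        k = [rng.gauss(0.0, 1.0) for _ in range(d)]
        scores.append(sum(qi * ki for qi, ki in zip(q, k)) / scale)
    return statistics.pstdev(scores)


for d in (16, 64):
    raw = dot_product_std(d, 1.0)               # grows like sqrt(d)
    by_sqrt = dot_product_std(d, math.sqrt(d))  # stays close to 1
    by_d = dot_product_std(d, float(d))         # shrinks like 1 / sqrt(d)
    print(f"d={d:3d}  raw={raw:.2f}  /sqrt(d)={by_sqrt:.2f}  /d={by_d:.3f}")
```

With scores squeezed toward zero as d grows, the softmax over them moves toward a uniform distribution, which is the over-correction you get from dividing by d.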

This is really useful. Will there be a TensorFlow implementation for the remaining topics?

Hi @pyzeus, thanks! Yeah, we are writing the TF version now!

Is this line in the PyTorch MLPAttention correct?

query, key = self.W_k(query), self.W_q(key)

I think it should be

query, key = self.W_q(query), self.W_k(key)

Thank you so much for the effort!

Great call @Richard! Will fix and test it!