For the batched dot product, I don’t understand why we divide by sqrt(d) instead of d. As I see it, dividing by d already stops the dot product from blowing up in higher dimensions, and the square root adds needless complexity. Since d is the obvious first choice, there must be a reason for the sqrt. What am I missing here?
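The reason we don’t use d is similar: it overcorrects. If the components of the query and key are independent with zero mean and unit variance, their dot product has variance d, so its standard deviation is sqrt(d). Dividing by sqrt(d) keeps the scores at unit variance regardless of d. Dividing by d shrinks them to a standard deviation of 1/sqrt(d), so for a large d the scores are all close to zero, the attention weights barely depend on the query and keys, and this may lead to extremely small gradients.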
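The scaling argument can be checked numerically. The following is a minimal sketch in plain Python, assuming query and key entries drawn i.i.d. from a standard normal distribution; the helper name `score_std`, the sample count, and the seed are hypothetical choices for illustration only.

```python
import math
import random
import statistics

def score_std(d, scale, n_samples=1000, seed=0):
    """Empirical std of (q . k) / scale for q, k with i.i.d. N(0, 1) entries.

    Illustrative helper (not from any library); the same seed is reused so the
    three scalings below are compared on identical random draws.
    """
    rng = random.Random(seed)
    scores = []
    for _ in range(n_samples):
        dot = sum(rng.gauss(0, 1) * rng.gauss(0, 1) for _ in range(d))
        scores.append(dot / scale)
    return statistics.pstdev(scores)

for d in (16, 64, 256):
    raw = score_std(d, 1.0)               # grows roughly like sqrt(d)
    by_sqrt = score_std(d, math.sqrt(d))  # stays close to 1
    by_d = score_std(d, float(d))         # shrinks roughly like 1/sqrt(d)
    print(f"d={d:4d}  raw={raw:7.2f}  /sqrt(d)={by_sqrt:5.2f}  /d={by_d:6.3f}")
```

Under these assumptions the unscaled column should grow with d, the sqrt(d) column should stay near 1, and the d column should keep shrinking, which is the overcorrection described above.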
It is really useful. Will there be a TensorFlow implementation for the remaining topics?
Hi @pyzeus, thanks! Yeah, we are writing the TF version now!
Is this correct in PyTorch MLPAttention?
query, key = self.W_k(query), self.W_q(key)
I think it should be
query, key = self.W_q(query), self.W_k(key)
Thank you so much for the effort!
Great call @Richard! Will fix and test it!
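To illustrate why the order of the projections matters, here is a minimal sketch of additive attention scoring in plain Python rather than PyTorch. The helpers `make_linear` and `additive_score`, and the sizes 4, 6, and 8, are hypothetical and chosen only so that the query and key dimensions differ; this is not the book's `MLPAttention` code.

```python
import math
import random

def make_linear(in_features, out_features, rng):
    """Return a bias-free linear map (illustrative stand-in for a linear layer)."""
    weight = [[rng.uniform(-1, 1) for _ in range(in_features)]
              for _ in range(out_features)]

    def apply(x):
        # Reject inputs whose size does not match the layer's input size.
        if len(x) != in_features:
            raise ValueError(f"expected input of size {in_features}, got {len(x)}")
        return [sum(w * xi for w, xi in zip(row, x)) for row in weight]

    return apply

rng = random.Random(0)
query_size, key_size, units = 4, 6, 8   # assumed sizes, deliberately different
W_q = make_linear(query_size, units, rng)
W_k = make_linear(key_size, units, rng)
v = make_linear(units, 1, rng)

query = [rng.gauss(0, 1) for _ in range(query_size)]
key = [rng.gauss(0, 1) for _ in range(key_size)]

def additive_score(q_proj, k_proj):
    """Additive score: v applied to tanh(projected query + projected key)."""
    hidden = [math.tanh(a + b) for a, b in zip(q_proj, k_proj)]
    return v(hidden)[0]

# Matching order: W_q projects the query, W_k projects the key.
score = additive_score(W_q(query), W_k(key))
print("score with matching projections:", score)

# Swapped order: fails here because query_size != key_size.
try:
    additive_score(W_k(query), W_q(key))
    swapped_failed = False
except ValueError as err:
    swapped_failed = True
    print("swapped projections fail:", err)
```

In this sketch the swap is caught only because the query and key sizes differ. If the two sizes were equal, the swapped version would run without error, which is one way such a mix-up can go unnoticed.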