https://d2l.ai/chapter_attention-mechanisms-and-transformers/attention-scoring-functions.html
For the batched dot product I don’t understand why we divide by sqrt(d) instead of d. In my eyes, dividing by d perfectly takes care of the problem of higher dimensions blowing up the dot product, and the square root just adds needless complexity. But since the first intuition is to use d, there must be a reason for the sqrt. What am I missing here?
Hi @Van_Tran, great question! Please refer to Section 3.2.1 of the original paper: for large d, the dot products grow large in magnitude, pushing the softmax into regions where it has extremely small gradients, which is why the scores are scaled down at all.
The reason we don’t divide by d instead is similar: shrinking the scores by a factor of d is too aggressive for large d and may also lead to extremely small gradients.
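A minimal numerical sketch of that saturation argument (illustrative values only, not from the thread): with unscaled scores the softmax is essentially one-hot and its gradients are tiny, whereas scaling by 1/sqrt(d) keeps the probabilities spread out and the gradients usable.

import torch

# Sketch: how the scale of the scores affects the softmax and its gradients.
# With q, k ~ N(0, 1) i.i.d. and d large, the raw dot products have variance
# about d, so the unscaled softmax saturates.
torch.manual_seed(0)
d, n = 512, 10                        # dimension and number of keys (illustrative)
q = torch.randn(d)
K = torch.randn(n, d)

for name, scale in [("no scaling", 1.0), ("1/sqrt(d)", d ** 0.5)]:
    scores = (K @ q / scale).requires_grad_(True)
    probs = torch.softmax(scores, dim=0)
    probs[0].backward()               # gradient of one attention weight w.r.t. the scores
    print(f"{name:>10}:  max prob = {probs.max():.3f},  "
          f"grad norm = {scores.grad.norm():.2e}")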
It is really useful. Will there be a TensorFlow implementation for the remaining topics?
Is this correct in the PyTorch MLPAttention implementation?
query, key = self.W_k(query), self.W_q(key)
I think it should be
query, key = self.W_q(query), self.W_k(key)
Thank you so much for the effort!
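For reference, here is a minimal sketch of additive (MLP) attention with the projections applied as suggested above, i.e. W_q on the queries and W_k on the keys. The class and parameter names (MLPAttention, key_size, query_size, num_hiddens) are illustrative rather than the book’s exact code, and masking is omitted for brevity.

import torch
from torch import nn

class MLPAttention(nn.Module):
    def __init__(self, key_size, query_size, num_hiddens, dropout=0.0):
        super().__init__()
        self.W_k = nn.Linear(key_size, num_hiddens, bias=False)
        self.W_q = nn.Linear(query_size, num_hiddens, bias=False)
        self.w_v = nn.Linear(num_hiddens, 1, bias=False)
        self.dropout = nn.Dropout(dropout)

    def forward(self, queries, keys, values):
        # Apply each projection to its own input -- not swapped.
        queries, keys = self.W_q(queries), self.W_k(keys)
        # Broadcast to shape (batch, num_queries, num_keys, num_hiddens).
        features = torch.tanh(queries.unsqueeze(2) + keys.unsqueeze(1))
        scores = self.w_v(features).squeeze(-1)    # (batch, num_queries, num_keys)
        attention_weights = torch.softmax(scores, dim=-1)
        return torch.bmm(self.dropout(attention_weights), values)

# Quick shape check
attn = MLPAttention(key_size=4, query_size=6, num_hiddens=8)
q = torch.randn(2, 1, 6)    # (batch, num_queries, query_size)
k = torch.randn(2, 5, 4)    # (batch, num_keys, key_size)
v = torch.randn(2, 5, 3)    # (batch, num_keys, value_size)
print(attn(q, k, v).shape)  # torch.Size([2, 1, 3])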
Hi @Van_Tran! Let’s call the batched dot product X. As mentioned in the text, Var(X) = d. In order to reduce the variance to 1, you have to divide X by sqrt(d), since Var(aX) = a^2 * Var(X). If you divided X by d instead, the variance of the result would be 1/d, which can be too small for large values of d.
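A quick numerical check of that variance argument (assuming i.i.d. standard normal queries and keys, as in the text):

import torch

# With q, k ~ N(0, 1) i.i.d., Var(q.k) is about d, so dividing by sqrt(d)
# keeps the variance near 1, while dividing by d shrinks it to about 1/d.
torch.manual_seed(0)
d, n = 256, 10_000
q = torch.randn(n, d)
k = torch.randn(n, d)
dots = (q * k).sum(dim=1)

print(f"Var(q.k)           = {dots.var():8.2f}   (about d = {d})")
print(f"Var(q.k / sqrt(d)) = {(dots / d**0.5).var():8.4f}   (about 1)")
print(f"Var(q.k / d)       = {(dots / d).var():8.4f}   (about 1/d = {1/d:.4f})")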
Why not add a bias term when calculating the attention score? Anyone know?
@zophy I guess because we don’t need the scores to be shifted by a bias when the keys and queries are close to zero.
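One way to see why a bias added directly to the scores would not help, assuming that is what the question means: the softmax is invariant to adding the same constant to every score for a given query, so such a bias simply cancels out. A small sketch:

import torch

# A bias that is constant across the keys for a given query cancels in the
# softmax, because softmax is invariant to adding the same constant to
# every score.
torch.manual_seed(0)
scores = torch.randn(1, 4, 6)       # (batch, num_queries, num_keys)
bias = 3.7                          # hypothetical shared bias term

weights = torch.softmax(scores, dim=-1)
weights_biased = torch.softmax(scores + bias, dim=-1)
print(torch.allclose(weights, weights_biased))  # True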