Questions on Attention Mechanisms

In the MLP attention section, does W_k k + W_q q mean concatenation in PyTorch? It's said that

Intuitively, you can imagine W_k k + W_q q as concatenating the key and value in the feature dimension

So are we adding or concatenating?

Hi @pyzeus, great question. Here, the sentence is giving an intuitive interpretation of the term W_k k + W_q q in formula (10.1.7), rather than treating the "+" purely as addition in the sense of a traditional math formula. As you can see in the PyTorch implementation, we broadcast the projected query and key and add them in the forward function. Let me know if this helps!

https://d2l.ai/chapter_attention-mechanisms/attention.html
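For reference, here is a minimal sketch of that broadcasting pattern. It loosely follows the additive attention scoring function from the chapter, but the batch size, hidden size, and feature dimensions below are just illustrative assumptions, not the book's exact numbers:

```python
import torch
from torch import nn

# Illustrative shapes only.
queries = torch.randn(2, 1, 20)   # (batch, no. of queries, query feature dim)
keys = torch.randn(2, 10, 2)      # (batch, no. of key-value pairs, key feature dim)

num_hiddens = 8
W_q = nn.Linear(20, num_hiddens, bias=False)
W_k = nn.Linear(2, num_hiddens, bias=False)
w_v = nn.Linear(num_hiddens, 1, bias=False)

# Project queries and keys into the same hidden space, then broadcast-add:
# (2, 1, 1, 8) + (2, 1, 10, 8) -> (2, 1, 10, 8)
features = torch.tanh(W_q(queries).unsqueeze(2) + W_k(keys).unsqueeze(1))

# One scalar score per (query, key) pair.
scores = w_v(features).squeeze(-1)
print(scores.shape)  # torch.Size([2, 1, 10])
```

So the code never calls torch.cat; the sum of the two projections plays the role that a concatenation followed by a single joint projection would play.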

Sorry for the late reply. I understood that we are not simply adding the raw query and key. But why don't we do torch.cat([query, key], dim=-1) if we want to concatenate along the feature dimension? Isn't that how it should be done, or did I misconstrue the statement? A toy sketch of what I mean is below.
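Here is the kind of thing I have in mind (a toy sketch with made-up dimensions, not the book's code): concatenate query and key along the feature dimension, then apply a single linear layer.

```python
import torch
from torch import nn

# Made-up dimensions, single query-key pair per batch element.
query = torch.randn(4, 20)   # (batch, query feature dim)
key = torch.randn(4, 2)      # (batch, key feature dim)

# One joint projection applied to the concatenated features.
W = nn.Linear(20 + 2, 8, bias=False)
features = W(torch.cat([query, key], dim=-1))
print(features.shape)  # torch.Size([4, 8])
```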