Questions of Attention Mechanisms

In the MLP attention section, does W_k k + W_q q mean concatenation in PyTorch? It's said that

Intuitively, you can imagine W_k k + W_q q as concatenating the key and the query in the feature dimension

So are we adding or concatenating?

Hi @pyzeus, great question. Here, the sentence is interpreting the "W_k k + W_q q" in formula 10.1.7 as a kind of concatenation, rather than reading it purely as addition in the traditional mathematical sense. As you can see in the PyTorch implementation, we broadcast the query and key in the forward function. Let me know if that helps!
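To make the "sum as concatenation" reading concrete, here is a minimal sketch (dimensions and tensor names are my own, not from the book): stacking W_q and W_k side by side into one matrix and multiplying it by the concatenated vector [q; k] gives exactly the same result as computing W_q q + W_k k, because matrix multiplication distributes over the blocks.

```python
import torch

torch.manual_seed(0)
d_q, d_k, h = 4, 6, 8  # hypothetical query size, key size, hidden size
q = torch.randn(d_q)
k = torch.randn(d_k)
W_q = torch.randn(h, d_q)
W_k = torch.randn(h, d_k)

# Sum form, as written in the formula: W_q q + W_k k
sum_form = W_q @ q + W_k @ k

# Equivalent "concatenation" view: put W_q and W_k side by side
# (shape (h, d_q + d_k)) and multiply by the concatenated features [q; k].
W = torch.cat([W_q, W_k], dim=1)
concat_form = W @ torch.cat([q, k], dim=0)

print(torch.allclose(sum_form, concat_form))  # True
```

So the two views describe the same computation; the sum form just avoids materializing the concatenated tensor, which is why the code never needs an explicit torch.cat.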

Sorry for the late reply. I understand that we are not adding the query and key. But why don't we call torch.cat([query, key], dim=-1) if we want to concatenate along the feature dimension? That is how it should be done, right? Or did I misconstrue the statement?