Under 10.1.2. MLP Attention
you write: you can imagine W_k k + W_q q as concatenating the key and value in the feature dimension and feeding them to a single hidden layer perceptron with hidden layer size h and output layer size 1. I think it is not concatenating key and value, it is concatenating key and query, right?
Yeah, that confused me too. I think they meant concatenating key and query.
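To make the "concatenating key and query" reading concrete, here is a small sketch in plain Python. The matrices and vectors are made-up numbers and `matvec` is a hypothetical helper, not anything from the book. It only checks that W_q q + W_k k gives the same hidden pre-activation as applying the side-by-side matrix [W_q, W_k] to the concatenated vector [q; k].

```python
def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

# Made-up example: hidden size h = 3, query dim 2, key dim 3.
W_q = [[1.0, 2.0], [0.0, -1.0], [0.5, 0.5]]
W_k = [[2.0, 0.0, 1.0], [1.0, 1.0, 1.0], [0.0, 3.0, -2.0]]
q = [1.0, -1.0]
k = [0.5, 2.0, 1.0]

# Form 1: project query and key separately, then add.
summed = [a + b for a, b in zip(matvec(W_q, q), matvec(W_k, k))]

# Form 2: concatenate [q; k] and use one weight matrix [W_q, W_k].
W_cat = [row_q + row_k for row_q, row_k in zip(W_q, W_k)]
x_cat = q + k
concat = matvec(W_cat, x_cat)

print(summed)
print(concat)
```

Both lists should agree, which is why the sentence in the book only makes sense with "key and query".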
In my opinion, the dot product may be better than vector summation.
Say the query vector is [w1, w2] and the key vector is [a, b].
If we use the dot product, the output is a*w1 + b*w2, so the space is a plane.
But if we use summation, the output is [a+w1, b+w2], so the space is a point in the plane.
So maybe the dot product has a bigger solution space than summation, and it can fit the dataset better than summation.
features = new_q + new_k
print("features shape:\n", new_q.shape, "\n", new_q, "\n", new_k.shape, "\n", new_k, "\n", features.shape, "\n features:", features)
I think that it is possible to pad q or k, whichever is shorter, to achieve the same length.
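A quick sketch of what that padding idea could look like, in plain Python. The vectors are made up and `pad_to` is a hypothetical helper, so treat this as an illustration of the suggestion rather than something the book does.

```python
def pad_to(v, n):
    """Right-pad a vector with zeros up to length n (hypothetical helper)."""
    return v + [0.0] * (n - len(v))

q = [1.0, 2.0]        # shorter vector (made-up values)
k = [3.0, 4.0, 5.0]   # longer vector (made-up values)

n = max(len(q), len(k))
q_pad, k_pad = pad_to(q, n), pad_to(k, n)

# With equal lengths the plain dot product is defined.
score = sum(a * b for a, b in zip(q_pad, k_pad))
print(q_pad, k_pad, score)
```

One thing to keep in mind: the padded zeros multiply the extra components of the longer vector, so those components (5.0 here) do not contribute to the score at all.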
If we compare the two formulas, we find that Additive Attention incorporates 3 learnable parameters, Wq, Wk and Wv, while Scaled Dot-Product Attention does not introduce any learnable parameters.
I would suspect that, if we actually train an Additive Attention model, the 3 learnable parameters will shape the model to behave like Scaled Dot-Product Attention. In other words, Wq and Wk will help the query project itself onto the key.
Only a suspicion…
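For anyone who wants to poke at this, here is a minimal plain-Python sketch of the two scoring functions for a single query/key pair. I am assuming the additive score has the form w_v^T tanh(W_q q + W_k k), with `w_v` standing for the parameter called Wv above, and the scaled dot-product score is q·k / sqrt(d). The function names and numbers are my own, not the book's code.

```python
import math

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def additive_score(q, k, W_q, W_k, w_v):
    """Additive scoring: three learnable parameters (W_q, W_k, w_v)."""
    hidden = [math.tanh(a + b) for a, b in zip(matvec(W_q, q), matvec(W_k, k))]
    return sum(v * h for v, h in zip(w_v, hidden))

def scaled_dot_score(q, k):
    """Scaled dot-product scoring: no learnable parameters."""
    return sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q))

# Made-up inputs and weights, just to run both functions.
q = [1.0, 2.0]
k = [3.0, 4.0]
identity = [[1.0, 0.0], [0.0, 1.0]]
w_v = [1.0, 1.0]

print(additive_score(q, k, identity, identity, w_v))
print(scaled_dot_score(q, k))
```

The scaled dot-product version needs q and k to have the same length, while the additive version only needs W_q and W_k to map both into the same hidden size.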
Suppose Wq = [wq1, wq2], Wk = [wk1, wk2], so additive scoring would be w1*wq1 + w2*wq2 + a*wk1 + b*wk2. So it would also be a "line" (in your case, a plane). So I think the solution plane is the same.
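Plugging in some made-up numbers for this 2-d example (hidden size 1, and leaving out the tanh and the output weight, as in the expression above), both scorings come out as a single scalar:

```python
# Made-up values for the example: query [w1, w2], key [a, b].
q = [0.5, -1.0]    # [w1, w2]
k = [2.0, 3.0]     # [a, b]
W_q = [0.1, 0.2]   # [wq1, wq2], hypothetical weights
W_k = [0.3, 0.4]   # [wk1, wk2], hypothetical weights

# Additive scoring before the nonlinearity:
# w1*wq1 + w2*wq2 + a*wk1 + b*wk2
additive_pre = sum(w * x for w, x in zip(W_q, q)) + sum(w * x for w, x in zip(W_k, k))

# Dot-product scoring: a*w1 + b*w2
dot = sum(x * y for x, y in zip(q, k))

print(additive_pre, dot)
```

So the summation is not left as the vector [a+w1, b+w2]; after the projection it is reduced to one number per query/key pair, just like the dot product.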
Why do we use 1e-6 instead of 0 in softmax mask?
Oops… I confused 1e-6 with -1e6.
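In case anyone else trips over the same thing, here is a small plain-Python sketch of why the fill value has to be a large negative number like -1e6 and not a tiny positive one like 1e-6 (or 0). The `masked_softmax` below is my own simplified single-row version with made-up scores, not the book's implementation.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def masked_softmax(scores, valid_len, fill):
    """Replace scores at positions >= valid_len with `fill`, then softmax."""
    masked = [s if i < valid_len else fill for i, s in enumerate(scores)]
    return softmax(masked)

scores = [1.0, 2.0, 3.0, 4.0]   # made-up attention scores
large_neg = masked_softmax(scores, 2, -1e6)  # masked positions get exp(very negative)
tiny_pos = masked_softmax(scores, 2, 1e-6)   # masked positions still get real weight

print(large_neg)
print(tiny_pos)
```

With -1e6 the masked positions end up with zero weight after the exponential, while with 1e-6 they behave like ordinary scores close to 0 and still take a visible share of the attention.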