Under 10.1.2. MLP Attention
You write: "you can imagine W_k k + W_q q as concatenating the key and value in the feature dimension and feeding them to a single hidden layer perceptron with hidden layer size h and output layer size 1". I think it is not concatenating the key and value; it is concatenating the key and the query, right?
Yeah, that confused me too. I think they meant concatenating the key and the query.
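The equivalence being discussed is easy to check numerically: applying W_q to the query and W_k to the key and summing gives exactly the same hidden pre-activation as concatenating query and key and multiplying by the concatenated matrix [W_q | W_k]. A minimal sketch (the sizes h, q_dim, k_dim here are arbitrary, just for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
h, q_dim, k_dim = 8, 4, 5          # hypothetical hidden and feature sizes
q = rng.standard_normal(q_dim)     # query
k = rng.standard_normal(k_dim)     # key
W_q = rng.standard_normal((h, q_dim))
W_k = rng.standard_normal((h, k_dim))

# Additive (MLP) attention's hidden pre-activation: W_q q + W_k k
hidden_sum = W_q @ q + W_k @ k

# Same result: one matrix [W_q | W_k] applied to the concatenated [q; k]
hidden_cat = np.concatenate([W_q, W_k], axis=1) @ np.concatenate([q, k])

print(np.allclose(hidden_sum, hidden_cat))  # True
```

So the sum really is a single hidden layer acting on the concatenation of query and key (not key and value), which supports the correction above.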
In my opinion, the dot product may be better than vector summation.
Say the query vector is [w1, w2] and the key vector is [a, b].
If we use the dot product, the output is the scalar a*w1 + b*w2, so as the key varies the scores sweep out a plane.
But if we use summation, the output is the vector [a+w1, b+w2], which is just the key shifted by the query, a single point in that plane.
So the dot product may have a bigger solution space than summation, and it may fit the dataset better.
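One caveat worth noting: the summation in additive attention is not the final score. It is followed by a tanh and a learned vector w_v that maps the sum back to a scalar, so the comparison is really scalar-bilinear vs. scalar-through-MLP. A small sketch with the [w1, w2] / [a, b] example from above (w_v here is a made-up illustrative value, not from the book):

```python
import numpy as np

q = np.array([0.5, -1.0])   # query [w1, w2]
k = np.array([2.0, 3.0])    # key   [a, b]

# Dot-product score: a*w1 + b*w2, directly a scalar
dot_score = q @ k           # 0.5*2.0 + (-1.0)*3.0 = -2.0

# Additive attention: the summation alone gives the vector [a+w1, b+w2];
# a tanh and a learned vector w_v (hypothetical values here) are still
# needed to reduce it to a scalar score
w_v = np.array([1.0, -0.5])
add_score = w_v @ np.tanh(q + k)

print(dot_score, add_score)
```

So both scoring functions end up producing a scalar per query-key pair; the difference is the shape of the function family each can represent, which is the point being debated above.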