Attention Scoring Functions

https://d2l.ai/chapter_attention-mechanisms/attention-scoring-functions.html

Under 10.1.2. MLP Attention, you write: "you can imagine $\mathbf{W}_k\mathbf{k} + \mathbf{W}_q\mathbf{q}$ as concatenating the key and value in the feature dimension and feeding them to a single hidden layer perceptron with hidden layer size $h$ and output layer size 1". I think it is not concatenating the key and value, it is concatenating the key and query, right?


yeah, that confused me. I think they meant concatenating key and query
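
A quick way to see the equivalence being described (a minimal sketch; the query size, key size, and hidden size below are arbitrary illustrative choices, not from the book): summing the two projections $\mathbf{W}_q\mathbf{q} + \mathbf{W}_k\mathbf{k}$ gives exactly the same result as concatenating $\mathbf{q}$ and $\mathbf{k}$ and applying one linear layer whose weight is the block matrix $[\mathbf{W}_q, \mathbf{W}_k]$, i.e. the first layer of a single-hidden-layer MLP.

import torch

# Illustrative sizes (assumptions): query dim 4, key dim 6, hidden size 8
q = torch.randn(4)
k = torch.randn(6)
W_q = torch.randn(8, 4)
W_k = torch.randn(8, 6)

# Sum of the two separate projections: W_q q + W_k k
summed = W_q @ q + W_k @ k
# One linear layer applied to the concatenation [q; k]
W = torch.cat([W_q, W_k], dim=1)         # shape (8, 4 + 6)
concatenated = W @ torch.cat([q, k])

print(torch.allclose(summed, concatenated))  # True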

Exercise 3

In my opinion, the dot product may be better than vector summation.
For example, suppose the query vector is $[w_1, w_2]$ and the key vector is $[a, b]$.
If we use the dot product, the output is the scalar $a w_1 + b w_2$, so the query and key interact multiplicatively and the scores sweep out a plane as the key varies.
But if we use summation, the output is the vector $[a + w_1, b + w_2]$: the query only translates the key to a single point in that plane.

So the dot product may have a bigger solution space than summation and can fit the dataset better.
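
A small sketch of the difference (purely illustrative numbers, not from the exercise): the dot product collapses a query–key pair directly into a scalar score, whereas summation keeps a vector, which is why additive attention still needs the extra $\mathbf{w}_v^\top \tanh(\cdot)$ projection before it yields a score.

import torch

q = torch.tensor([0.5, -1.0])   # query [w1, w2] (illustrative values)
k = torch.tensor([2.0, 3.0])    # key [a, b]

dot_score = torch.dot(q, k)     # scalar: a*w1 + b*w2
summed = q + k                  # vector: [a + w1, b + w2], still needs a projection to become a score

print(dot_score)   # tensor(-2.)  since 0.5*2.0 + (-1.0)*3.0 = -2.0
print(summed)      # tensor([2.5, 2.0])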

After adding a bunch of print statements, I finally got a feel for how this works. The broadcasting mechanism is used nicely here: it lets every query interact with the whole set of keys, find the corresponding weights, and then produce the corresponding output values.
The dimensions here are worth thinking through carefully.
import torch

queries = torch.normal(0, 1, (2, 1, 8))  # illustrative shape (batch, num_queries, h)
keys = torch.ones((2, 10, 8))            # illustrative shape (batch, num_kv, h)
new_q = queries.unsqueeze(2)             # (batch, num_queries, 1, h)
new_k = keys.unsqueeze(1)                # (batch, 1, num_kv, h)
features = new_q + new_k                 # broadcasts to (batch, num_queries, num_kv, h)
print("features shape:\n", new_q.shape, "\n", new_k.shape, "\n", features.shape)
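
Continuing the sketch above with the same assumed shapes (batch=2, num_queries=1, num_kv=10, h=8; the value dimension 4 is also just an illustrative choice): the rest of additive attention squashes features with tanh, projects the hidden dimension down to one score per query–key pair, and uses the softmaxed scores to weight the values.

import torch
from torch import nn

features = torch.randn(2, 1, 10, 8)           # (batch, num_queries, num_kv, h), as above
values = torch.randn(2, 10, 4)                # (batch, num_kv, value_dim)

w_v = nn.Linear(8, 1, bias=False)             # projects hidden size h down to a single score
scores = w_v(torch.tanh(features)).squeeze(-1)     # (batch, num_queries, num_kv)
attention_weights = torch.softmax(scores, dim=-1)  # one weight per key, per query
output = torch.bmm(attention_weights, values)      # (batch, num_queries, value_dim)
print(output.shape)  # torch.Size([2, 1, 4])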