Attention Scoring Functions

https://d2l.ai/chapter_attention-mechanisms-and-transformers/attention-scoring-functions.html

Under 10.1.2. MLP Attention
you write: "you can imagine W_q k + W_q q as concatenating the key and value in the feature dimension and feeding them to a single hidden layer perceptron with hidden layer size h and output layer size 1." I think it is not concatenating the key and value; it is concatenating the key and query, right?

1 Like

yeah, that confused me. I think they meant concatenating key and query
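
To see why "key and query" is the right reading, here is a minimal sketch (the sizes are made up) showing that W_q q + W_k k is exactly a single linear layer applied to the concatenation of the query and key in the feature dimension:

```python
import torch

# Hypothetical sizes, just for illustration.
q_dim, k_dim, h = 4, 6, 8
q, k = torch.randn(q_dim), torch.randn(k_dim)

W_q = torch.randn(h, q_dim)
W_k = torch.randn(h, k_dim)

# Additive attention computes W_q q + W_k k ...
separate = W_q @ q + W_k @ k

# ... which equals one linear layer applied to the concatenation [q; k].
W_cat = torch.cat([W_q, W_k], dim=1)      # shape (h, q_dim + k_dim)
concatenated = W_cat @ torch.cat([q, k])  # shape (h,)

print(torch.allclose(separate, concatenated))  # True
```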

Exercises 3

In my opinion, the dot product may be better than vector summation.
Example:
the query vector is [w1, w2] and the key vector is [a, b].
If we use the dot product, the output is a*w1 + b*w2, so the scores span a whole plane,
but if we use summation, the output is [a+w1, b+w2], which is only a single point of that plane.

So the dot product may have a bigger solution space than summation, and it can fit the dataset better.
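A quick numeric sketch of the difference (values are made up): the dot product collapses the query/key pair into one scalar score directly, while plain summation only produces a feature vector that still has to be reduced to a score, which is what the tanh and w_v in additive attention are for.

```python
import torch

q = torch.tensor([0.5, -1.0])   # query [w1, w2]
k = torch.tensor([2.0, 3.0])    # key   [a, b]

# Dot product: a single scalar score, a*w1 + b*w2.
dot_score = torch.dot(q, k)     # tensor(-2.)

# Plain summation: still a 2-d vector [a+w1, b+w2], not a score yet;
# additive attention reduces it to a scalar with tanh and w_v.
sum_features = q + k            # tensor([2.5, 2.0])
w_v = torch.tensor([1.0, 1.0])  # hypothetical reduction vector
sum_score = torch.dot(w_v, torch.tanh(sum_features))
print(dot_score, sum_features, sum_score)
```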

After adding a bunch of print statements, I finally got a feel for what is going on. The broadcasting here is used nicely: it lets every query interact with the whole set of keys, find the weights corresponding to it, and thus produce the corresponding output value.
The dimensions here take some careful thought.
new_q = queries.unsqueeze(2)
new_k = keys.unsqueeze(1)
features = new_q + new_k
print("features shape:\n", new_q.shape, "\n", new_q, "\n", new_k.shape, "\n", new_k, "\n", features.shape, "\n features:", features)
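
For reference, here is a self-contained version of that experiment with made-up shapes, so the broadcast is visible without running the full model:

```python
import torch

# Hypothetical sizes: batch of 2, 3 queries, 5 key-value pairs, hidden size 4.
batch, n_q, n_k, h = 2, 3, 5, 4
queries = torch.randn(batch, n_q, h)  # already projected by W_q
keys = torch.randn(batch, n_k, h)     # already projected by W_k

new_q = queries.unsqueeze(2)  # (batch, n_q, 1, h)
new_k = keys.unsqueeze(1)     # (batch, 1, n_k, h)
features = new_q + new_k      # broadcast to (batch, n_q, n_k, h)

# Every query is now paired with every key; a final w_v of shape (h,)
# would reduce the last dimension to one score per (query, key) pair.
print(new_q.shape, new_k.shape, features.shape)
```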

Exercise2:

I think that it is possible to pad q or k, whichever is shorter, to achieve the same length.
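For example (just a sketch with made-up sizes), one could zero-pad the shorter of the two vectors in the feature dimension so that the dot product is defined:

```python
import torch
import torch.nn.functional as F

q = torch.randn(3)  # query with 3 features
k = torch.randn(5)  # key with 5 features

# Zero-pad the shorter vector on the right so both have the same length.
if q.numel() < k.numel():
    q = F.pad(q, (0, k.numel() - q.numel()))
else:
    k = F.pad(k, (0, q.numel() - k.numel()))

score = torch.dot(q, k)
print(q.shape, k.shape, score)
```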

Exercise3:

If we compare the two formulas, one can find that additive attention incorporates 3 learnable parameters: W_q, W_k and w_v, while scaled dot-product attention does not introduce any learnable parameter.
I would suspect that, if we actually train the additive attention model, the 3 learnable parameters will shape the model to behave like scaled dot-product attention. In other words, W_q and W_k will help project the query onto the key.
Only a suspicion…
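To make the parameter-count comparison concrete, here is a minimal sketch (my own stripped-down scorers, not the book's classes; sizes are illustrative): the additive score carries W_q, W_k and w_v, while the scaled dot product carries nothing learnable.

```python
import math
import torch
from torch import nn

class AdditiveScore(nn.Module):
    """Additive (MLP) scoring: w_v^T tanh(W_q q + W_k k)."""
    def __init__(self, q_dim, k_dim, h):
        super().__init__()
        self.W_q = nn.Linear(q_dim, h, bias=False)
        self.W_k = nn.Linear(k_dim, h, bias=False)
        self.w_v = nn.Linear(h, 1, bias=False)

    def forward(self, q, k):
        # (n_q, 1, h) + (1, n_k, h) broadcasts to (n_q, n_k, h), reduced to (n_q, n_k)
        feats = torch.tanh(self.W_q(q).unsqueeze(1) + self.W_k(k).unsqueeze(0))
        return self.w_v(feats).squeeze(-1)

def dot_score(q, k):
    # Scaled dot product: no learnable parameters at all.
    return q @ k.T / math.sqrt(q.shape[-1])

q, k = torch.randn(3, 6), torch.randn(5, 6)
add = AdditiveScore(6, 6, 8)
print(sum(p.numel() for p in add.parameters()))  # 6*8 + 6*8 + 8 weights
print(add(q, k).shape, dot_score(q, k).shape)    # both (3, 5)
```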

Not really.
Suppose W_q = [wq1, wq2] and W_k = [wk1, wk2]; then the additive score would be w1*wq1 + w2*wq2 + a*wk1 + b*wk2. So it is also a "line" (in your case, a plane). So I think the solution space is the same.

Why do we use 1e-6 instead of 0 in softmax mask? :thinking:


Oops… I confused 1e-6 with -1e6. :rofl:

My solutions to the exercises: 11.3

1 Like

Actually, the implementation cheats ever so slightly by setting the values of v_i, for i > l, to zero. Moreover, it sets the attention weights to a large negative number, such as -10^6, in order to make their contribution to gradients and values vanish in practice.

IMHO, it doesn’t make sense and doesn’t match the code below. Can anyone help?
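
If it helps, my reading is that the large negative number is written into the pre-softmax scores at the masked positions (not into the weights themselves), which makes the post-softmax weights there effectively zero, so the corresponding values contribute (almost) nothing to the output or the gradients. A minimal 2-D sketch of that trick, under my own simplifying assumptions (the book's masked_softmax also handles 3-D tensors and repeated valid lengths, which I skip here):

```python
import torch
from torch import nn

def masked_softmax(scores, valid_lens):
    """Softmax over the last axis, ignoring positions beyond valid_lens.

    A sketch in the spirit of the book's masked_softmax: positions past the
    valid length get a large negative score (-1e6), so after the softmax
    their weights are effectively zero.
    """
    n_k = scores.shape[-1]
    mask = torch.arange(n_k)[None, :] < valid_lens[:, None]  # (batch, n_k)
    masked = scores.clone()
    masked[~mask] = -1e6
    return nn.functional.softmax(masked, dim=-1)

scores = torch.randn(2, 4)
valid_lens = torch.tensor([2, 3])
print(masked_softmax(scores, valid_lens))
# Masked positions get (near-)zero weight: exp(-1e6) underflows to 0,
# whereas using 0 (or 1e-6) as the fill value would give exp(0) = 1 and a
# non-negligible weight.
```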