Attention Scoring Functions

astonzhang · September 17, 2020, 4:52am

https://d2l.ai/chapter_attention-mechanisms-and-transformers/attention-scoring-functions.html

NickYi1990 · October 23, 2020, 6:51am

Under 10.1.2. MLP Attention
you writes: you can imagine 𝐖𝑘𝐤+𝐖𝑞𝐪Wkk+Wqq as concatenating the key and value in the feature dimension and feeding them to a single hidden layer perceptron with hidden layer size ℎh and output layer size 1, I think is not concatenating key and value, is concatenating key and query, right?

jorge · December 16, 2020, 10:56pm

yeah, that confused me. I think they meant concatenating key and query

min_xu · April 16, 2021, 2:21am

Exercises 3

in my opinion, may be dot product may be better than vector summation.
exp:
the query vector is [w1,w2] and the key vector is [a,b]
if we use dot product ouput of dot is aw1+bw2 the space is a plane
but if we use summation the output is [a+w1,b+w2] the space is a dot of the plane.

so may be the dot product have a bigger solution space than summation, it can fit the dataset better than summation

zppet · September 7, 2021, 3:51am

加了一堆打印，终于摸清了一点门道。这广播机制用得好，让每个query都可以跟一组keys相互作用，找到与之对应的权重，从而得到相应的输出value.
这里的维度得好好思考思考。
new_q=queries.unsqueeze(2)
new_k=keys.unsqueeze(1)
features = new_q+ new_k
print(“features shape:\n”,new_q.shape,"\n",new_q,"\n", new_k.shape,"\n",new_k, “\n”,features.shape, “\n features:”,features)

captainst · September 25, 2021, 11:04am

Exercise2:

I think that it is possible to pad q or k, whichever is shorter, to achieve the same length.

Exercise3:

If we compare the two formulars, one can find that Additive Attention incorporates 3 learnable parameters: W_q, W_k and W_v while Scaled Dot-Product Attention does not introduce any learnable parameter.
I would suspect that, if we actually train the model of Additive Attention, the 3 learnable parameters will shape the model to behave like Scaled Dot-Product Attention. In otherwords, the W_q, W_k will help query to project itself onto key
Only a suspicion…

vnminhhuy2001 · January 16, 2022, 1:02pm

Not really.
Suppose Wq = [wq1, wq2], Wk = [wk1, wk2], so additive scoring would be w1xwq1 + w2xwq2 + axwk1 + bxwk2. So it would also be a “line” (In your case a plane). So, I think solution plane is the same.

CaffeinatedSuperhero · April 21, 2022, 6:05am

Why do we use 1e-6 instead of 0 in softmax mask?

ops… I confused 1e-6 with -1e6.

pandalabme · September 10, 2023, 1:43am

My solutions to the exs: 11.3

dingcurie · September 16, 2023, 12:27pm

Actually, the implementation cheats ever so slightly by setting the values of v_𝑖, for 𝑖 > 𝑙, to zero. Moreover, it sets the attention weights to a large negative number, such as −10^6, in order to make their contribution to gradients and values vanish in practice.

IMHO, it doesn’t make sense and doesn’t match the code below. Can anyone help?

huanliu · April 19, 2024, 2:42pm

In section 11.3.4, first code block, last comment, you say

   # Shape of values: (batch_size, no. of key-value pairs, value
        # dimension)

It should be

   # Shape of values: (batch_size, no. of query pairs, value
        # dimension)

, right?

fadecomic · September 27, 2024, 7:55pm

One thing that helped me understand the additive attention mechanism is that q and k may not just be different sizes. The dimensions of q and k can (and probably do in something like NLP) represent totally different things. All you can say for sure is there is some unknown space where they represent the same thing. In translation for example, both q and k represent some stage in the meaning of a sentence. Wq and Wk are transformations that put q and k in the same space. Obviously, we don’t know what those transformations are, but we let the machine learn them. They just let you add q and k meaningfully. I.e. Wq q = q’ and Wk k = k’ and now q’1 and k’1 represent the same thing. The tanh does the same thing as dividing by sqrt(d). It makes the sum 0 centered with a variance of 1. Wv does the adding, with some learned weights to basically focus what part of this new common space contribute the most to the sum. If q and k are already on the same space, this reduces to the scaled dot product.