Attention Scoring Functions

Under 10.1.2. MLP Attention
you write: "you can imagine W_k k + W_q q as concatenating the key and value in the feature dimension and feeding them to a single hidden layer perceptron with hidden layer size h and output layer size 1". I think this is not concatenating the key and value, but the key and query, right?


yeah, that confused me. I think they meant concatenating key and query
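
A quick numerical check of the key-and-query reading, as a minimal sketch: summing the two projections gives the same result as applying one weight matrix to the concatenated [query; key] vector. The sizes and random values below are made up for illustration, and NumPy is used only to keep the check small.

```python
import numpy as np

rng = np.random.default_rng(0)
q_dim, k_dim, h = 4, 3, 5  # hypothetical query size, key size, hidden size

W_q = rng.normal(size=(h, q_dim))
W_k = rng.normal(size=(h, k_dim))
q = rng.normal(size=q_dim)
k = rng.normal(size=k_dim)

# Two separate projections, summed (the form used in the text).
separate = W_q @ q + W_k @ k

# One projection applied to the concatenated [query; key] vector.
W = np.concatenate([W_q, W_k], axis=1)  # shape (h, q_dim + k_dim)
x = np.concatenate([q, k])              # shape (q_dim + k_dim,)
combined = W @ x

print(np.allclose(separate, combined))
```

The value vector never appears in the score, which is consistent with the sentence meaning key and query.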

Exercise 3

In my opinion, the dot product may be better than vector summation.
Example:
the query vector is [w1, w2] and the key vector is [a, b].
If we use the dot product, the output is a*w1 + b*w2, and the space is a plane,
but if we use summation, the output is [a+w1, b+w2], and the space is a point in the plane.

So maybe the dot product has a bigger solution space than summation and can fit the dataset better.

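import torch
queries = torch.ones(2, 1, 8)   # example shape: (batch_size, no. of queries, num_hiddens)
keys = torch.ones(2, 10, 8)     # example shape: (batch_size, no. of key-value pairs, num_hiddens)
new_q = queries.unsqueeze(2)    # (batch_size, no. of queries, 1, num_hiddens)
new_k = keys.unsqueeze(1)       # (batch_size, 1, no. of key-value pairs, num_hiddens)
features = new_q + new_k        # broadcast sum over the two inserted axes
print("features shape:\n", new_q.shape, "\n", new_q, "\n", new_k.shape, "\n", new_k, "\n", features.shape, "\n features:", features)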
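
To see what the broadcast sum in the snippet above produces, here is a small sketch with NumPy standing in for PyTorch (indexing with None plays the role of unsqueeze). The shapes and random inputs are illustrative assumptions, not values from the book.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, num_q, num_kv, h = 2, 1, 10, 8  # hypothetical sizes

queries = rng.normal(size=(batch, num_q, h))
keys = rng.normal(size=(batch, num_kv, h))

new_q = queries[:, :, None, :]  # like unsqueeze(2): (batch, num_q, 1, h)
new_k = keys[:, None, :, :]     # like unsqueeze(1): (batch, 1, num_kv, h)
features = new_q + new_k        # broadcasts to (batch, num_q, num_kv, h)

print(features.shape)
# Entry [b, i, j] is the sum of query i and key j in batch element b.
print(np.allclose(features[1, 0, 3], queries[1, 0] + keys[1, 3]))
```

So every query is paired with every key without an explicit loop.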

Exercise 2:

I think that it is possible to pad q or k, whichever is shorter, to achieve the same length.
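
A minimal sketch of the padding idea, assuming zero-padding at the end of the shorter vector (the vectors and the choice of scaling by the padded length are made up for illustration):

```python
import numpy as np

q = np.array([1.0, 2.0, 3.0, 4.0])  # hypothetical query, length 4
k = np.array([0.5, -1.0])           # hypothetical key, length 2

d = max(len(q), len(k))
# Zero-pad whichever vector is shorter so both have length d.
q_pad = np.pad(q, (0, d - len(q)))
k_pad = np.pad(k, (0, d - len(k)))

score = q_pad @ k_pad / np.sqrt(d)  # scaled dot product on padded vectors
print(score)
```

One thing to keep in mind: the padded zeros mean the extra components of the longer vector (here q[2] and q[3]) do not influence the score at all.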

Exercise 3:

If we compare the two formulas, one can see that Additive Attention incorporates 3 learnable parameters (Wq, Wk and Wv), while Scaled Dot-Product Attention does not introduce any learnable parameters.
I suspect that, if we actually train the Additive Attention model, the 3 learnable parameters will shape the model to behave like Scaled Dot-Product Attention. In other words, Wq and Wk will help the query project itself onto the key.
Only a suspicion…

Not really.
Suppose Wq = [wq1, wq2] and Wk = [wk1, wk2], so the additive scoring would be w1*wq1 + w2*wq2 + a*wk1 + b*wk2. So it would also be a “line” (in your case, a plane). So I think the solution space is the same.

Why do we use 1e-6 instead of 0 in softmax mask?

Oops… I confused 1e-6 with -1e6.
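
For anyone else who trips over this, a small sketch of why the mask value has to be a large negative number rather than 0 or 1e-6. The scores and valid length are made-up values, and this plain-Python softmax is only a stand-in for the book's masked softmax.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

scores = [2.0, 1.0, 0.5, 3.0]  # hypothetical attention scores
valid_len = 2                  # only the first two positions are real

# Mask the padded positions with -1e6 (large negative) vs. 1e-6 (tiny positive).
masked_neg = [s if i < valid_len else -1e6 for i, s in enumerate(scores)]
masked_small = [s if i < valid_len else 1e-6 for i, s in enumerate(scores)]

w_neg = softmax(masked_neg)
w_small = softmax(masked_small)
print(w_neg)    # masked positions get weight 0.0
print(w_small)  # masked positions still get a noticeable weight
```

Softmax exponentiates the scores, so a score near 0 becomes roughly exp(0) = 1 before normalization, while exp(-1e6) underflows to 0.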

My solutions to the exercises: 11.3


Actually, the implementation cheats ever so slightly by setting the values of v_𝑖, for 𝑖 > 𝑙, to zero. Moreover, it sets the attention weights to a large negative number, such as −10^6, in order to make their contribution to gradients and values vanish in practice.

IMHO, it doesn’t make sense and doesn’t match the code below. Can anyone help?

In section 11.3.4, first code block, last comment, you say

```
# Shape of values: (batch_size, no. of key-value pairs, value
# dimension)
```

It should be

```
# Shape of values: (batch_size, no. of query pairs, value
# dimension)
```

, right?
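
One way to check the shapes is a quick sketch, assuming the output is computed as a batched matrix product of the attention weights and the values (NumPy's @ stands in for a batched matmul here; all sizes are hypothetical):

```python
import numpy as np

batch, num_q, num_kv, v_dim = 2, 1, 10, 4  # hypothetical sizes

# One weight per (query, key-value pair): (batch, num_q, num_kv).
attention_weights = np.full((batch, num_q, num_kv), 1.0 / num_kv)
# For the product to be defined, values needs num_kv on its middle axis.
values = np.arange(batch * num_kv * v_dim, dtype=float).reshape(batch, num_kv, v_dim)

output = attention_weights @ values  # batched matmul over the first axis
print(output.shape)                  # one output row per query
```

Under that assumption, the middle axis of values has to match the last axis of the attention weights, and the number of queries only shows up in the output shape.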