https://d2l.ai/chapter_attention-mechanisms-and-transformers/attention-scoring-functions.html

Under 10.1.2. MLP Attention

you write: "you can imagine W_k k + W_q q as concatenating the key and value in the feature dimension and feeding them to a single hidden layer perceptron with hidden layer size h and output layer size 1". I think it is not concatenating the key and value, it is concatenating the key and query, right?

Yeah, that confused me too. I think they meant concatenating the key and query.

## Exercises 3

In my opinion, the dot product may be better than vector summation.

Example:

the query vector is [w1, w2] and the key vector is [a, b].

If we use the dot product, the output is a*w1 + b*w2, so the space of possible outputs is a plane.

But if we use summation, the output is the vector [a+w1, b+w2], a single point of the plane.

So maybe the dot product has a bigger solution space than summation, and can fit the dataset better.
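A tiny toy computation (hypothetical numbers) makes the shape difference concrete: the dot product collapses the pair to one scalar score, while summation keeps a full vector:

```python
# toy query and key (hypothetical values)
q = [0.5, -1.0]   # [w1, w2]
k = [2.0, 3.0]    # [a, b]

dot = sum(wi * ki for wi, ki in zip(q, k))   # scalar: a*w1 + b*w2
vec_sum = [wi + ki for wi, ki in zip(q, k)]  # vector: [a+w1, b+w2]

print(dot)      # -2.0
print(vec_sum)  # [2.5, 2.0]
```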

After adding a bunch of print statements, I finally figured out how this works. The broadcasting mechanism is put to good use here: it lets every query interact with a whole set of keys to find its corresponding weights, which then produce the output values.

The dimensions here are worth thinking through carefully.

```
new_q = queries.unsqueeze(2)
new_k = keys.unsqueeze(1)
features = new_q + new_k
print("features shape:\n", new_q.shape, "\n", new_q, "\n",
      new_k.shape, "\n", new_k, "\n", features.shape, "\n features:", features)
```
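For anyone else tracing the shapes, here is a self-contained sketch with made-up sizes (batch 2, 3 queries, 4 keys, feature dim 8):

```python
import torch

# hypothetical sizes: batch=2, 3 queries, 4 keys, feature dim 8
queries = torch.zeros(2, 3, 8)
keys = torch.zeros(2, 4, 8)

new_q = queries.unsqueeze(2)   # (2, 3, 1, 8)
new_k = keys.unsqueeze(1)      # (2, 1, 4, 8)
features = new_q + new_k       # broadcasts to (2, 3, 4, 8)

print(features.shape)  # torch.Size([2, 3, 4, 8])
```

So every (query, key) pair gets its own summed feature vector, which is exactly what the additive scoring function needs.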

### Exercise 2:

I think that it is possible to pad q or k, whichever is shorter, to achieve the same length.
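A minimal sketch of that padding idea, assuming a 6-dim query and a 4-dim key (both sizes hypothetical):

```python
import torch
import torch.nn.functional as F

# hypothetical: query dim 6, key dim 4
q = torch.randn(6)
k = torch.randn(4)

# zero-pad k on the right so both live in a 6-dim space
k_padded = F.pad(k, (0, q.shape[0] - k.shape[0]))
print(k_padded.shape)  # torch.Size([6])
```

Note that zero-padding fixes the length mismatch, but unlike the learned W_q/W_k projections it imposes an arbitrary alignment between the two feature spaces.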

### Exercise 3:

If we compare the two formulas, one can find that additive attention incorporates 3 learnable parameters, W_{q}, W_{k} and W_{v}, while scaled dot-product attention does not introduce any learnable parameters.

I would suspect that, if we actually train the additive attention model, the 3 learnable parameters will shape it to behave like scaled dot-product attention. In other words, W_{q} and W_{k} will help the **query** project itself onto the **key**.

Only a suspicion…

Not really.

Suppose Wq = [wq1, wq2] and Wk = [wk1, wk2]; then the additive score would be w1*wq1 + w2*wq2 + a*wk1 + b*wk2. So it would also be a “line” (in your case, a plane). So I think the solution space is the same.

Why do we use 1e-6 instead of 0 in the softmax mask?

Oops… I confused 1e-6 with -1e6.

Actually, the implementation cheats ever so slightly by setting the values of v_𝑖, for 𝑖 > 𝑙, to zero. Moreover, it sets the attention weights to a large negative number, such as −10^6, in order to make their contribution to gradients and values vanish in practice.

IMHO, it doesn’t make sense and doesn’t match the code below. Can anyone help?
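For context, the trick the passage describes can be sketched like this (a simplified version, not the exact d2l `masked_softmax` implementation):

```python
import torch
from torch import nn

def masked_softmax_sketch(scores, valid_len):
    """Softmax over the last axis, ignoring positions >= valid_len."""
    # scores: (batch, no. of keys); valid_len: (batch,)
    maxlen = scores.shape[-1]
    mask = torch.arange(maxlen)[None, :] < valid_len[:, None]
    # Large negative value (-1e6, not 1e-6!) so exp() is ~0 there
    scores = scores.masked_fill(~mask, -1e6)
    return nn.functional.softmax(scores, dim=-1)

w = masked_softmax_sketch(torch.ones(1, 4), torch.tensor([2]))
print(w)  # masked positions get ~0 weight
```

Since exp(-1e6) underflows to 0, the padded positions contribute essentially nothing to the weighted sum or to the gradients.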

In section 11.3.4, first code block, last comment, you say

```
# Shape of values: (batch_size, no. of key-value pairs, value
# dimension)
```

It should be

```
# Shape of values: (batch_size, no. of query pairs, value
# dimension)
```

, right?

One thing that helped me understand the additive attention mechanism is that q and k may not just be different sizes. The dimensions of q and k can (and probably do, in something like NLP) represent totally different things. All you can say for sure is that there is some unknown space where they represent the same thing. In translation, for example, both q and k represent some stage in the meaning of a sentence.

Wq and Wk are transformations that put q and k in the same space. Obviously, we don’t know what those transformations are, but we let the machine learn them. They just let you add q and k meaningfully. I.e., Wq q = q’ and Wk k = k’, and now q’1 and k’1 represent the same thing.

The tanh does the same thing as dividing by sqrt(d): it makes the sum 0-centered with a variance of 1. Wv does the adding, with some learned weights, to basically focus on which parts of this new common space contribute the most to the sum. If q and k are already in the same space, this reduces to the scaled dot product.
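That view can be sketched in a few lines (the dimensions here are made up; `W_q`, `W_k`, `w_v` follow the book's additive-attention formula):

```python
import torch
from torch import nn

# hypothetical dims: query dim 5, key dim 7, shared hidden dim h=8
dq, dk, h = 5, 7, 8
W_q = nn.Linear(dq, h, bias=False)   # maps q into the common space
W_k = nn.Linear(dk, h, bias=False)   # maps k into the same space
w_v = nn.Linear(h, 1, bias=False)    # learned weighting of the summed features

q = torch.randn(dq)
k = torch.randn(dk)

# additive (MLP) attention score: w_v^T tanh(W_q q + W_k k)
score = w_v(torch.tanh(W_q(q) + W_k(k)))
print(score.shape)  # torch.Size([1])
```

The sum W_q q + W_k k only makes sense because both projections land in the same h-dimensional space, which is exactly the point of the post above.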