Self-Attention and Positional Encoding

https://d2l.ai/chapter_attention-mechanisms-and-transformers/self-attention-and-positional-encoding.html

`self.P[:, :, 1::2] = torch.cos(X)` in Section 10.6.3 breaks if `encoding_dim` is an odd number.
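In case it helps, here is a minimal sketch of one way to handle an odd embedding size. The names roughly follow the book's `PositionalEncoding`, with `num_hiddens` standing in for what the post calls `encoding_dim`; the trimming of the cosine columns is my own workaround, not an official fix:

```python
import torch
from torch import nn

class PositionalEncoding(nn.Module):
    """Sinusoidal positional encoding that also works for odd num_hiddens."""
    def __init__(self, num_hiddens, dropout, max_len=1000):
        super().__init__()
        self.dropout = nn.Dropout(dropout)
        self.P = torch.zeros((1, max_len, num_hiddens))
        X = torch.arange(max_len, dtype=torch.float32).reshape(-1, 1) / torch.pow(
            10000, torch.arange(0, num_hiddens, 2, dtype=torch.float32) / num_hiddens)
        # X has ceil(num_hiddens / 2) columns, but the odd-index slice below has
        # only floor(num_hiddens / 2) columns, so trim X before taking cosines.
        self.P[:, :, 0::2] = torch.sin(X)
        self.P[:, :, 1::2] = torch.cos(X[:, :num_hiddens // 2])

    def forward(self, X):
        X = X + self.P[:, :X.shape[1], :].to(X.device)
        return self.dropout(X)
```

With `num_hiddens = 7`, the two assignments now line up (4 sine columns, 3 cosine columns).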

This chapter says that the maximum path length of a CNN is $\mathcal{O}(n/k)$. I think it should be $\mathcal{O}(\log_k n)$.
This blog also talks about the maximum path length: https://medium.com/analytics-vidhya/transformer-vs-rnn-and-cnn-18eeefa3602b


@Zhaowei_Wang $\mathcal{O}(\log_k n)$ applies to dilated convolutions, while the chapter discusses regular convolutions.
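To make the distinction concrete, here is a small sketch (my own illustration, not code from the book) of how the receptive field grows when stacking regular stride-1 convolutions versus convolutions whose dilation is multiplied by $k$ at every layer:

```python
def receptive_field_regular(num_layers, k):
    """Receptive field of num_layers stacked stride-1 convolutions with kernel
    size k: it grows linearly, so covering n inputs needs O(n/k) layers."""
    return 1 + num_layers * (k - 1)

def receptive_field_dilated(num_layers, k):
    """Receptive field when the dilation at layer l is k**l: it grows
    exponentially (k**num_layers), so covering n inputs needs O(log_k n) layers."""
    rf = 1
    for layer in range(num_layers):
        rf += (k - 1) * k ** layer
    return rf

for depth in range(1, 5):
    print(depth, receptive_field_regular(depth, 3), receptive_field_dilated(depth, 3))
# At depth 4 with k = 3: regular convolutions cover 9 positions, dilated cover 81.
```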


According to the description "Since the queries, keys, and values come from the same place, this performs self-attention", maybe formula 10.6.1 should be $y_i = f(x_i, (x_1, y_1), (x_2, y_2), \ldots)$? In my opinion, the authors may have made a mistake here.

I think you are correct. The author very likely made a mistake here. A CNN can be regarded as a $k$-ary tree with $n$ leaf nodes, so $k^h = n$ and the maximum path length is $\mathcal{O}(\log_k n)$.
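Written out, the depth $h$ follows directly (this assumes, as above, that one convolutional layer behaves like one level of a $k$-ary tree, which is exactly the point being debated):

$$k^h = n \;\Longrightarrow\; h = \log_k n \;\Longrightarrow\; \text{maximum path length} = \mathcal{O}(\log_k n).$$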

@chgwan In Fig. 10.6.1, we need a path of length 4 (hence 4 convolutional layers) so that the last (5th) feature in the last layer has a route to the first feature $x_1$. I guess the exact formula is `math.ceil((n - 1) / math.floor(k / 2))`, which is $\mathcal{O}(n/k)$.
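A quick way to sanity-check the proposed formula on the figure's case (this just evaluates the expression above; it does not prove it is the right bound in general):

```python
import math

def max_path_length(n, k):
    # Formula proposed in the comment above: each convolutional layer lets
    # information travel at most floor(k / 2) positions toward x_1.
    return math.ceil((n - 1) / (k // 2))

print(max_path_length(5, 3))  # 4, matching the path of length 4 in Fig. 10.6.1
```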

What does the term "parallel computation" mean in this post? Does it mean that all tokens in a sequence are computed at once?

I wonder why positional encodings have to be added to X. Why not concatenation? And why does it still work with addition?

My solutions to the exercises: 11.6


But I think it's $\lceil n/k \rceil$, e.g. $\lceil 5/3 \rceil = 2$ and $\lceil 7/3 \rceil = 3$; however, it doesn't work if $n$ is even while $k$ is odd.