The Transformer Architecture

Thank you, I found that the code does what I was thinking: it masks the data after the current time step when the model is training.


Thank you!

In the decoder block there is

        self.attention = d2l.MultiHeadAttention(key_size, query_size,
                                                value_size, num_hiddens,
                                                num_heads, dropout, use_bias)
        ...
        self.ffn = PositionWiseFFN(ffn_num_input, ffn_num_hiddens,
                                   num_hiddens)

According to the paper Attention Is All You Need, the FFN has widths [d_model, d_ff, d_model], where d_model is the embedding size of a token. But in the code above, num_hiddens is used both as the embedding size and as the hidden size of the attention block (the dimensionality of an input after the linear map in multi-head attention). If I'm correct, this is not the best way to implement it.
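For reference, the position-wise FFN used here looks roughly like the following (a paraphrase of the d2l class; argument names may differ between versions), so the inner width is ffn_num_hiddens and the output width is forced back to num_hiddens:

    from torch import nn

    class PositionWiseFFN(nn.Module):
        """Position-wise FFN: num_hiddens -> ffn_num_hiddens -> num_hiddens."""
        def __init__(self, ffn_num_input, ffn_num_hiddens, ffn_num_outputs):
            super().__init__()
            self.dense1 = nn.Linear(ffn_num_input, ffn_num_hiddens)    # d_model -> d_ff
            self.relu = nn.ReLU()
            self.dense2 = nn.Linear(ffn_num_hiddens, ffn_num_outputs)  # d_ff -> d_model

        def forward(self, X):
            return self.dense2(self.relu(self.dense1(X)))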

class DecoderBlock:
    ...
    def forward(...):
        ...
        if self.training:
            ...
            dec_valid_lens = torch.arange(
                1, num_steps + 1, device=X.device).repeat(batch_size, 1)
        else:
            dec_valid_lens = None

I’m not sure this is true. In both cases (training and inference), the decoder self-attention needs two masks: a padding mask and a causal mask (also called a look-ahead or auto-regressive mask). For convenience, they are usually combined into one tensor during the computation.
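For illustration, here is a minimal sketch of combining the two masks into one boolean tensor (the helper name combined_mask is hypothetical, not part of d2l):

    import torch

    def combined_mask(valid_lens, num_steps):
        # valid_lens: (batch_size,) number of real (non-padding) tokens per sequence.
        # Returns a bool mask of shape (batch_size, num_steps, num_steps) where
        # True means the query position may attend to that key position.
        positions = torch.arange(num_steps, device=valid_lens.device)
        padding = positions[None, None, :] < valid_lens[:, None, None]   # hide padded keys
        causal = torch.tril(torch.ones(num_steps, num_steps,
                                       dtype=torch.bool, device=valid_lens.device))
        return padding & causal                                          # hide future keys too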

Hi @HeartSea15, are you using PyCharm? Awesome color scheme, may I know how to get it? Thanks.

When predicting, target tokens are fed into the decoder one by one. At that point num_steps is effectively 1, i.e., at each step you pass in a tensor of shape torch.Size([1, 1, num_hiddens]). As a result, the positional encoding added to each token is the same, i.e., the positional encoding fails at prediction time.
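To see why, recall that d2l's PositionalEncoding slices its table by the length of the current input (the line below paraphrases its forward method); with a (1, 1, num_hiddens) input at every decoding step, the same first row of the table is added each time:

    # X has shape (1, 1, num_hiddens) at every prediction step, so
    # self.P[:, :X.shape[1], :] is always P[:, :1, :], i.e. position 0.
    X = X + self.P[:, :X.shape[1], :].to(X.device)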

The reason for multiplying each element by sqrt(num_hiddens) is that xavier_uniform_ is used to initialize the parameters of the embedding, so each element is very small compared to -1/1 (see the following figure). But no xavier_uniform_ is used here…

Reference: http://nlp.seas.harvard.edu/2018/04/03/attention.html

I don't understand the Add&Norm in this piece of code, could someone explain?

I figured it out: here X is first embedded, then rescaled, and then passed into the positional encoding object, after which the Add&Norm operation is applied.
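The line in question (paraphrased from d2l's TransformerEncoder.forward) is:

    import math
    # Embed X, rescale by sqrt(num_hiddens), then add the positional encoding;
    # the Add&Norm sublayers themselves are applied later, inside each block,
    # around the attention and FFN sublayers.
    X = self.pos_encoding(self.embedding(X) * math.sqrt(self.num_hiddens))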

According to the paper (sec. 3.4), the weights of the embedding layer are multiplied by sqrt(d_model).

As we described earlier in this section, in the masked multi-head decoder self-attention (the first sublayer), queries, keys, and values all come from the outputs of the previous decoder layer…

I thought queries were from the previous decoder layer, while keys and values were from the last encoder layer. Is the above statement true?

Same doubt. Have you figured out how to resolve this? Or could you provide any more information to help me understand it better?
Thanks

My solutions to the exercises: 11.7


Since we use the fixed positional encoding whose values are always between -1 and 1, we multiply values of the learnable input embeddings by the square root of the embedding dimension to rescale before summing up the input embedding and the positional encoding.

Intuitively it makes sense to balance the magnitudes of the input embedding and the positional encoding before adding them. But why should we scale the input embedding by exactly the square root of its dimension?
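There is no rigorous derivation in the paper; a common reading is that with a Xavier-style initialization (as in the Annotated Transformer) the embedding elements are on the order of 1/sqrt(d), so multiplying by sqrt(d) brings them up to the same order of magnitude as the [-1, 1] positional encodings. A quick numerical check (vocabulary size and dimension below are arbitrary, purely for illustration):

    import math
    import torch
    from torch import nn

    d, vocab = 512, 10000                     # arbitrary sizes for illustration
    emb = nn.Embedding(vocab, d)
    nn.init.xavier_uniform_(emb.weight)       # Annotated-Transformer-style init
    x = emb(torch.randint(0, vocab, (4, 20)))
    print(x.std().item())                     # ~0.014, tiny next to sin/cos values in [-1, 1]
    print((x * math.sqrt(d)).std().item())    # ~0.3, comparable to the positional encoding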


The second sublayer uses the encoder output as keys and values, while the first sublayer uses q, k, and v from the previous decoder layer.
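Paraphrasing the relevant lines of d2l's DecoderBlock.forward, which show exactly this split:

    # First sublayer: masked decoder self-attention -- q, k, v all from the decoder side.
    X2 = self.attention1(X, key_values, key_values, dec_valid_lens)
    Y = self.addnorm1(X, X2)
    # Second sublayer: encoder-decoder attention -- queries from the decoder,
    # keys and values from the encoder outputs.
    Y2 = self.attention2(Y, enc_outputs, enc_outputs, enc_valid_lens)
    Z = self.addnorm2(Y, Y2)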

Hi,

I don't quite understand state[2][i] in the TransformerDecoder. I think it just concatenates all of the past queries and the current query of the first attention sublayer at the i-th decoder block. Why is it needed? Why not just use the current query? Thank you very much!

The reason for multiplying each element by sqrt(num_hiddens) is that xavier_uniform_ is used to initialize the parameters of the embedding, so each element is very small compared to -1/1.

I did not find a statement in the post you linked indicating that the reason is the use of xavier_uniform_; it does not explain the reason. I think your screenshot just shows that the model uses xavier_uniform_ to initialize its weights.

Below is a screenshot of this part in the link you provided.

I also don't understand state[2][i] in DecoderBlock. Why concatenate X and state[2][self.i] into key_values and feed the result into the multi-head attention as (X, key_values, key_values)?

The logic in the training step and the prediction step is coherent, i.e., the input 'token i' attends over the key-value pairs [bos, token 1, …, token i-1, token i] to predict 'token i+1'.
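Concretely, the caching branch in DecoderBlock.forward (paraphrased from d2l) implements exactly this: during prediction, state[2][self.i] accumulates the representations of all previously decoded tokens at block i, so the current one-token query can still attend over the whole prefix:

    if state[2][self.i] is None:
        key_values = X                                        # training: X holds all steps
    else:
        key_values = torch.cat((state[2][self.i], X), dim=1)  # prediction: append current token
    state[2][self.i] = key_values                             # cache for the next step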

I think it means that the self-attention reuses the previously predicted tokens/features (the past context, not the entire context) to inform the next prediction.