https://d2l.ai/chapter_attention-mechanisms-and-transformers/transformer.html
The input has 3 dimensions, so why does norm_shape use the last two dimensions of the input in the example, but only the last one in the final training? In the example, normalized_shape is input.size()[1:], but in training normalized_shape is input.size()[-1]. What's the difference? Why the change?
I used PyTorch. May I ask a question about the two different methods? MXNet's method is right in MXNet but wrong in PyTorch; the following changes should be made.
Can LN be done over a single dim? For example, for a tensor with shape [2, 3, 4], could the LN be done with norm_shape = shape[1] (i.e. 3)?
Hi @foreverlms, great question. Yes, LayerNorm can be done over a single dim, but it will be the last dimension. See more details in the PyTorch documentation: "If a single integer is used, it is treated as a singleton list, and this module will normalize over the last dimension which is expected to be of that specific size."
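A minimal sketch of the difference (the shapes are chosen only for illustration):

import torch
from torch import nn

X = torch.randn(2, 3, 4)  # (batch, num_steps, num_hiddens)

# Integer normalized_shape: normalize over the last dimension only (size 4 here);
# this is what the training code does with input.size()[-1].
ln_last = nn.LayerNorm(4)

# Tuple normalized_shape: normalize over the last two dimensions (3, 4);
# this is what the example does with input.size()[1:].
ln_last_two = nn.LayerNorm((3, 4))

print(ln_last(X).shape)      # torch.Size([2, 3, 4])
print(ln_last_two(X).shape)  # torch.Size([2, 3, 4])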
The book content is very helpful for those who want to learn deep learning from scratch; however, I request that you please add more graphical presentations / images. That would make the concepts easier to understand.
Hi @min_xu, I don't fully understand your question… but in general you can imagine that the attention layer has seen a lot of examples that start with "BOS". As a result, it will predict "BOS" as the first token.
Thank you, I found the code that does what I was thinking: it masks the data after each time step when the model is training.
Thank you!
In the decoder block there is:
self.attention = d2l.MultiHeadAttention(key_size, query_size,
                                        value_size, num_hiddens,
                                        num_heads, dropout, use_bias)
...
self.ffn = PositionWiseFFN(ffn_num_input, ffn_num_hiddens,
                           num_hiddens)
According to the paper Attention Is All You Need, the FFN has widths [d_model, ?, d_model], where d_model is the embedding size of a token. But in the code above, num_hiddens is used both as the embedding size and as the hidden size of the attention block (the dimensionality of an input after the linear map in multi-head attention). If I'm correct, this is not the best way to implement it.
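A minimal sketch of the point, using hypothetical parameter names d_model and d_ff (not the book's), where the output width is tied to the embedding size while the inner width stays independent:

import torch
from torch import nn

class PositionWiseFFN(nn.Module):
    """Position-wise FFN with widths [d_model, d_ff, d_model]."""
    def __init__(self, d_model, d_ff):
        super().__init__()
        self.dense1 = nn.Linear(d_model, d_ff)  # expand to the inner width
        self.relu = nn.ReLU()
        self.dense2 = nn.Linear(d_ff, d_model)  # project back to the model width

    def forward(self, X):
        return self.dense2(self.relu(self.dense1(X)))

ffn = PositionWiseFFN(d_model=512, d_ff=2048)  # the widths used in the original paper
print(ffn(torch.randn(2, 10, 512)).shape)      # torch.Size([2, 10, 512])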
class DecoderBlock:
    ...
    def forward(...):
        ...
        if self.training:
            ...
            dec_valid_lens = torch.arange(
                1, num_steps + 1, device=X.device).repeat(batch_size, 1)
        else:
            dec_valid_lens = None
I'm not sure this is true. In both cases (training and inference), decoder self-attention needs two masks: a padding mask and a causal mask (other possible names: look-ahead or auto-regressive mask). For convenience, they are usually combined into one tensor during the calculation.
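A minimal sketch of combining the two masks into one boolean tensor, assuming the convention that True marks positions to mask out (this is not how d2l.MultiHeadAttention consumes valid_lens, just an illustration of the idea):

import torch

def combined_mask(valid_lens, num_steps):
    """Combine causal and padding masks; True = position is masked out."""
    # Causal (look-ahead) mask: query i may not attend to keys j > i.
    causal = torch.triu(torch.ones(num_steps, num_steps), diagonal=1).bool()
    # Padding mask: keys beyond each sequence's valid length are masked.
    padding = torch.arange(num_steps)[None, :] >= valid_lens[:, None]
    # Broadcast to (batch_size, num_queries, num_keys) and combine.
    return causal[None, :, :] | padding[:, None, :]

mask = combined_mask(torch.tensor([3, 5]), num_steps=5)
print(mask.shape)  # torch.Size([2, 5, 5])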
When predicting, target tokens are fed into the decoder one by one. At this point num_steps is effectively 1, that is, each time you enter a tensor of shape torch.Size([1, 1, num_hiddens]). This results in the positional encoding of every token being the same, i.e., the positional encoding fails at prediction time.
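One possible way around this, sketched under the assumption of a sinusoidal encoding and a hypothetical step argument (not part of the book's PositionalEncoding), is to slice the encoding table starting at the current decoding step instead of always at position 0:

import torch
from torch import nn

class PositionalEncodingWithOffset(nn.Module):
    """Sinusoidal positional encoding that can start at an arbitrary position."""
    def __init__(self, num_hiddens, dropout, max_len=1000):
        super().__init__()
        self.dropout = nn.Dropout(dropout)
        self.P = torch.zeros(1, max_len, num_hiddens)
        pos = torch.arange(max_len, dtype=torch.float32).reshape(-1, 1)
        div = torch.pow(10000, torch.arange(0, num_hiddens, 2, dtype=torch.float32) / num_hiddens)
        self.P[:, :, 0::2] = torch.sin(pos / div)
        self.P[:, :, 1::2] = torch.cos(pos / div)

    def forward(self, X, step=0):
        # At prediction time, pass the index of the current target token as `step`
        # so a single-token input still receives its true position.
        X = X + self.P[:, step:step + X.shape[1], :].to(X.device)
        return self.dropout(X)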
The reason for multiplying each element by sqrt(num_hiddens) is the use of xavier_uniform_ to initialize the parameters of the Embedding: each element is very small compared to the positional encoding values in [-1, 1] (see the following figure), but no xavier_uniform_ is used here…
Reference: http://nlp.seas.harvard.edu/2018/04/03/attention.html
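A minimal sketch of the magnitude argument, under the assumption that the embedding weights are initialized with nn.init.xavier_uniform_ (as in the Annotated Transformer linked above); the numbers are illustrative:

import math
import torch
from torch import nn

vocab_size, num_hiddens = 10000, 512
emb = nn.Embedding(vocab_size, num_hiddens)
nn.init.xavier_uniform_(emb.weight)  # bound = sqrt(6 / (vocab_size + num_hiddens)) ~ 0.024

tokens = torch.randint(0, vocab_size, (2, 10))
raw = emb(tokens)
scaled = raw * math.sqrt(num_hiddens)  # rescale before adding the positional encoding in [-1, 1]

print(raw.abs().mean().item())     # ~ 0.01, tiny next to the positional encoding
print(scaled.abs().mean().item())  # ~ 0.3, comparable to the positional encoding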