Transformer

https://d2l.ai/chapter_attention-mechanisms/transformer.html

input is 3 dimensions, why does norm_shape use the last two dimensions of the input in the example ,but the last one in the final trainning. normalized_shape is input.size()[1:], but in the trainning, normalized_shape is input.size()[-1]. what’s the difference? why change?

What’s your IDE?What’s your IDE?
@HeartSea15

I used pytorch. May I ask you a question about two different methods? Mxnet’s method is right and wrong in pytorch. The following changes should be made.

Hi @HeartSea15, thanks for your suggestion! Could you make a PR to our github?

Can LN be done in a single dim? such as tensor with shape [2,3,4], could the LN be done in norm_shape=shape[1] (3)?

Hi @foreverlms, great question. Yes Layernorm can be done at a single dim, which will be the last dimension. See more details at pytorch documentation: “If a single integer is used, it is treated as a singleton list, and this module will normalize over the last dimension which is expected to be of that specific size.

Book content is very helpful for those who want to learn Deep learning from scratch however
request you to please add more graphical presentation / images. It will helpful to understand concept easily.

1 Like

Hi @min_xu, I am not fully understand your question… but in general you can imagine that the attention layer has seen a lot example start with “BOS”. As a result, it will predict “BOS” as the first token.

thank you , i find the code process what i thinking. mask the data after the timesteps when the model is trainning


thank you !

In decoder block there is

        self.attention = d2l.MultiHeadAttention(key_size, query_size,
                                                value_size, num_hiddens,
                                                num_heads, dropout, use_bias)
        ...
        self.ffn = PositionWiseFFN(ffn_num_input, ffn_num_hiddens,
                                   num_hiddens)

According to the paper Attention Is All You Need, the ffn has width [d_model, ?, d_model] where d_model is the embedding size of a token. But in the above code num_hiddens is used as both the embedding size and the embedding size in attention block (the dimensionality of an input after the linear map in multi-head attention). It is not the best way to implement it if I’m correct.

class DecoderBlock
...
def forward(...)
      ...
      if self.training:
          ...
          dec_valid_lens = torch.arange(1, num_steps + 1, device=X.device).repeat(batch_size, 1)  
      else:
            dec_valid_lens = None

I’m not sure this is true. In both cases (train and inference) decoder self-attention needs two masks: padding and causal (other possible names: look-ahead or auto-regression). For convenience, they are usually combined into one tensor during calculations.