Bahdanau Attention


In chapter 10.4.1:

  • where the decoder hidden state s_{t’-1} at time step t’-1 is the query, and the encoder hidden states h_t are both the keys and values,

However, in the code implementation:

  • context = self.attention(
    query, enc_outputs, enc_outputs, enc_valid_lens)

I think that the keys and values are enc_outputs instead of the encoder hidden states h_t. Is it a mistake?
Please correct me if I am wrong!

Yes, I’m confused too!

I think you have already pointed out the difference: s_{t’-1} is the decoder hidden state. It is passed as hidden_state in
"out, hidden_state = self.rnn(x.permute(1, 0, 2), hidden_state)"
I don’t think the encoder hidden state h_t is used as a decoder hidden state. It is clearly stated that h_t is a key.

in my opinion
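
For what it's worth, here is a small check of the enc_outputs vs. h_t question. It is only a sketch, assuming the encoder is a unidirectional nn.GRU (the sizes and variable names below are made up for illustration, not taken from the book's code). For such an RNN, the output at each time step is the top-layer hidden state at that step, so passing enc_outputs as keys and values looks the same as passing the h_t:

```python
import torch
from torch import nn

torch.manual_seed(0)
# Hypothetical sizes, chosen only for this illustration.
num_steps, batch_size, embed_size, num_hiddens, num_layers = 5, 2, 4, 8, 2
rnn = nn.GRU(embed_size, num_hiddens, num_layers)
X = torch.randn(num_steps, batch_size, embed_size)

# Run the whole sequence at once: this is what an encoder returns.
enc_outputs, final_state = rnn(X)  # (num_steps, batch, hidden), (layers, batch, hidden)

# Run one step at a time and record the top-layer hidden state h_t.
state = None
top_hidden = []
for t in range(num_steps):
    _, state = rnn(X[t:t + 1], state)
    top_hidden.append(state[-1])   # last layer's hidden state at step t
top_hidden = torch.stack(top_hidden)  # (num_steps, batch, hidden)

# The per-step outputs coincide with the per-step top-layer hidden states.
same = torch.allclose(enc_outputs, top_hidden, atol=1e-5)
print(same)
```

So under this assumption the text and the code agree: enc_outputs is just the collection of h_t for all time steps of the last layer.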

I wonder why there is an arrow pointing from the recurrent layer of the encoder toward the recurrent layer of the decoder. I thought the context variable was already handled by the attention mechanism?
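
One possible reading of that arrow, sketched in code: the encoder's final hidden state is used once, to initialize the decoder's hidden state, while attention recomputes the context at every decoding step. This is a simplified sketch and not the book's class. It uses plain scaled dot-product scoring instead of additive attention to stay short, feeds random tensors instead of embeddings, and all names and sizes are hypothetical:

```python
import torch
from torch import nn

torch.manual_seed(0)
# Hypothetical sizes, chosen only for this illustration.
num_steps, batch_size, embed_size, num_hiddens, num_layers = 5, 2, 4, 8, 2
encoder = nn.GRU(embed_size, num_hiddens, num_layers)
# The decoder input at each step is [context ; embedded token].
decoder = nn.GRU(embed_size + num_hiddens, num_hiddens, num_layers)

src = torch.randn(num_steps, batch_size, embed_size)  # stand-in for source embeddings
tgt = torch.randn(num_steps, batch_size, embed_size)  # stand-in for target embeddings

enc_outputs, enc_state = encoder(src)
keys = values = enc_outputs.permute(1, 0, 2)  # (batch, num_steps, hidden)

# The arrow: the decoder RNN starts from the encoder's final hidden state.
dec_state = enc_state

outputs = []
for x in tgt:  # x: (batch, embed)
    # Query is the decoder's top-layer hidden state from the previous step.
    query = dec_state[-1].unsqueeze(1)                                    # (batch, 1, hidden)
    scores = torch.bmm(query, keys.transpose(1, 2)) / num_hiddens ** 0.5  # (batch, 1, num_steps)
    weights = torch.softmax(scores, dim=-1)
    context = torch.bmm(weights, values)                                  # (batch, 1, hidden)
    step_input = torch.cat((context, x.unsqueeze(1)), dim=-1)             # (batch, 1, embed + hidden)
    out, dec_state = decoder(step_input.permute(1, 0, 2), dec_state)
    outputs.append(out)
outputs = torch.cat(outputs, dim=0)  # (num_steps, batch, hidden)
print(outputs.shape)
```

In this reading the two paths carry different things: the initial state gives the decoder RNN somewhere to start from, and the context is a fresh weighted sum over enc_outputs at each step.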

I found out that we can save a lot of GPU memory by using:


in the forward function of the Seq2SeqAttentionDecoder class, inside the loop over num_steps.
More information: Michael Jungo's answer

Changing from additive attention to dot-product attention increases the training speed. Changing from GRU to LSTM decreases the training speed.
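
To make the comparison concrete, here is a rough sketch of the two scoring functions side by side. The layer names (W_q, W_k, w_v) and sizes are assumptions for illustration only. Additive scoring needs learned projections plus a tanh over every query-key pair, while scaled dot-product scoring is a single batched matrix multiply, which is one plausible reason it trains faster:

```python
import math
import torch
from torch import nn

torch.manual_seed(0)
# Hypothetical sizes: d is the query/key size, h the additive hidden size.
batch, n_q, n_kv, d, h = 2, 1, 6, 8, 16
queries = torch.randn(batch, n_q, d)
keys = torch.randn(batch, n_kv, d)

# Additive scoring: w_v^T tanh(W_q q + W_k k), with learned parameters.
W_q = nn.Linear(d, h, bias=False)
W_k = nn.Linear(d, h, bias=False)
w_v = nn.Linear(h, 1, bias=False)
# Broadcast so that every query is combined with every key.
features = torch.tanh(W_q(queries).unsqueeze(2) + W_k(keys).unsqueeze(1))  # (batch, n_q, n_kv, h)
additive_scores = w_v(features).squeeze(-1)                                # (batch, n_q, n_kv)

# Scaled dot-product scoring: q . k / sqrt(d), no extra parameters.
dot_scores = torch.bmm(queries, keys.transpose(1, 2)) / math.sqrt(d)       # (batch, n_q, n_kv)

# Either set of scores is turned into attention weights with a softmax.
additive_weights = torch.softmax(additive_scores, dim=-1)
dot_weights = torch.softmax(dot_scores, dim=-1)
print(additive_scores.shape, dot_scores.shape)
```

Note that dot-product scoring requires queries and keys to have the same size d, whereas additive scoring can handle different sizes through the two projections.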