Bahdanau Attention

Because the last hidden state of the encoder serves as the initial hidden state of the decoder, $$s_0$$.
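
A minimal sketch of that initialization, assuming PyTorch-style GRU modules (the names here are illustrative, not the book's exact code):

```python
import torch
from torch import nn

# Illustrative sketch: the encoder's final hidden state is handed to the
# decoder as its initial state s_0.
encoder = nn.GRU(input_size=8, hidden_size=16, num_layers=2)
decoder = nn.GRU(input_size=8, hidden_size=16, num_layers=2)

src = torch.randn(10, 4, 8)            # (num_steps, batch, embed_size)
enc_outputs, enc_state = encoder(src)  # enc_state: (layers, batch, hidden)

s_0 = enc_state                        # s_0 = encoder's last hidden state
tgt = torch.randn(7, 4, 8)
dec_outputs, _ = decoder(tgt, s_0)
```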

My solutions to the exercises: 11.4


How did you change the GRU to an LSTM? I got an error.
Thanks for your help.
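
One common pitfall when swapping `nn.GRU` for `nn.LSTM` (a guess at the cause, not a diagnosis of your exact error): the LSTM's state is a `(hidden, cell)` tuple, so code that builds the attention query by indexing the state tensor directly needs adjusting. A minimal sketch:

```python
import torch
from torch import nn

# nn.GRU returns a single state tensor; nn.LSTM returns a (h, c) tuple.
gru, lstm = nn.GRU(8, 16), nn.LSTM(8, 16)
x = torch.randn(5, 4, 8)

_, gru_state = gru(x)                   # tensor: (layers, batch, hidden)
gru_query = gru_state[-1].unsqueeze(1)  # works directly with a GRU

_, lstm_state = lstm(x)                 # tuple (h, c), each (layers, batch, hidden)
lstm_query = lstm_state[0][-1].unsqueeze(1)  # take h out of the tuple first
# Also remember to carry the full (h, c) tuple as the decoder state.
```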

In this chapter, I got confused when comparing the implementation with the one in chapter 10.7. Specifically, this chapter uses a for loop in the decoder to predict each token using the hidden state from the previous step, instead of the parallel computation over all time steps used in the previous chapter. This may lead to significantly higher training time.
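
For what it's worth, here is a minimal sketch of why the loop is hard to avoid (illustrative names, with a simplified dot-product attention standing in for the book's additive attention): the query at step t is the hidden state produced at step t-1, so the time steps form a sequential dependency that the plain decoder in 10.7 does not have.

```python
import torch
from torch import nn

embed_size, num_hiddens, batch, steps = 8, 16, 4, 7
rnn = nn.GRU(embed_size + num_hiddens, num_hiddens)

def attention(query, enc_outputs):
    # Placeholder attention: score, softmax, weighted sum of encoder outputs.
    scores = query @ enc_outputs.transpose(1, 2)        # (batch, 1, src_len)
    return torch.softmax(scores, dim=-1) @ enc_outputs  # (batch, 1, hidden)

enc_outputs = torch.randn(batch, 10, num_hiddens)  # encoder outputs, batch-first
state = torch.zeros(1, batch, num_hiddens)         # s_0 from the encoder
embeds = torch.randn(steps, batch, embed_size)     # target embeddings

outputs = []
for x in embeds:                            # one step at a time: each query
    query = state[-1].unsqueeze(1)          # needs the previous step's state
    context = attention(query, enc_outputs)
    rnn_in = torch.cat([x.unsqueeze(0), context.permute(1, 0, 2)], dim=-1)
    out, state = rnn(rnn_in, state)
    outputs.append(out)
result = torch.cat(outputs, dim=0)          # (steps, batch, num_hiddens)
```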