For the second question, could we say that it is greedy search? To get interesting results from the character language model, could we use np.random.choice to sample according to the softmax probability distribution at the output layer?

That is a good idea!
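
For anyone who wants to try it, here is a minimal sketch of the difference between greedy selection and sampling. The logits are made-up numbers rather than real model output, and `softmax`, `sample_next`, and the `temperature` parameter are illustrative names, not part of the book's code.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over a 1-D array of logits."""
    z = logits - np.max(logits)
    e = np.exp(z)
    return e / e.sum()

def sample_next(logits, temperature=1.0):
    """Draw one token index from the softmax distribution instead of taking the argmax."""
    probs = softmax(np.asarray(logits, dtype=np.float64) / temperature)
    return int(np.random.choice(len(probs), p=probs))

np.random.seed(0)
logits = np.array([2.0, 1.0, 0.1, -1.0])  # assumed output-layer scores for a 4-token vocabulary

greedy = int(np.argmax(logits))            # greedy search always picks this index
samples = [sample_next(logits) for _ in range(1000)]
counts = np.bincount(samples, minlength=len(logits))

print("greedy choice:", greedy)
print("sample counts per token:", counts)
```

Greedy search returns the same token every time, while sampling visits the less likely tokens in proportion to their probability, which is what makes the generated text more varied.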

I have a question on the implementation of exercise 2.

For simplicity, take step 2 as an example (assume K=2): we need to calculate P(A,y2∣c) and P(C,y2∣c). However, the context c depends on the previous decoder input, i.e., choosing A over C at step 1 results in different hidden states, and thus a different context c.

So, is it necessary to load and restore the whole internal state of the model (variables) separately when calculating P(A,y2∣c) and P(C,y2∣c), respectively?

Many thanks!
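
One possible way to look at this, as a sketch rather than the book's implementation: the model weights are shared by all hypotheses, so they do not need to be reloaded. Only the hidden state differs per hypothesis, and it can be stored next to each candidate. In the toy code below, `toy_step`, `W`, and `U` are hypothetical stand-ins for a real decoder step and its parameters.

```python
import numpy as np

def toy_step(token, state, W, U):
    """Hypothetical decoder step: returns (log-probs over the vocabulary, new hidden state)."""
    new_state = np.tanh(W[token] + U @ state)
    logits = W @ new_state
    log_probs = logits - np.log(np.sum(np.exp(logits)))
    return log_probs, new_state

def beam_search(init_state, bos, W, U, k=2, steps=3):
    # Each hypothesis carries its own (tokens, score, hidden state) triple.
    beams = [([bos], 0.0, init_state)]
    for _ in range(steps):
        candidates = []
        for tokens, score, state in beams:
            # The state passed in belongs to this hypothesis only.
            log_probs, new_state = toy_step(tokens[-1], state, W, U)
            for tok in np.argsort(log_probs)[-k:]:
                candidates.append(
                    (tokens + [int(tok)], score + float(log_probs[tok]), new_state)
                )
        # Keep the k highest-scoring candidates together with their states.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:k]
    return beams

rng = np.random.default_rng(0)
V, H = 4, 3                      # assumed vocabulary size and hidden size
W = rng.normal(size=(V, H))      # shared weights, never copied per hypothesis
U = rng.normal(size=(H, H))
beams = beam_search(np.zeros(H), bos=0, W=W, U=U, k=2, steps=3)

for tokens, score, _ in beams:
    print(tokens, round(score, 3))
```

In a real framework the K hidden states would usually be stacked along the batch dimension so that one forward pass scores all hypotheses at once, but the bookkeeping is the same idea.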

After equation 9.8.4 the following is mentioned:

Since a longer sequence has more logarithmic terms in the summation of (9.8.4), the term in the denominator penalizes long sequences.

Given the use of the word "penalizes", this seems to assume that the terms being added are positive, and thus that the summation will be larger for longer sequences, all else being equal. However, the logarithm of a probability is negative, so longer sequences sum more negative terms, which makes the summation smaller, all else being equal. This is quite evident if we think of the original multiplication of probabilities (before taking the log): the more terms between 0 and 1 you multiply, the smaller the result. My point is that 1/L^alpha is actually aiding longer sequences, not penalizing them, by making the result less negative. So I think the wording could be improved here. If you agree, I'd be happy to start a pull request with a proposal.
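
A small numeric check of this point, with made-up per-token probabilities and an assumed alpha of 0.75 (`normalized_score` is just an illustrative helper name):

```python
import math

def normalized_score(token_probs, alpha=0.75):
    """Sum of log-probabilities divided by L**alpha, where L is the sequence length."""
    total = sum(math.log(p) for p in token_probs)
    return total / len(token_probs) ** alpha

short_seq = [0.5, 0.5]             # L = 2, assumed token probabilities
long_seq = [0.6, 0.6, 0.6, 0.6]    # L = 4, assumed token probabilities

# alpha = 0 disables normalization and leaves the raw sum of logs.
raw_short = normalized_score(short_seq, alpha=0.0)
raw_long = normalized_score(long_seq, alpha=0.0)
norm_short = normalized_score(short_seq)
norm_long = normalized_score(long_seq)

print("raw sums:       ", raw_short, raw_long)
print("length-normed:  ", norm_short, norm_long)
```

With the raw sums the shorter sequence scores higher, while after dividing by L^alpha the longer sequence comes out ahead, so in this example the denominator helps the longer sequence.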