For the second question, could we say that it is greedy search? And to get more interesting results from the character language model, could we use np.random.choice to sample according to the softmax probability distribution at the output layer?

That is a good idea!
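A minimal sketch of what that sampling could look like (the function name `sample_from_logits` and the temperature parameter are my own additions, not part of the exercise code):

```python
import numpy as np

def sample_from_logits(logits, temperature=1.0, rng=None):
    """Sample one character index from the output layer's logits."""
    if rng is None:
        rng = np.random.default_rng()
    logits = np.asarray(logits, dtype=np.float64) / temperature
    # Softmax with max-subtraction for numerical stability.
    exp = np.exp(logits - logits.max())
    probs = exp / exp.sum()
    return rng.choice(len(probs), p=probs)
```

Lower temperatures push the distribution toward greedy (argmax) behavior; higher temperatures give more diverse, "interesting" samples.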

I have a question on the implementation of exercise 2.

For simplicity, take step 2 as an example (assume K = 2): we need to calculate P(A, y2 ∣ c) and P(C, y2 ∣ c). However, the context c depends on the previous decoder input; i.e., the choice of A over C at step 1 results in different hidden states, and thus a different context c.

So, is it necessary to save and restore the model's whole internal state (variables) separately for the calculation of P(A, y2 ∣ c) and P(C, y2 ∣ c) respectively?
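For concreteness, here is a sketch of the bookkeeping I have in mind: each beam hypothesis carries its own decoder hidden state forward, so the candidates branching from A and from C are expanded from different states (the names `decoder_step` and `beam_step` are made up for illustration, not from the exercise code):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def beam_step(hypotheses, decoder_step, K=2):
    """Expand every hypothesis by one decoder step, keep the top K.

    hypotheses: list of (tokens, log_prob, hidden_state) tuples.
    decoder_step: hypothetical function
        (last_token, hidden_state) -> (logits, new_hidden_state)
    Because each hypothesis stores its own hidden state, no global
    model state needs to be saved and restored between candidates.
    """
    candidates = []
    for tokens, logp, state in hypotheses:
        logits, new_state = decoder_step(tokens[-1], state)
        log_probs = np.log(softmax(logits))
        for tok, lp in enumerate(log_probs):
            candidates.append((tokens + [tok], logp + lp, new_state))
    # Keep the K highest-scoring expansions across all hypotheses.
    candidates.sort(key=lambda c: c[1], reverse=True)
    return candidates[:K]
```

Under this scheme, the answer to my own question would be: only the per-hypothesis hidden states need to be kept, not the full model variables.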

Many thanks!