For the second question, could we say that it is greedy search? To get interesting results from the character language model, could we use np.random.choice to sample according to the softmax probability distribution at the output layer?

That is a good idea!
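A minimal sketch of that idea (the vocabulary and logits below are made up for illustration): convert the output layer's logits into a softmax distribution, then sample the next character with np.random.choice instead of taking the argmax.

```python
import numpy as np

# Hypothetical logits from the output layer for a 5-character vocabulary.
vocab = ['a', 'b', 'c', 'd', ' ']
logits = np.array([2.0, 1.0, 0.5, 0.1, -1.0])

# Softmax (subtracting the max for numerical stability).
probs = np.exp(logits - logits.max())
probs /= probs.sum()

# Greedy search would take probs.argmax(); sampling instead gives
# more varied, often more interesting, generated text.
next_char = np.random.choice(vocab, p=probs)
```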

I have a question on the implementation of exercise 2.

For simplicity, take step 2 as an example (assume K=2): we need to calculate P(A, y2 | c) and P(C, y2 | c). However, the context c depends on the previous decoder input. I.e., the choice of A over C at step 1 will result in different hidden states, and thus a different context c.

So, is it necessary to save and restore the whole model's internal state (variables) separately for the calculation of P(A, y2 | c) and P(C, y2 | c) respectively?
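One common way to handle this, rather than saving and restoring a single global model state, is to let each beam hypothesis carry its own copy of the decoder's hidden state. The sketch below is a toy illustration of that bookkeeping (decoder_step is a hypothetical stand-in for one real decoder step, not the book's API):

```python
import numpy as np

def decoder_step(token, hidden):
    # Hypothetical stand-in for one decoder step: returns
    # (log-probabilities over a 4-token vocab, new hidden state).
    seed = hash((token, hidden.tobytes())) % (2 ** 32)
    rng = np.random.default_rng(seed)
    logits = rng.normal(size=4)
    probs = np.exp(logits) / np.exp(logits).sum()
    # Toy state update that depends on the chosen token, mimicking how
    # picking A vs. C at step 1 leads to different hidden states.
    return np.log(probs), hidden + 0.1 * token

K = 2
# Each beam is a tuple: (tokens_so_far, cumulative_log_prob, hidden_state).
beams = [([0], 0.0, np.zeros(3))]

for _ in range(3):  # expand for 3 decoding steps
    candidates = []
    for tokens, score, hidden in beams:
        log_probs, new_hidden = decoder_step(tokens[-1], hidden)
        for tok, lp in enumerate(log_probs):
            candidates.append((tokens + [tok], score + lp, new_hidden))
    # Keep the K best hypotheses; each keeps its own hidden state,
    # so no global save/restore of model variables is needed.
    beams = sorted(candidates, key=lambda b: b[1], reverse=True)[:K]
```

Note that the learned weights are shared across beams; only the recurrent state (and the tokens chosen so far) differ per hypothesis, so duplicating that small state is usually cheap.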

Many thanks!

After equation 9.8.4 the following is mentioned:

Since a longer sequence has more logarithmic terms in the summation of (9.8.4), the term in the denominator penalizes long sequences.

Given the use of the word penalizes, this seems to assume that the terms being added are positive, so that the summation grows with sequence length, all else equal. However, the logarithm of a probability is negative, so a longer sequence sums more negative terms, making the summation smaller, all else equal. This is quite evident if we think of the original product of probabilities (before taking the log): the more factors between 0 and 1 you multiply, the smaller the result. My point is that 1/L^alpha is actually aiding longer sequences, not penalizing them, by making the result less negative. So I think the wording could be improved here. If you agree, I'd be happy to open a pull request with a proposal.
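A quick numeric check of this point (the per-token probabilities below are invented for illustration): without normalization the longer sequence's log-probability sum is strictly more negative, and dividing by L^alpha shrinks that disadvantage, i.e. it compensates longer sequences rather than penalizing them.

```python
import math

alpha = 0.75
# Hypothetical per-token probabilities for a short and a long sequence.
short_seq = [0.5, 0.5]              # L = 2
long_seq = [0.5, 0.5, 0.5, 0.5]     # L = 4

def score(probs, alpha):
    # (1 / L^alpha) * sum of log-probabilities, as in Eq. (9.8.4).
    return sum(math.log(p) for p in probs) / len(probs) ** alpha

raw_short = sum(math.log(p) for p in short_seq)   # ≈ -1.386
raw_long = sum(math.log(p) for p in long_seq)     # ≈ -2.773
norm_short = score(short_seq, alpha)              # ≈ -0.824
norm_long = score(long_seq, alpha)                # ≈ -0.980

# The normalized gap between short and long is much smaller than the
# raw gap, so the 1/L^alpha factor works in the long sequence's favor.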

The notation of the equations ignores the order of words in the sentence. A joint probability only measures the probability of two events occurring together; the notation itself doesn't capture word order, so it is our responsibility to define the events with a notion of order.

While looking through the equations, I found this quite confusing: it gave me the feeling that the probability of "Hello World!" is the same as the probability of "World Hello!".

For example, equation 10.8.3 is better defined as P(S_1 = A, S_2 = B, S_3 = y_3 | C = c) = P(S_1 = A, S_2 = B | C = c) * P(S_3 = y_3 | S_1 = A, S_2 = B, C = c)

where S_1 is the event that the first word of the sentence takes a given value, S_2 the second word, etc.

The same comment applies to 9.1. Working with Sequences and 9.3. Language Models (Dive into Deep Learning 1.0.0-alpha0 documentation).
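To make the ordering concern concrete, here is a toy sketch (the unigram and bigram tables are invented) showing that once the conditional factors depend on the preceding word, the chain rule P(S_1) * P(S_2 | S_1) assigns different probabilities to the two orderings:

```python
# Hypothetical unigram and bigram probabilities, invented for illustration.
p_first = {'Hello': 0.6, 'World': 0.4}
p_next = {
    ('Hello', 'World'): 0.7,
    ('World', 'Hello'): 0.1,
}

def sentence_prob(words):
    # Chain rule: P(w1) * P(w2 | w1) * ...
    # The conditioning on the preceding word is what encodes order.
    p = p_first[words[0]]
    for prev, cur in zip(words, words[1:]):
        p *= p_next[(prev, cur)]
    return p

p_hw = sentence_prob(['Hello', 'World'])  # 0.6 * 0.7 = 0.42
p_wh = sentence_prob(['World', 'Hello'])  # 0.4 * 0.1 = 0.04
```

So the joint-probability notation is order-agnostic only on its face; spelling out the events as position-indexed random variables, as suggested above, makes the order explicit.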

Yes, agreed. Since x^(3/4) is smaller than x (for x > 1), the factor 1/x^(3/4) will boost the scores of longer sequences.