For the second question, could we say that it is greedy search? To get interesting results from the character language model, could we use np.random.choice to sample according to the softmax probability distribution at the output layer?
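
To make the contrast concrete, here is a minimal sketch of greedy selection versus sampling from the softmax distribution. It assumes the output layer yields a vector of logits for one time step; the vocabulary, the logits, and the `sample_next_token` helper are made up for illustration. `Generator.choice` is NumPy's newer equivalent of `np.random.choice`.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over a 1-D array of logits."""
    shifted = logits - np.max(logits)
    exp = np.exp(shifted)
    return exp / exp.sum()

def sample_next_token(logits, rng, greedy=False):
    """Return a token index: argmax if greedy, else a draw from softmax(logits)."""
    probs = softmax(np.asarray(logits, dtype=float))
    if greedy:
        return int(np.argmax(probs))
    # Draw one index with probability proportional to the softmax output.
    return int(rng.choice(len(probs), p=probs))

# Hypothetical 4-character vocabulary and made-up logits for one time step.
vocab = ['a', 'b', 'c', 'd']
logits = [2.0, 1.0, 0.5, 0.1]
rng = np.random.default_rng(0)

greedy_char = vocab[sample_next_token(logits, rng, greedy=True)]
sampled_chars = [vocab[sample_next_token(logits, rng)] for _ in range(10)]
```

Greedy selection always returns the highest-probability character, while the sampled list can contain lower-probability characters, which is where the variety comes from.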

That is a good idea!

I have a question on the implementation of exercise 2.

To simplify, take step 2 as an example (assume K=2): we need to calculate P(A,y2∣c) and P(C,y2∣c). However, the context c depends on the previous decoder input. That is, choosing A over C at step 1 will result in different hidden states, and thus a different context c.

So, is it necessary to save and restore the whole internal state of the model (its variables) separately for the calculation of P(A,y2∣c) and P(C,y2∣c)?

Many thanks!
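
One common way to handle this, sketched below under assumptions, is to store each hypothesis's decoder state next to its token prefix, so nothing has to be saved and restored globally. `toy_step` is a hypothetical stand-in for a single decoder step (it is not the book's model): it returns next-token log-probabilities and a new state that depends on the token fed in.

```python
import math

VOCAB = ['A', 'B', 'C']
TOKEN_ID = {'<bos>': 0, 'A': 1, 'B': 2, 'C': 3}

def toy_step(state, token):
    """Hypothetical decoder step: returns (log_probs, new_state).

    The state is just the tuple of tokens fed in so far, standing in for
    the hidden state (and hence the context) of a real decoder.
    """
    new_state = state + (token,)
    # Made-up scores that depend on the state and the input token, so
    # different prefixes give different next-token distributions.
    scores = [1.0 + ((len(new_state) * (i + 1) + TOKEN_ID[token]) % 3)
              for i in range(len(VOCAB))]
    total = sum(scores)
    log_probs = {tok: math.log(s / total) for tok, s in zip(VOCAB, scores)}
    return log_probs, new_state

def beam_search(k, num_steps):
    """Keep the k best (log_prob, tokens, state) hypotheses at each step."""
    # Every beam entry carries its own decoder state.
    beams = [(0.0, ['<bos>'], ())]
    for _ in range(num_steps):
        candidates = []
        for score, tokens, state in beams:
            # Advance this hypothesis from its own state and last token.
            log_probs, new_state = toy_step(state, tokens[-1])
            for tok, lp in log_probs.items():
                candidates.append((score + lp, tokens + [tok], new_state))
        # Keep the k highest-scoring candidates across all hypotheses.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = candidates[:k]
    return beams

best = beam_search(k=2, num_steps=3)
```

In a batched implementation the same idea usually appears as selecting rows of the hidden-state tensor by the indices of the surviving beams, rather than reloading the model.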

After equation 9.8.4 the following is mentioned:

Since a longer sequence has more logarithmic terms in the summation of (9.8.4), the term in the denominator penalizes long sequences.

Given the use of the word "penalizes", this seems to assume that the terms being added are positive, and thus that the summation will be larger for longer sequences, all else being equal. However, the logarithm of a probability is negative, so longer sequences sum more negative terms, which makes the summation smaller, all else being equal. This is quite evident if we think of the original multiplication of probabilities (before taking the log): the more terms between 0 and 1 you multiply, the smaller the result will be. My point is that 1/L^alpha is actually aiding longer sequences, not penalizing them, by making the result less negative. So I think the wording could be improved here. If you agree, I'd be happy to start a pull request with a proposal.
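
A small numeric check of this argument, as a sketch with made-up numbers: every token is assumed to have probability 0.5, alpha is assumed to be 0.75, and `normalized_score` is a hypothetical helper, not code from the book.

```python
import math

def normalized_score(token_probs, alpha=0.75):
    """Return (sum of log-probs, that sum divided by L**alpha)."""
    length = len(token_probs)
    log_sum = sum(math.log(p) for p in token_probs)
    return log_sum, log_sum / (length ** alpha)

# Made-up per-token probabilities: every token has probability 0.5.
short_raw, short_norm = normalized_score([0.5] * 2)
long_raw, long_norm = normalized_score([0.5] * 8)

# Gap between the short and long sequence before and after normalization.
raw_gap = short_raw - long_raw
norm_gap = short_norm - long_norm
```

The raw sum is more negative for the longer sequence, and dividing by L^alpha moves both scores toward zero while shrinking the gap between them. With alpha = 1 the two normalized scores would coincide for this example.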

The notation in the equations ignores the order of words in the sentence. A joint probability only measures the probability of events occurring together; the notation does not account for the order of words in the sentence, so it is our responsibility to define the events with a notion of order.

While looking through the equations, I found them quite confusing; they gave me the impression that the probability of “Hello World!” is the same as the probability of “World Hello!”.

For example, equation 10.8.3 is better defined as P(S_1 = A, S_2 = B, S_3 = y_3 | C = c) = P(S_1 = A, S_2 = B | C = c) * P(S_3 = y_3 | S_1 = A, S_2 = B, C = c)

where S_1 denotes the first word of the sentence, S_2 the second word, and so on.

The same comment applies to Section 9.1 (Working with Sequences) and Section 9.3 (Language Models) of the Dive into Deep Learning 1.0.0-alpha0 documentation.

Yes, agreed. Since raising x to the power 3/4 gives a result smaller than x (for x > 1), multiplying by 1/x^(3/4) will raise the scores of the longer sequences.

Good point. Another way to see it, which is easier for me, is to undo the log (exponentiate). That shows that this is a product of probabilities, so the longer the sequence, the smaller the result, possibly so close to zero that it would underflow. We then take the Lth root of this tiny number, which ‘inflates’ it back again: the bigger L is, the larger the MLE result.
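
The exponentiation view can be checked numerically. This sketch assumes alpha = 1 (so the normalization is a plain average of log-probabilities) and uses made-up token probabilities; `geometric_mean_prob` is a hypothetical helper.

```python
import math

def geometric_mean_prob(token_probs):
    """exp of the length-averaged log-probability (the alpha = 1 case)."""
    length = len(token_probs)
    avg_log = sum(math.log(p) for p in token_probs) / length
    return math.exp(avg_log)

probs = [0.9, 0.5, 0.7, 0.6]  # made-up per-token probabilities

product = math.prod(probs)               # plain product of probabilities
lth_root = product ** (1 / len(probs))   # Lth root of that product
via_logs = geometric_mean_prob(probs)    # same quantity, computed in log space
```

The last two values agree up to floating-point error, and both are larger than the raw product, which is the ‘inflating’ effect described above.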

So L^alpha is not penalizing long sequences; making the hyperparameter alpha smaller than 0.75 does.

Even though the loss equation ignores the order of the tokens (multiplication is commutative), the conditional probabilities do not: p(world|hello) >> p(hello|world).
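
A toy illustration of this point, using invented bigram probabilities (they are not estimated from any corpus) and a hypothetical `sequence_prob` helper that applies the chain rule:

```python
# Made-up bigram probabilities P(next | previous), for illustration only.
BIGRAM = {
    ('<bos>', 'hello'): 0.6,
    ('<bos>', 'world'): 0.4,
    ('hello', 'world'): 0.9,
    ('hello', 'hello'): 0.1,
    ('world', 'hello'): 0.05,
    ('world', 'world'): 0.95,
}

def sequence_prob(tokens):
    """Chain-rule probability of a token sequence under the toy bigram model."""
    prob = 1.0
    prev = '<bos>'
    for tok in tokens:
        # Each factor is conditioned on the token that came before it.
        prob *= BIGRAM[(prev, tok)]
        prev = tok
    return prob

p_hello_world = sequence_prob(['hello', 'world'])  # 0.6 * 0.9
p_world_hello = sequence_prob(['world', 'hello'])  # 0.4 * 0.05
```

The factors can be multiplied in any order, but which factors appear depends on the word order, so the two sentences get different probabilities.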

I think ‘penalizing’ here refers to fairness between short and long predictions, i.e., averaging over the items.

I’m using this section in a course I’m teaching. I have just taught the students conditional probability in preparation for their working through this material.

As I expand on $\prod_{t'=1}^{T'} P(y_{t'} \mid y_1, \ldots, y_{t'-1}, \mathbf{c})$ in Section 10.8.1, I find I must change the range to $\prod_{t'=l+1}^{l+T'} P(y_{t'}$, as $y_{t'}=y_1$ is already used for an earlier term here.

Would a pull request to make this edit be helpful?

Yes, you are right. I wanted to point out the same thing.