I have one clarification question. Is the dimension $d$ mentioned in Section 8.4.3 equal to the size of the vocabulary and the result of one-hot encoding the input words?

It all depends on how you define your embedding. One-hot encoding all of the input words is certainly one approach. However, it is an inefficient way of creating a word embedding: we would end up with far too many dimensions. On top of that, it doesn't capture similarities between words, so we would have to learn the whole language from scratch.
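To make the dimensionality point concrete, here is a small sketch (the four-word toy vocabulary and the embedding size of 2 are made up for illustration) contrasting one-hot vectors, whose dimension $d$ equals the vocabulary size, with a dense embedding:

```python
import numpy as np

# Toy vocabulary; a real one easily has tens of thousands of entries.
vocab = ["the", "cat", "dog", "sat"]
d = len(vocab)  # one-hot dimension equals the vocabulary size

one_hot = np.eye(d)  # each row encodes one word
# Every pair of distinct words is orthogonal: no notion of similarity.
print(one_hot[vocab.index("cat")] @ one_hot[vocab.index("dog")])  # 0.0

# A learned embedding maps words into a much smaller dense space.
embed_dim = 2
rng = np.random.default_rng(0)
E = rng.normal(size=(d, embed_dim))  # would be trained in practice
print(E[vocab.index("cat")].shape)   # (2,)
```

With a real vocabulary, `d` explodes while `embed_dim` stays small, which is why the one-hot route is inefficient.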

https://www.tensorflow.org/tutorials/text/word_embeddings

Here’s a tutorial on word embeddings that shows how to train your own. In practice, though, we would usually use a pre-trained embedding.

Isn’t *geometric mean* the correct term in the sentence

Perplexity can be best understood as the harmonic mean of the number of real choices that we have when deciding which token to pick next

I guess you are right about the geometric mean.

Feel free to open a PR!

The harmonic mean corresponds to perplexity in the following cases, respectively:

best case: n/(1 + … + 1) = 1

baseline case: n/(1/n + … + 1/n) = n

worst case: n/(0 + … + 0) = inf
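The three cases can be checked numerically; a small sketch (the `harmonic_mean` helper and n = 5 are just for illustration):

```python
import math

def harmonic_mean(xs):
    """n / sum(1/x): the aggregate of the 'number of choices' discussed above."""
    n = len(xs)
    s = sum(1 / x for x in xs)
    return math.inf if s == 0 else n / s

n = 5
print(harmonic_mean([1] * n))         # best case: 1 choice each -> 1.0
print(harmonic_mean([n] * n))         # baseline: n choices each -> 5.0
print(harmonic_mean([math.inf] * n))  # worst case -> inf
```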

Can someone explain to me how compression factors in with RNNs?

It is mentioned that for a good language model, compression can be achieved.

I think the word ‘compression’ may be replaced with the word ‘encoding’, in the sense that fewer bits of information are required to describe the relations between words in the language.

This is the case when you want to describe the days of the week. You can use 7 digits, one for each of Monday, …, Sunday (one-hot encoding), or you may choose to use only 3 bits with a binary encoding (2^3 = 8, which is enough to describe 7 days: Monday = 000, Tuesday = 001, …) to ‘compress’ the days of the week.
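A quick sketch of the two encodings (the day ordering Monday..Sunday is assumed):

```python
# One-hot: 7 bits, one per day.
days = ["monday", "tuesday", "wednesday", "thursday",
        "friday", "saturday", "sunday"]
one_hot = {d: [1 if i == j else 0 for j in range(7)]
           for i, d in enumerate(days)}
print(one_hot["tuesday"])  # [0, 1, 0, 0, 0, 0, 0]

# Binary: 3 bits suffice, since 2**3 = 8 >= 7.
binary = {d: format(i, "03b") for i, d in enumerate(days)}
print(binary["monday"], binary["tuesday"])  # 000 001
```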

In the context of the evaluation of perplexity in Section 8.4.4, where the term ‘compression’ has been used, I think that an RNN model's performance is measured by its ability to describe a sequence of words with as little information as possible, i.e., to compress such a sequence as efficiently as possible. This way, “compression” may be regarded as “efficient abstraction”.
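One way to make this concrete: a model that assigns probability p_t to the actual next token needs about -log2(p_t) bits to encode it with an ideal coder (e.g. arithmetic coding), so lower cross-entropy means better compression. A sketch with made-up per-token probabilities:

```python
import math

# Made-up per-token probabilities from two hypothetical models
good_model = [0.5, 0.8, 0.6, 0.9]
weak_model = [0.1, 0.2, 0.1, 0.3]

def bits_per_token(probs):
    """Average code length (cross-entropy in bits) an ideal coder
    would need for these tokens under the given model."""
    return -sum(math.log2(p) for p in probs) / len(probs)

print(bits_per_token(good_model))  # fewer bits -> better compression
print(bits_per_token(weak_model))
# Note: perplexity is just 2 ** bits_per_token(probs).
```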

And in this direction, DNA, as a sequence-information encoder, may be regarded as an efficient abstraction of living things.

Is there a place I can find solutions? I would like to check my answers to the exercises. Thanks!

I am unable to understand perplexity. Can anyone suggest some reference article or material to understand it better?

I'm not really sure it is the best way to introduce the input and output data with a batch dimension. I think the concept of a mini-batch should be introduced later, since it is related to model training. Maybe it is better to introduce the idea at a higher level without the batch dimension, so that the reader can focus on the concepts. This would reduce the complexity of coping with all the dimensions at once.

The bright side is for readers who already have some knowledge of deep learning: they can understand why the tensors are created with so many dimensions and get a better picture of the actual inputs/outputs in practice.
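For readers new to the batch dimension, a minimal sketch (the sequence length 10 and batch size 32 are arbitrary) of what the extra dimension adds:

```python
import numpy as np

# Without a batch dimension: one sequence of 10 token ids
seq = np.arange(10)            # shape (10,)

# With a batch dimension: 32 such sequences processed together
batch = np.stack([seq] * 32)   # shape (32, 10)

print(seq.shape, batch.shape)  # (10,) (32, 10)
```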

I believe this should say geometric mean too.

perplexity = exp(-1/n * sum_t log(p_t))

= exp(sum_t log(p_t^(-1/n)))

= prod_t exp(log(p_t^(-1/n)))

= prod_t p_t^(-1/n)

= prod_t (1/p_t)^(1/n)

So the perplexity is the geometric mean of the multiplicative inverse of probabilities.
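A quick numerical sanity check of this derivation, with made-up probabilities:

```python
import math

# Made-up next-token probabilities assigned by a model
probs = [0.1, 0.5, 0.25, 0.8]
n = len(probs)

# Perplexity via the exponentiated average negative log-likelihood
ppl_exp = math.exp(-sum(math.log(p) for p in probs) / n)

# Geometric mean of the inverse probabilities
ppl_geo = math.prod(1 / p for p in probs) ** (1 / n)

print(ppl_exp, ppl_geo)
assert math.isclose(ppl_exp, ppl_geo)
```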

@astonzhang, the geometric mean is also going to satisfy the mentioned examples:

best case: 1^(1/n) * … * 1^(1/n) = 1

baseline case: (1/(1/n))^(1/n) * … * (1/(1/n))^(1/n) = n^(1/n) * … * n^(1/n) = n

worst case: (1/0)^(1/n) * … * (1/0)^(1/n) = inf

The harmonic mean gave the same result for these examples, but that is not the case in general.
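A tiny non-uniform example (made-up probabilities p = 1.0 and p = 0.25) where the two means disagree and perplexity matches the geometric mean:

```python
import math

# Inverse probabilities ("numbers of choices") for p = 1.0 and p = 0.25
choices = [1.0, 4.0]
n = len(choices)

geometric = math.prod(choices) ** (1 / n)   # sqrt(1 * 4) = 2.0
harmonic = n / sum(1 / c for c in choices)  # 2 / (1 + 0.25) = 1.6

# Perplexity computed from the probabilities directly
ppl = math.exp(-(math.log(1.0) + math.log(0.25)) / n)

print(geometric, harmonic)  # 2.0 1.6 -- they disagree off the uniform case
print(math.isclose(ppl, geometric))  # True: perplexity is the geometric mean
```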

I started a pull request to change harmonic mean to geometric mean in the text. The whole sentence would look like: “Perplexity can be best understood as the geometric mean of the number of real choices that we have when deciding which token to pick next”.

I’m not sure I would call the multiplicative inverse of the probability the “number of real choices”. It is definitely true in the homogeneous case, but I’m not sure it is a good depiction in general. Any thoughts?