I have one clarification question. Is the dimension $d$ mentioned in Section 8.4.3 equal to the size of the vocabulary and the result of one-hot encoding the input words?

It all depends on how you define your embedding. One-hot encoding all of the input words is definitely one approach. However, it's an inefficient way of creating a word embedding, and we would end up with too many dimensions. Not only that, it doesn't capture similarities between words, so we would have to learn the whole language from scratch.
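To make the difference concrete, here is a minimal sketch in plain Python. The toy vocabulary, the choice `d = 3`, and the helper names are all made up for illustration, and the embedding table is random rather than learned. It only shows that a one-hot vector has as many dimensions as the vocabulary, while a dense embedding has whatever dimension you pick.

```python
import random

# Toy vocabulary, made up for illustration.
vocab = ["the", "time", "machine", "by", "wells"]
vocab_size = len(vocab)
d = 3  # embedding dimension: a free choice, not tied to vocab_size

def one_hot(index, size):
    """Vector of length `size` with a single 1 at position `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]

random.seed(0)
# One row of d numbers per word. Random here; learned or pre-trained in practice.
embedding_table = [[random.uniform(-1, 1) for _ in range(d)]
                   for _ in range(vocab_size)]

def embed_by_lookup(index):
    """Dense embedding: pick the table row for this word."""
    return embedding_table[index]

def embed_by_matmul(one_hot_vec):
    """Same result, written as one-hot vector times embedding table."""
    return [sum(x * row[j] for x, row in zip(one_hot_vec, embedding_table))
            for j in range(d)]

idx = vocab.index("machine")
x_one_hot = one_hot(idx, vocab_size)  # length grows with the vocabulary
x_dense = embed_by_lookup(idx)        # length fixed by our choice of d
print(len(x_one_hot), len(x_dense))
```

So with one-hot inputs the input dimension equals the vocabulary size, and with an embedding layer in front it is whatever embedding size you choose.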

https://www.tensorflow.org/tutorials/text/word_embeddings

Here's a tutorial on word embeddings that shows how to train your own. In practice, though, we would use a pre-trained embedding.

Isn't *geometric mean* the correct term in the following sentence?

Perplexity can be best understood as the harmonic mean of the number of real choices that we have when deciding which token to pick next

Now I guess you are right about geometric mean.

Try opening a PR!

The harmonic mean corresponds to perplexity in the following cases, respectively:

best case: n/(1 + … + 1) = 1

baseline case: n/(1/n + … + 1/n) = n

worst case: n/(0 + … + 0) = inf
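A quick numerical check may help settle the geometric vs. harmonic question. This is only a sketch: it assumes perplexity is defined as the exponential of the average negative log-likelihood, and the function names and the "mixed" probabilities are made up. Under that definition perplexity is the geometric mean of 1/p, while n/(p_1 + … + p_n) is the harmonic mean of 1/p. The two agree in the uniform cases above but not in general.

```python
import math

def perplexity(probs):
    """exp(average negative log-likelihood) = geometric mean of 1/p."""
    n = len(probs)
    return math.exp(-sum(math.log(p) for p in probs) / n)

def harmonic_mean_of_choices(probs):
    """Harmonic mean of 1/p, which simplifies to n / (p_1 + ... + p_n)."""
    return len(probs) / sum(probs)

best = [1.0] * 4                # model is always certain and correct
baseline = [0.25] * 4           # uniform guess over 4 tokens
mixed = [0.9, 0.1, 0.5, 0.5]    # made-up, non-uniform predictions

# The worst case (p = 0) is left out: log(0) is undefined and the
# harmonic-mean formula would divide by zero.
for name, probs in [("best", best), ("baseline", baseline), ("mixed", mixed)]:
    print(name, perplexity(probs), harmonic_mean_of_choices(probs))
```

For the mixed case the geometric mean comes out larger than the harmonic mean, so the two notions only coincide when all the probabilities are equal.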

Can someone explain to me how compression factors in with RNNs?

It is mentioned that a good language model can achieve compression.

I think the word "compression" could be replaced with the word "encoding", in the sense that fewer bits of information are required to describe the relations between words in the language.

This is the case when you want to describe the days of the week. You can use 7 bits, one each for Monday, …, Sunday (one-hot encoding), to encode a day of the week, or you can use only 3 bits with binary encoding (2^3 = 8, which is enough to describe 7 days: Monday = 000, Tuesday = 001, …) to "compress" the days of the week.
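The days-of-the-week example can be written out as a small sketch. The variable names are my own, and the bit assignment simply follows the order Monday = 000, Tuesday = 001, and so on.

```python
import math

days = ["monday", "tuesday", "wednesday", "thursday",
        "friday", "saturday", "sunday"]

# One-hot: one bit per day, so 7 bits for every day.
one_hot = {day: [1 if i == j else 0 for j in range(len(days))]
           for i, day in enumerate(days)}

# Binary: the smallest number of bits whose 2**bits covers all 7 days.
bits_needed = math.ceil(math.log2(len(days)))
binary = {day: format(i, f"0{bits_needed}b") for i, day in enumerate(days)}

print(len(one_hot["monday"]), bits_needed)
print(binary["monday"], binary["tuesday"], binary["sunday"])
```

Both encodings identify each day uniquely; the binary one just spends fewer bits doing so.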

In the context of the evaluation of perplexity in Section 8.4.4, where the term "compression" is used, I think an RNN model's performance is measured by its ability to describe a sequence of words with as little information as possible, that is, to compress such a sequence as efficiently as possible. In this way, "compression" may be regarded as "efficient abstraction".
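One way to see the link between perplexity and compression is to count bits. The sketch below assumes an ideal entropy coder that spends about -log2(p) bits on a token the model predicted with probability p, ignoring rounding overhead. The vocabulary size of 8 and the probabilities are made up for illustration.

```python
import math

def bits_per_token(probs):
    """Average ideal code length in bits: mean of -log2(p)."""
    return -sum(math.log2(p) for p in probs) / len(probs)

def perplexity_from_bits(bits):
    """Perplexity is 2 raised to the average number of bits per token."""
    return 2 ** bits

# Hypothetical vocabulary of 8 tokens, sequence of 5 tokens.
uniform = [1 / 8] * 5                    # model that knows nothing
confident = [0.5, 0.9, 0.8, 0.7, 0.95]   # made-up probabilities of a better model

for name, probs in [("uniform", uniform), ("confident", confident)]:
    b = bits_per_token(probs)
    print(name, b, perplexity_from_bits(b))
```

The uniform model needs 3 bits per token, the same as a fixed binary code over 8 tokens, while the more confident model needs fewer. Lower perplexity and better compression are the same statement in different units.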

And in this direction, DNA, as an encoder of sequence information, may be regarded as an efficient abstraction of living organisms.

Is there a place I can find solutions? I would like to check my answers to the exercises. Thanks!