Recurrent Neural Networks

https://d2l.ai/chapter_recurrent-neural-networks/rnn.html

I have one clarification question. Is the dimension $d$ mentioned in Section 8.4.3 equal to the size of the vocabulary and the result of one-hot encoding the input words?

It all depends on how you define your embedding. If you one-hot encode the input tokens, then $d$ is indeed the vocabulary size, so that is certainly one approach. However, it is an inefficient way of representing words: we end up with far too many dimensions, it does not capture similarities between words, and the model has to learn the whole language from scratch.

https://www.tensorflow.org/tutorials/text/word_embeddings

Here’s a tutorial on word embeddings that shows how to train your own. In practice, though, we would use a pre-trained embedding.
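As a rough illustration, here is a minimal PyTorch sketch (the vocabulary size of 10,000 and embedding size of 64 are made-up numbers, not the book's) contrasting the dimensionality of a one-hot representation with that of a learned embedding:

```python
import torch
import torch.nn.functional as F

vocab_size = 10000                        # hypothetical vocabulary size
token_ids = torch.tensor([3, 17, 256])    # a toy mini-batch of token indices

# One-hot encoding: each token becomes a sparse vector of dimension d = vocab_size.
one_hot = F.one_hot(token_ids, num_classes=vocab_size).float()
print(one_hot.shape)   # torch.Size([3, 10000])

# A learned embedding maps the same tokens into a much smaller dense space,
# and similar words can end up with similar vectors after training.
embed = torch.nn.Embedding(num_embeddings=vocab_size, embedding_dim=64)
dense = embed(token_ids)
print(dense.shape)     # torch.Size([3, 64])
```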

Isn’t “geometric mean” the correct term in this sentence:

Perplexity can be best understood as the harmonic mean of the number of real choices that we have when deciding which token to pick next

I think you are right: it should be the geometric mean.
Feel free to open a PR!

The harmonic mean corresponds to perplexity in the following cases (see the sketch below):

best case: $n/(1 + \cdots + 1) = 1$
baseline case: $n/(1/n + \cdots + 1/n) = n$
worst case: $n/(0 + \cdots + 0) = \infty$
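To relate this to the book's definition: perplexity is the exponentiated average negative log-likelihood, i.e. the geometric mean of $1/p_t$. A minimal plain-Python sketch with toy probabilities (my own example, not from the chapter) reproduces the same boundary cases:

```python
import math

def perplexity(probs):
    """Perplexity = exp of the average negative log-likelihood,
    i.e. the geometric mean of 1/p_t."""
    n = len(probs)
    return math.exp(-sum(math.log(p) for p in probs) / n)

n = 5
best = [1.0] * n           # the model is always certain and correct
baseline = [1.0 / n] * n   # uniform guess over n tokens (toy vocabulary of size n)

print(perplexity(best))      # 1.0
print(perplexity(baseline))  # 5.0, i.e. n
# Worst case: some p_t -> 0, so -log(p_t) -> inf and the perplexity diverges.
```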

Can someone explain to me how compression factors in with RNNs?
It is mentioned that with a good language model, compression can be achieved.

I think the word ‘compression’ could be replaced with the word ‘encoding’, in the sense that fewer bits of information are required to describe the relations between words in the language.

This is the case when you want to describe the days of the week. You can use 7 digits, one for each of Monday, …, Sunday (one-hot encoding), or you may choose to use only 3 bits with a binary encoding (2^3 = 8, which is enough to describe 7 days: Monday = 000, Tuesday = 001, …) and thereby ‘compress’ the days of the week, as in the sketch below.
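For concreteness, a small Python sketch of the two encodings (the variable names are just for illustration):

```python
# Days of the week, indexed 0..6.
days = ["monday", "tuesday", "wednesday", "thursday", "friday", "saturday", "sunday"]

# One-hot encoding: 7 digits, exactly one of them set.
one_hot = {d: [1 if j == i else 0 for j in range(len(days))] for i, d in enumerate(days)}
print(one_hot["tuesday"])  # [0, 1, 0, 0, 0, 0, 0]

# Binary encoding: 3 bits suffice, since 2**3 = 8 >= 7.
binary = {d: format(i, "03b") for i, d in enumerate(days)}
print(binary["tuesday"])   # '001'
```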

In the context of the evaluation of perplexity in Section 8.4.4, where the term ‘compression’ has been used, I think that an RNN model's performance is measured by its ability to describe a sequence of words with as little information as possible, i.e. by its ability to compress such a sequence as efficiently as possible. This way, “compression” may be regarded as “efficient abstraction”.
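One way to make this concrete (a toy Python sketch with made-up per-token probabilities, not the book's code): the cross-entropy in base 2 gives the average number of bits per token that an ideal code built from the model would need, and $2^{\text{bits per token}}$ recovers the perplexity. A better model assigns higher probabilities to the observed tokens and therefore needs fewer bits, i.e. compresses better:

```python
import math

# Hypothetical probabilities a language model assigns to the tokens of an observed sequence.
probs = [0.5, 0.25, 0.8, 0.1]

# Cross-entropy in bits per token.
bits_per_token = -sum(math.log2(p) for p in probs) / len(probs)
print(bits_per_token)       # ~1.66 bits per token
print(2 ** bits_per_token)  # ~3.16, the perplexity (the "number of real choices")
```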

And in this direction, DNA, as an encoder of sequence information, may be regarded as an efficient abstraction of living things :slight_smile:

Is there a place I can find solutions? I would like to check my answers to the exercises. Thanks!