What is the point of the first sort?

        # Sort according to frequencies
        counter = count_corpus(tokens)
        self.token_freqs = sorted(counter.items(), key=lambda x: x[0])
        self.token_freqs.sort(key=lambda x: x[1], reverse=True)

I think the first sort is unnecessary. The sorting code can be simplified to:

        self.token_freqs = sorted(counter.items(), key=lambda x: x[1], reverse=True)

        pip install -U git+https://github.com/d2l-ai/d2l-en.git@master

will help!

I agree. Fixing: https://github.com/d2l-ai/d2l-en/pull/1543
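
For anyone comparing the two sorting variants above, here is a minimal sketch of where they can differ. It assumes `count_corpus` returns a `collections.Counter`; the toy token list is made up for illustration. Because Python's sort is stable, the extra sort by token only changes the order of tokens that share the same frequency.

```python
from collections import Counter

# Toy corpus (hypothetical); stands in for the output of count_corpus(tokens).
tokens = ["the", "time", "machine", "by", "the", "time", "a", "by"]
counter = Counter(tokens)

# Two-step version: sort by token first, then stably by frequency.
# Ties in frequency end up in alphabetical order.
two_step = sorted(counter.items(), key=lambda x: x[0])
two_step.sort(key=lambda x: x[1], reverse=True)

# One-step version: ties keep the Counter's insertion order
# (the order in which tokens first appeared).
one_step = sorted(counter.items(), key=lambda x: x[1], reverse=True)

# A single sort with a compound key reproduces the two-step ordering.
compound = sorted(counter.items(), key=lambda x: (-x[1], x[0]))

print(two_step)
print(one_step)
print(compound == two_step)
```

Both variants produce the same frequencies in the same descending order, so the simplification is safe whenever the order among equally frequent tokens does not matter.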

Why is a token with frequency >= min_freq a unique token?

        uniq_tokens += [token for token, freq in self.token_freqs
                        if freq >= min_freq and token not in uniq_tokens]
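
One way to read the snippet above, as a sketch: the keys of the frequency counter are already distinct, so "unique" just means each token gets one slot in the vocabulary. The `min_freq` test drops rare tokens, and the `not in uniq_tokens` test only prevents re-adding tokens that are already in the list. The example assumes `uniq_tokens` starts as `'<unk>'` plus any reserved tokens; the toy data and the `'<pad>'` entry are made up for illustration.

```python
from collections import Counter

# Toy data (hypothetical); '<unk>' also appears in the corpus on purpose.
tokens = ["a", "b", "a", "c", "a", "b", "<unk>", "<unk>"]
min_freq = 2
reserved_tokens = ["<pad>"]

token_freqs = sorted(Counter(tokens).items(),
                     key=lambda x: x[1], reverse=True)

# Assumed starting point: the unknown token followed by reserved tokens.
uniq_tokens = ["<unk>"] + reserved_tokens

# Keep tokens that are frequent enough and not already in the list.
uniq_tokens += [token for token, freq in token_freqs
                if freq >= min_freq and token not in uniq_tokens]

print(uniq_tokens)
```

Here `c` is dropped because its count is below `min_freq`, and `<unk>` is not added a second time even though it occurs twice in the corpus.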

I was wondering if we can use the nltk package for this.

  1. TreebankWordTokenizer: It separates words using spaces and punctuation.
  2. PunktWordTokenizer: It does not separate punctuation from the word.
  3. WordPunctTokenizer: It separates punctuation from the word.
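
To make the difference between these behaviors concrete without installing anything, here is a rough sketch using only the standard library. It is not NLTK itself: the regex `\w+|[^\w\s]+` is the pattern that NLTK's `WordPunctTokenizer` is documented to use, and plain `str.split()` stands in for a tokenizer that leaves punctuation attached. The sample sentence is made up.

```python
import re

text = "Hello, world! It's 3.14."

# Whitespace splitting: punctuation stays attached to the words.
whitespace_tokens = text.split()

# Runs of word characters or runs of punctuation become separate tokens,
# mimicking the pattern behind NLTK's WordPunctTokenizer.
punct_tokens = re.findall(r"\w+|[^\w\s]+", text)

print(whitespace_tokens)
print(punct_tokens)
```

Note that the punctuation-splitting variant also breaks `It's` and `3.14` apart, which may or may not be what a given vocabulary needs.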
