Text Preprocessing


Um, I don't think we have the right pip installs to run this section. The second cell (the first cell after the pip cells) didn't work for me right off the bat, and I don't know exactly how I fixed it; I just took the pip installs from the last section, put 'em in, and it ran. Hearing how you guys fixed it and why it works would also be helpful, if you're willing :slight_smile:

Hi @smizerex, can you attach a code/error snapshot so we can reproduce the error?

Yeah, give me a sec.

Okay, so when I leave the code like this:

I get this:

But when I copy this from 8.1:

BOOM, it works:

Hi @smizerex, please make sure your local d2l notebooks folder is up to date with our GitHub repo. You can run the following commands in your terminal to rebase your d2l notebooks:

git fetch origin master
git rebase origin/master

:thinking: :thinking: How could it have gotten out of sync in the first place?

What's the point of the first sort?

        # Sort according to frequencies
        counter = count_corpus(tokens)
        self.token_freqs = sorted(counter.items(), key=lambda x: x[0])
        self.token_freqs.sort(key=lambda x: x[1], reverse=True)

I think the first sort is unnecessary. The sorting code can be simplified to:

        self.token_freqs = sorted(counter.items(), key=lambda x: x[1], reverse=True)
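A quick sketch of the simplified version, using `collections.Counter` in place of the book's `count_corpus` helper (they do the same counting job for a flat token list):

```python
from collections import Counter

tokens = ["the", "time", "machine", "the", "the", "time"]

# Counter plays the role of count_corpus here
counter = Counter(tokens)

# Single sort by frequency, descending
token_freqs = sorted(counter.items(), key=lambda x: x[1], reverse=True)
print(token_freqs)  # [('the', 3), ('time', 2), ('machine', 1)]
```

One small behavioral difference worth noting: since Python's sort is stable, the original two-step version breaks frequency ties alphabetically, while the single sort keeps ties in the counter's insertion order. For building a vocabulary, either ordering works.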

pip install -U git+https://github.com/d2l-ai/d2l-en.git@master
will help!


I agree. Fixing: https://github.com/d2l-ai/d2l-en/pull/1543

Why is a token with frequency >= min_freq a unique token?

        uniq_tokens += [token for token, freq in self.token_freqs
                        if freq >= min_freq and token not in uniq_tokens]
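My understanding: `min_freq` filters out rare tokens, so only tokens appearing at least `min_freq` times get their own index; everything rarer falls back to `<unk>`. A minimal sketch of that filtering (hypothetical `build_vocab` helper, not the book's actual `Vocab` class):

```python
from collections import Counter

def build_vocab(tokens, min_freq=2, reserved=("<unk>",)):
    # Count tokens, then keep only those seen at least min_freq times;
    # anything rarer would later map to the <unk> index.
    counter = Counter(tokens)
    token_freqs = sorted(counter.items(), key=lambda x: x[1], reverse=True)
    uniq_tokens = list(reserved)
    uniq_tokens += [tok for tok, freq in token_freqs
                    if freq >= min_freq and tok not in uniq_tokens]
    return uniq_tokens

tokens = ["a", "b", "a", "c", "a", "b"]
print(build_vocab(tokens, min_freq=2))  # ['<unk>', 'a', 'b']  ('c' is too rare)
```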

I was wondering if we can use the nltk package for this.

  1. TreebankWordTokenizer: separates words using spaces and punctuation.
  2. PunktWordTokenizer: does not separate punctuation from the word.
  3. WordPunctTokenizer: separates punctuation from the word.
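You can, yes. As far as I know, recent nltk releases ship `TreebankWordTokenizer` and `WordPunctTokenizer` but no longer expose `PunktWordTokenizer`. To illustrate just the punctuation-splitting difference, here is a minimal regex sketch (not the actual nltk implementations, which handle many more cases):

```python
import re

def word_punct_tokenize(text):
    # Like WordPunctTokenizer: split runs of punctuation off as separate tokens
    return re.findall(r"\w+|[^\w\s]+", text)

def whitespace_tokenize(text):
    # Split on whitespace only, keeping punctuation attached to words
    return text.split()

s = "Don't stop, please."
print(word_punct_tokenize(s))  # ['Don', "'", 't', 'stop', ',', 'please', '.']
print(whitespace_tokenize(s))  # ["Don't", 'stop,', 'please.']
```

For this chapter's character- and word-level models, the book's simple `str.split`-style tokenizer is enough, but nltk's tokenizers are a reasonable drop-in if you want punctuation handled properly.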