Machine Translation and the Dataset

https://d2l.ai/chapter_recurrent-modern/machine-translation-and-dataset.html

In preprocess_nmt, should we also handle the question mark '?'?

Did you mean the '?' in 'who?'?
I guess you are right.
@Songlin_Zheng

Hi @Songlin_Zheng, great catch! We will fix the handling of '?' in both the MXNet and PyTorch versions.
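For anyone following along, the fix amounts to adding '?' to the punctuation set that preprocess_nmt inserts a space in front of. A rough sketch of the corrected function, assuming the same structure as the book's version:

#@save
def preprocess_nmt(text):
    """Preprocess the English-French dataset."""
    def no_space(char, prev_char):
        # '?' is included here so that e.g. 'who?' tokenizes as 'who', '?'
        return char in set(',.!?') and prev_char != ' '

    # Replace non-breaking spaces with ordinary spaces and lowercase
    text = text.replace('\u202f', ' ').replace('\xa0', ' ').lower()
    # Insert a space before punctuation not already preceded by one
    out = [' ' + char if i > 0 and no_space(char, text[i - 1]) else char
           for i, char in enumerate(text)]
    return ''.join(out)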

When running this cell:
#@save
d2l.DATA_HUB['fra-eng'] = (d2l.DATA_URL + 'fra-eng.zip',
                           '94646ad1522d915e7b0f9296181140edcf86a4f5')

#@save
def read_data_nmt():
    """Load the English-French dataset."""
    data_dir = d2l.download_extract('fra-eng')
    with open(os.path.join(data_dir, 'fra.txt'), 'r') as f:
        return f.read()

raw_text = read_data_nmt()
print(raw_text[:75])

the following error occurred:
UnicodeDecodeError: 'gbk' codec can't decode byte 0xaf in position 33: illegal multibyte sequence

I fixed this by modifying this line:
with open(os.path.join(data_dir, 'fra.txt'), 'r') as f:
to:
with open(os.path.join(data_dir, 'fra.txt'), 'r', encoding='utf-8') as f:

Can someone explain "the histogram of the number of tokens per text sequence" to me?
I don't understand how the figure was drawn.
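In case it helps, here is a minimal sketch of how such a histogram can be drawn with plain matplotlib. The toy source/target lists below are stand-ins for the tokenized corpus returned by tokenize_nmt: each sequence contributes one value (its length in tokens), and the histogram counts how many sequences fall into each length bucket.

import matplotlib.pyplot as plt

# Stand-ins for the tokenized corpus: one token list per sequence
source = [['go', '.'], ['hi', '.'], ['i', 'lost', '.']]
target = [['va', '!'], ['salut', '!'], ["j'ai", 'perdu', '.']]

# One value per sequence: its length in tokens
src_lens = [len(seq) for seq in source]
tgt_lens = [len(seq) for seq in target]

# Two bars per length bucket, one for each language
plt.hist([src_lens, tgt_lens], label=['source', 'target'])
plt.xlabel('# tokens per sequence')
plt.ylabel('count')
plt.legend()
plt.show()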

I think truncate_pad() should add the '<eos>' token when truncation is performed:

def truncate_pad(line, num_steps, padding_token, eos_token):
    if len(line) > num_steps:
        # Truncate, but reserve the last position for the eos token
        return line[:num_steps - 1] + [eos_token]
    # Pad shorter sequences up to num_steps
    return line + [padding_token] * (num_steps - len(line))

otherwise truncate_pad() may drop the '<eos>' token.
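A quick sanity check of the proposed version (the token ids here are hypothetical; say 1 stands for '<eos>' and 0 for '<pad>'):

line = [7, 8, 9, 10, 1]  # a sequence that already ends with the eos id 1

# Truncation now keeps the eos token in the last position
print(truncate_pad(line, 3, padding_token=0, eos_token=1))
# [7, 8, 1]

# Padding is unchanged for sequences shorter than num_steps
print(truncate_pad(line, 8, padding_token=0, eos_token=1))
# [7, 8, 9, 10, 1, 0, 0, 0]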