Pretraining BERT

astonzhang · June 29, 2020, 10:41pm

https://d2l.ai/chapter_natural-language-processing-pretraining/bert-pretraining.html

HeartSea15 · October 19, 2020, 7:02am

peng · November 17, 2020, 1:06pm

In the experiment, we can see that the masked language modeling loss is significantly higher than the next sentence prediction loss. Why?

Is it because the MLM task is much more difficult than the NSP task?

astonzhang · November 20, 2020, 6:27pm

Thanks. This is just a tunable hyperparameter. For demonstration purposes, we define a small BERT and set it to 2H so that users can run it locally and quickly see the results.

astonzhang · November 20, 2020, 6:28pm

Hint: NSP is binary classification. How about MLM?

peng · November 24, 2020, 11:46am

MLM is a multiclass classification task, and vocab_size is large. The loss term -log P for the correct label therefore cannot be driven as low as in a binary classification task. That's why the MLM loss is much larger than the NSP loss. Is that right? @astonzhang
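
A quick way to sanity-check this reasoning is to compare the cross-entropy of a uniform random guess for each task. This is only a minimal sketch in plain Python: the helper `uniform_cross_entropy` and the vocabulary size of 20000 are illustrative assumptions, not values taken from the chapter's code.

```python
import math

def uniform_cross_entropy(num_classes):
    """Cross-entropy (in nats) when every class gets probability 1/num_classes."""
    return -math.log(1.0 / num_classes)

# NSP is binary classification: two classes.
nsp_baseline = uniform_cross_entropy(2)

# MLM predicts one token out of the whole vocabulary.
# Assumption: 20000 is a hypothetical vocabulary size, used only for illustration.
vocab_size = 20000
mlm_baseline = uniform_cross_entropy(vocab_size)

print(f"NSP chance-level loss: {nsp_baseline:.4f}")
print(f"MLM chance-level loss: {mlm_baseline:.4f}")
```

Under these assumptions the chance-level loss is about 0.69 for NSP and about 9.9 for MLM, so the two losses start from very different scales. How far each one drops during training still depends on the model and the data.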