Thanks. This is just a tunable hyperparameter. For demonstration purposes, we define a small BERT and set it to 2H so that users can run it locally and quickly see the results.
MLM is a multi-class classification task and the vocab_size is large. In the loss function, the -log P term for the correct label cannot be driven as low as in a binary classification task. That is why the MLM loss is much larger than the NSP loss. Is that right? @astonzhang
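A quick numerical check of this intuition (a minimal sketch in PyTorch; the vocab_size of 10000 is only illustrative): with uniform, i.e. completely uninformed, predictions the cross-entropy loss equals the log of the number of classes, so the MLM loss starts near log(vocab_size) while the NSP loss starts near log(2).

```python
import math
import torch
import torch.nn.functional as F

vocab_size = 10000  # illustrative; the real value depends on the tokenizer/vocab

# All-zero logits give a uniform softmax, so cross-entropy = log(num_classes).
mlm_logits = torch.zeros(1, vocab_size)  # MLM: predict one token out of the whole vocab
nsp_logits = torch.zeros(1, 2)           # NSP: binary IsNext / NotNext prediction
label = torch.tensor([0])

mlm_loss = F.cross_entropy(mlm_logits, label)  # ~ log(10000) ≈ 9.21
nsp_loss = F.cross_entropy(nsp_logits, label)  # ~ log(2)     ≈ 0.69
print(float(mlm_loss), float(nsp_loss))
print(math.log(vocab_size), math.log(2))
```

So even at the same "quality" of prediction, the MLM loss sits on a much larger scale simply because there are vocab_size possible answers instead of 2.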