Thanks. This is just a tunable hyperparameter. For demonstration purposes, we define a small BERT and set it to 2H so that users can run it locally and quickly see the results.
MLM is a multi-class classification task and the vocab_size is large. In the loss function, the -log P term for the correct label cannot be driven as low as in a binary classification task. That is why the MLM loss is much larger than the NSP loss. Is that right? @astonzhang
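A quick numerical check of this intuition (a minimal sketch in PyTorch; the vocab_size of 10000 is only illustrative): with uniform, i.e. completely uninformed, predictions the cross-entropy loss equals the log of the number of classes, so the MLM loss starts near log(vocab_size) while the NSP loss starts near log(2).

```python
import math
import torch
import torch.nn.functional as F

vocab_size = 10000  # illustrative; the real value depends on the tokenizer/vocab

# All-zero logits give a uniform softmax, so cross-entropy = log(num_classes).
mlm_logits = torch.zeros(1, vocab_size)  # MLM: predict one token out of the whole vocab
nsp_logits = torch.zeros(1, 2)           # NSP: binary IsNext / NotNext prediction
label = torch.tensor([0])

mlm_loss = F.cross_entropy(mlm_logits, label)  # ~ log(10000) ≈ 9.21
nsp_loss = F.cross_entropy(nsp_logits, label)  # ~ log(2)     ≈ 0.69
print(float(mlm_loss), float(nsp_loss))
print(math.log(vocab_size), math.log(2))
```

So even at the same "quality" of prediction, the MLM loss sits on a much larger scale simply because there are vocab_size possible answers instead of 2.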