The Dataset for Pretraining BERT

In the BERT masked LM code, the 10% case should replace the word with a random word instead of a random number (a vocabulary index).

The highlighted line should be changed to:
vocab.to_tokens(random.randint(0, len(vocab) - 1))
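For context, here is a minimal sketch of the 80%/10%/10% choice, assuming the book's string-valued tokens and a d2l-style Vocab with a to_tokens method; the helper name below is hypothetical. The point is that the 10% random replacement should be a token string rather than an integer index:

import random

def _pick_masked_token(original_token, vocab):
    # Hypothetical helper: choose the MLM replacement for one position.
    if random.random() < 0.8:
        return '<mask>'          # 80% of the time: use the mask token
    if random.random() < 0.5:
        return original_token    # 10% of the time: keep the word unchanged
    # 10% of the time: replace with a random word; convert the sampled
    # index back into a token string
    return vocab.to_tokens(random.randint(0, len(vocab) - 1))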

I agree. Contribute if you like.

Hi @HeartSea15, feel free to post a PR as @StevenJokess suggested! Thanks!

In the masked LM task, should the input tokens take the WordPiece '##' prefix into account? In our code there is no handling of it.
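For illustration, taking the '##' prefix into account would amount to whole-word masking: group the subword pieces into whole words before sampling mask positions. This is not part of the book's code; the helper below is only a hypothetical sketch:

def _group_wordpieces(tokens):
    # Hypothetical helper: group token indices so that each inner list
    # covers one whole word, e.g. ['un', '##aff', '##able', 'fun'] -> [[0, 1, 2], [3]]
    groups = []
    for i, token in enumerate(tokens):
        if token.startswith('##') and groups:
            groups[-1].append(i)   # a '##' piece continues the previous word
        else:
            groups.append([i])     # a new whole word starts here
    return groups

Masking would then be applied per group, so all pieces of a word are masked together (as in the later whole-word-masking BERT variants).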

Hi, is the value of max_num_mlm_preds chosen to follow the paper, or how was it set? Thanks for any reply.


The 15% figure follows the BERT paper (page 7), but what is the deeper reason behind choosing 15%?

Do the two 15% values in the code correspond to the paper?
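As far as I can tell, both 15% values trace back to the paper's 15% masking rate: one decides how many tokens of a given example are predicted, the other sizes the padded prediction arrays for a full-length input. A minimal sketch of that reasoning, with variable names as in the book's code and a hypothetical wrapper function:

def mlm_pred_counts(tokens, max_len):
    # Per example: predict 15% of its tokens, but at least one position
    num_mlm_preds = max(1, round(len(tokens) * 0.15))
    # Per batch: pad every example's prediction slots to the size needed
    # for a full max_len-token input
    max_num_mlm_preds = round(max_len * 0.15)
    return num_mlm_preds, max_num_mlm_preds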

I saw your dataset in the attachment. I also need to build a customized dataset to pretrain our model, but I don't fully understand your approach. Could you explain it in more detail or provide a sample walkthrough of your dataset? I'm new to this area.
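In case it helps: the book's pipeline expects the corpus as a list of paragraphs, where each paragraph is a list of sentence strings (that is the shape _read_wiki produces before the data are fed to the dataset class). A rough sketch of reading a custom plain-text file into that shape, assuming one paragraph per line and sentences separated by ' . '; the file name and helper are hypothetical:

import random

def read_custom_corpus(path='my_corpus.txt'):
    # Hypothetical reader: return a list of paragraphs, each a list of sentences
    with open(path, 'r') as f:
        lines = f.readlines()
    # Keep only paragraphs with at least two sentences, because the
    # next-sentence-prediction task needs sentence pairs
    paragraphs = [line.strip().lower().split(' . ')
                  for line in lines if len(line.split(' . ')) >= 2]
    random.shuffle(paragraphs)
    return paragraphs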