The Dataset for Pretraining BERT

In the BERT masked LM code, the 10% case should replace the word with a random word instead of a random number (a vocabulary index).

The highlighted line should be changed to:
vocab.to_tokens(random.randint(0, len(vocab) - 1))
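For context, here is a minimal sketch of the 80%/10%/10% choice, assuming the book's string-valued tokens and a d2l-style Vocab with a to_tokens method; the helper name below is hypothetical. The point is that the 10% random replacement should be a token string rather than an integer index:

import random

def _pick_masked_token(original_token, vocab):
    # Hypothetical helper: choose the MLM replacement for one position.
    if random.random() < 0.8:
        return '<mask>'          # 80% of the time: use the mask token
    if random.random() < 0.5:
        return original_token    # 10% of the time: keep the word unchanged
    # 10% of the time: replace with a random word; convert the sampled
    # index back into a token string
    return vocab.to_tokens(random.randint(0, len(vocab) - 1))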

I agree. Contribute if you like.

Hi @HeartSea15, feel free to post a PR as @StevenJokess suggested! Thanks!

In the masked LM task, should the input tokens take the WordPiece '##' prefix into account? In our code there is no handling of it.
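For illustration, taking the '##' prefix into account would amount to whole-word masking: group the subword pieces into whole words before sampling mask positions. This is not part of the book's code; the helper below is only a hypothetical sketch:

def _group_wordpieces(tokens):
    # Hypothetical helper: group token indices so that each inner list
    # covers one whole word, e.g. ['un', '##aff', '##able', 'fun'] -> [[0, 1, 2], [3]]
    groups = []
    for i, token in enumerate(tokens):
        if token.startswith('##') and groups:
            groups[-1].append(i)   # a '##' piece continues the previous word
        else:
            groups.append([i])     # a new whole word starts here
    return groups

Masking would then be applied per group, so all pieces of a word are masked together (as in the later whole-word-masking BERT variants).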

Hi, is the value of max_num_mlm_preds chosen to follow the paper, or how was it set? Thanks for any reply.


The 15% figure follows the BERT paper (page 7), but what is the deeper reason behind choosing 15%?

Do the two 15% values in the code correspond to the paper?
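As far as I can tell, both 15% values trace back to the paper's 15% masking rate: one decides how many tokens of a given example are predicted, the other sizes the padded prediction arrays for a full-length input. A minimal sketch of that reasoning, with variable names as in the book's code and a hypothetical wrapper function:

def mlm_pred_counts(tokens, max_len):
    # Per example: predict 15% of its tokens, but at least one position
    num_mlm_preds = max(1, round(len(tokens) * 0.15))
    # Per batch: pad every example's prediction slots to the size needed
    # for a full max_len-token input
    max_num_mlm_preds = round(max_len * 0.15)
    return num_mlm_preds, max_num_mlm_preds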

I saw your dataset in the attachment. I also need to build a customized dataset to pretrain our model, but I don't fully understand your approach. Could you explain it in more detail or provide a sample walkthrough of your dataset? I'm new to this area.
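In case it helps: the book's pipeline expects the corpus as a list of paragraphs, where each paragraph is a list of sentence strings (that is the shape _read_wiki produces before the data are fed to the dataset class). A rough sketch of reading a custom plain-text file into that shape, assuming one paragraph per line and sentences separated by ' . '; the file name and helper are hypothetical:

import random

def read_custom_corpus(path='my_corpus.txt'):
    # Hypothetical reader: return a list of paragraphs, each a list of sentences
    with open(path, 'r') as f:
        lines = f.readlines()
    # Keep only paragraphs with at least two sentences, because the
    # next-sentence-prediction task needs sentence pairs
    paragraphs = [line.strip().lower().split(' . ')
                  for line in lines if len(line.split(' . ')) >= 2]
    random.shuffle(paragraphs)
    return paragraphs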