BERT: Evaluating the pre-trained model on the NSP task using the Gutenberg dataset

Posted by ncgqoxb0 4 months ago in Other

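We are trying to evaluate the pre-trained model on data from Project Gutenberg (about 3,000 books), but we cannot get close to the 97-98% accuracy on the Next Sentence Prediction (NSP) task reported in the paper.
We did some basic preprocessing, placing each sentence on its own line and separating paragraphs with blank lines. We then ran create_pretraining_data.py and run_pretraining.py, setting --do_train=False for the latter (see the commands below). This gave us about 78% accuracy on the next sentence prediction task, roughly 20 percentage points below the result you report in the paper.
Is our approach correct, or are we doing something wrong?
Could you share the algorithm you used to preprocess the books in the BookCorpus dataset? Perhaps the way we preprocess the data is causing the discrepancy.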
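The preprocessing described above can be sketched roughly as follows. This is a minimal illustration in Python, not the procedure used for BookCorpus (which this post does not describe): `split_sentences` and `format_documents` are hypothetical helper names, and the regex splitter is a naive stand-in for a real sentence tokenizer. One detail worth checking: the BERT README describes the input to create_pretraining_data.py as one sentence per line with blank lines delimiting documents, so the sketch puts a blank line between whole documents rather than between paragraphs. Whether that difference explains the accuracy gap is not established here.

```python
import re

# Naive rule (an assumption for illustration): a sentence ends at '.', '!' or '?'
# followed by whitespace. Abbreviations such as "Mr." will be split incorrectly.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def split_sentences(text):
    """Split raw text into sentences using the naive punctuation rule above."""
    flat = " ".join(text.split())  # collapse newlines and repeated spaces
    return [s for s in _SENTENCE_END.split(flat) if s]


def format_documents(documents):
    """Return one sentence per line, with a blank line between documents."""
    blocks = ["\n".join(split_sentences(doc)) for doc in documents]
    return "\n\n".join(b for b in blocks if b) + "\n"


if __name__ == "__main__":
    # Two tiny stand-in "books"; real input would be the full book texts.
    books = [
        "It was a dark night. The rain fell.",
        "Call me Ishmael. Some years ago I went to sea.",
    ]
    print(format_documents(books), end="")
```

Writing the returned string to a .txt file gives input in the shape that the --input_file flag of create_pretraining_data.py expects, under the assumptions above.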
python create_pretraining_data.py \
  --input_file=../Gutenberg/sentence_per_line/*.txt \
  --output_file=../Gutenberg/output/tokenized_tf_examples.tfrecord \
  --vocab_file=$BERT-BASE/vocab.txt \
  --do_lower_case=True \
  --max_seq_length=128 \
  --max_predictions_per_seq=20 \
  --masked_lm_prob=0.15 \
  --random_seed=12345 \
  --dupe_factor=5

python run_pretraining.py \
  --input_file=../Gutenberg/output/tokenized_tf_examples.tfrecord \
  --output_dir=../Gutenberg/output/pretraining_output \
  --do_train=False \
  --do_eval=True \
  --bert_config_file=$BERT-BASE/bert_config.json \
  --init_checkpoint=$BERT-BASE/bert_model.ckpt \
  --train_batch_size=32 \
  --max_seq_length=128 \
  --max_predictions_per_seq=20 \
  --num_train_steps=20 \
  --num_warmup_steps=10 \
  --learning_rate=2e-5
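To compare the evaluation numbers from the run above against the paper, it can help to read them programmatically. The sketch below assumes that run_pretraining.py writes its metrics to a file named eval_results.txt under --output_dir as `key = value` lines, with a key called next_sentence_accuracy; that matches the script in the repository as far as we can tell, but please verify it against your copy. `parse_eval_results` and `read_eval_results` are hypothetical helper names, and the sample text is illustrative rather than real output.

```python
def parse_eval_results(text):
    """Parse 'key = value' lines into a dict, keeping only numeric values."""
    results = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if not sep:
            continue  # no '=' on this line, so it is not a metric
        try:
            results[key.strip()] = float(value)
        except ValueError:
            continue  # value is not a number; skip it
    return results


def read_eval_results(path):
    """Read and parse an eval results file from disk."""
    with open(path, encoding="utf-8") as handle:
        return parse_eval_results(handle.read())


if __name__ == "__main__":
    # Illustrative sample only: 0.78 mirrors the accuracy reported in this
    # post, and the masked LM value is a made-up placeholder.
    sample = "masked_lm_accuracy = 0.5\nnext_sentence_accuracy = 0.78\n"
    metrics = parse_eval_results(sample)
    gap = 0.97 - metrics["next_sentence_accuracy"]
    print("NSP accuracy:", metrics["next_sentence_accuracy"])
    print("Gap to the lower bound reported in the paper: %.2f" % gap)
```

With the output directory used above, the call would be read_eval_results("../Gutenberg/output/pretraining_output/eval_results.txt"), assuming the file name is as described.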

Source: https://github.com/google-research/bert/issues/380