create_pretraining_data.py writes 0 records to tf_examples.tfrecord when training a custom BERT model

Problem description

I am building a custom BERT model on my own corpus. I generated the vocab file with BertWordPieceTokenizer and then ran the following command:

!python create_pretraining_data.py --input_file=/content/drive/My Drive/internet_archive_scifi_v3.txt --output_file=/content/sample_data/tf_examples.tfrecord --vocab_file=/content/sample_data/sifi_13sep-vocab.txt --do_lower_case=True --max_seq_length=128 --max_predictions_per_seq=20 --masked_lm_prob=0.15 --random_seed=12345 --dupe_factor=5

The output was:

INFO:tensorflow:*** Reading from input files ***

INFO:tensorflow:*** Writing to output files ***

INFO:tensorflow: /content/sample_data/tf_examples.tfrecord

I am not sure why the log always ends with INFO:tensorflow:Wrote 0 total instances. Am I doing something wrong?

FYI: I am using tf_examples.tfrecord as the output file, and the generated vocab file is 290 KB.

Solution

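The input file cannot be read. The space in "My Drive" is not escaped, so the shell splits the path into two arguments and --input_file receives only /content/drive/My. Use "My\ Drive" instead of "My Drive":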

--input_file=/content/drive/My\ Drive/internet_archive_scifi_v3.txt
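The effect of the missing escape can be reproduced without running BERT at all. The sketch below uses Python's standard shlex module, which follows POSIX shell splitting rules, to show how the argument is tokenized in each case. It only illustrates the splitting behaviour and is not part of create_pretraining_data.py; the variable names are made up for the example.

```python
import shlex

# Path from the question; the directory name contains a space.
path = "/content/drive/My Drive/internet_archive_scifi_v3.txt"

# Unescaped: a POSIX shell splits at the space, so the flag only
# receives the part of the path before "Drive".
unquoted = shlex.split("--input_file=" + path)
print(unquoted)

# Escaped with a backslash, as in the answer: one single argument.
escaped = shlex.split(r"--input_file=/content/drive/My\ Drive/internet_archive_scifi_v3.txt")
print(escaped)

# Alternative: let shlex.quote add the quoting when building a command string.
quoted = shlex.quote("--input_file=" + path)
print(quoted)
```

Wrapping the whole path in quotes on the command line, for example --input_file="/content/drive/My Drive/internet_archive_scifi_v3.txt", should have the same effect as the backslash escape.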

bert-language-model google-natural-language nlp tensorflow
