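Loading manually annotated data to train an RNN POS tagger

Problem description

I have a large amount of manually annotated data. I want to train a part-of-speech tagger with an RNN. The data looks like the text below: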

Lorem <NP> Ipsum <NP> dummy <N> text <ADV> printing <VREL> typesetting <NUMCR> Ipsum <VREL> Ipsum <NP> Ipsum <NP> Lorem <N> Ipsum <NP> Ipsum <N> Ipsum <NP> Lorem <ADJ> Lorem <NP> Ipsum <N> Lorem <VN> Lorem <ADJ> Lorem <N> Lorem <N> ፣ <PUNC> Lorem <ADJ> Lorem <ADJ> Ipsum <NC> Ipsum <NC> Ipsum <NP>

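Please guide me on how to load this data to train an RNN-based tagger.

Solution

To read this file, I suggest converting it to a tsv file with one token and its tag per line and a blank line between examples (also known as CoNLL format), as follows: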

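src_fp, tgt_fp = "source/file/path.txt", "target/file/path.tsv"
with open(src_fp, encoding="utf-8") as src_f, open(tgt_fp, "w", encoding="utf-8") as tgt_f:
    for line in src_f:
        tokens = line.split()  # split on any whitespace; drops the trailing newline
        if not tokens:
            continue  # skip empty lines in the source file
        words, tags = tokens[0::2], tokens[1::2]
        for w, t in zip(words, tags):
            tgt_f.write(w + "\t" + t + "\n")
        tgt_f.write("\n")  # one blank line after each example, not after each token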
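
The conversion above assumes that every word is followed by exactly one tag. As a minimal sketch of the same splitting step with a sanity check added, the helper below (`parse_annotated_line` is a hypothetical name, not part of any library) verifies that the tokens alternate between words and `<TAG>` markers before returning (word, tag) pairs. It assumes tags are always wrapped in angle brackets, as in the sample data.

```python
def parse_annotated_line(line):
    """Split 'word <TAG> word <TAG> ...' into a list of (word, tag) pairs."""
    tokens = line.split()  # split on any whitespace; drops the trailing newline
    if len(tokens) % 2 != 0:
        raise ValueError("expected an even number of tokens, got %d" % len(tokens))
    pairs = []
    for word, tag in zip(tokens[0::2], tokens[1::2]):
        # Assumption: tags are always wrapped in angle brackets, as in the sample.
        if not (tag.startswith("<") and tag.endswith(">")):
            raise ValueError("malformed tag %r after word %r" % (tag, word))
        pairs.append((word, tag))
    return pairs


sample = "Lorem <NP> Ipsum <NP> dummy <N> text <ADV>\n"
pairs = parse_annotated_line(sample)
print(pairs)
```

Failing loudly on a misaligned line is usually preferable here, because a single missing tag would otherwise shift every following word/tag pair on that line.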

Then you can read it with SequenceTaggingDataset from torchtext.datasets, as follows:

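import torchtext
from torchtext import data  # Field-based API; some later torchtext releases keep it under torchtext.legacy

text_field, label_field = data.Field(), data.Field()
pos_dataset = torchtext.datasets.SequenceTaggingDataset(
    path="data/pos/pos_wsj_train.tsv",  # replace with the path to your tsv file
    fields=[("text", text_field), ("labels", label_field)])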
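
If the `Field`/`SequenceTaggingDataset` API is not available in your torchtext version (to my knowledge it was moved to `torchtext.legacy` and later removed), the tsv is simple enough to read by hand. A minimal sketch, assuming a two-column file with a blank line between sentences; `read_conll` is a hypothetical helper name:

```python
import os
import tempfile


def read_conll(path):
    """Read a two-column tsv with blank lines between sentences.

    Returns a list of (words, tags) tuples, one per sentence.
    """
    sentences, words, tags = [], [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:                 # blank line closes the current sentence
                if words:
                    sentences.append((words, tags))
                    words, tags = [], []
                continue
            word, tag = line.split("\t")
            words.append(word)
            tags.append(tag)
    if words:                            # the file may not end with a blank line
        sentences.append((words, tags))
    return sentences


# Tiny round trip on a temporary file to show the expected shape.
with tempfile.NamedTemporaryFile("w", suffix=".tsv", delete=False, encoding="utf-8") as tmp:
    tmp.write("Lorem\t<NP>\nIpsum\t<N>\n\ntext\t<ADV>\n\n")
sentences = read_conll(tmp.name)
os.remove(tmp.name)
print(sentences)
```

Each sentence comes back as a pair of parallel lists, which is roughly the information the `text` and `labels` fields carry per example.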

The last step is to build the vocabularies and iterate over the data:

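text_field.build_vocab(pos_dataset)
label_field.build_vocab(pos_dataset)  # the tag field needs its own vocabulary
# MY_BATCH_SIZE and MY_DEVICE are placeholders for your own settings
train_iter = data.BucketIterator(
    pos_dataset, batch_size=MY_BATCH_SIZE, device=MY_DEVICE)
# using the iterator; train() stands for your own training step
for ex in train_iter:
    train(ex.text, ex.labels)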
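
To make concrete what building a vocabulary and batching involve, here is a rough pure-Python sketch: map each token to an integer id, then pad every sentence in a batch to the same length. The names `make_vocab` and `encode_batch` and the `<pad>`/`<unk>` symbols are illustrative choices, not torchtext internals.

```python
def make_vocab(sequences, specials=("<pad>", "<unk>")):
    """Assign an integer id to every distinct item, after the special symbols."""
    vocab = {s: i for i, s in enumerate(specials)}
    for seq in sequences:
        for item in seq:
            if item not in vocab:
                vocab[item] = len(vocab)
    return vocab


def encode_batch(sequences, vocab, pad="<pad>", unk="<unk>"):
    """Convert sequences to id lists, padded on the right to equal length."""
    max_len = max(len(seq) for seq in sequences)
    batch = []
    for seq in sequences:
        ids = [vocab.get(item, vocab[unk]) for item in seq]  # unseen items map to <unk>
        ids += [vocab[pad]] * (max_len - len(ids))
        batch.append(ids)
    return batch


sentences = [["Lorem", "Ipsum", "text"], ["Ipsum", "dummy"]]
word_vocab = make_vocab(sentences)
batch = encode_batch(sentences, word_vocab)
print(batch)
```

The same two functions can be applied to the tag sequences, so that words and tags end up as aligned integer matrices of identical shape.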

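I suggest you spend some time reading the documentation of the functions used above so that you can adapt them to your needs (maximum vocabulary size, whether to shuffle examples, sequence length, etc.). For building an RNN for classification, the official PyTorch tutorial is easy to follow. So I suggest you start there and adapt the network input and output from sequence classification (one label per text span) to sequence tagging (one label per token).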
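
The change from sequence classification to sequence tagging can be sketched as follows: instead of classifying only the final hidden state, apply the output layer to the hidden state at every time step. This is a minimal sketch rather than the tutorial's code. The class name `RNNTagger`, the layer sizes, and the use of an LSTM with `batch_first=True` are assumptions made for illustration (torchtext's `Field` yields sequence-first tensors by default, so a transpose may be needed), and the batch is random dummy data.

```python
import torch
import torch.nn as nn


class RNNTagger(nn.Module):
    """Embedding -> LSTM -> linear layer applied at every time step."""

    def __init__(self, vocab_size, tagset_size, embed_dim=64, hidden_dim=128, pad_idx=0):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=pad_idx)
        self.rnn = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, tagset_size)

    def forward(self, token_ids):
        # token_ids: (batch, seq_len) -> logits: (batch, seq_len, tagset_size)
        embedded = self.embedding(token_ids)
        hidden_states, _ = self.rnn(embedded)  # keep every step, not just the last
        return self.out(hidden_states)


model = RNNTagger(vocab_size=100, tagset_size=10)
tokens = torch.randint(1, 100, (4, 7))  # dummy batch: 4 sentences of 7 token ids
gold = torch.randint(1, 10, (4, 7))     # dummy gold tag ids
logits = model(tokens)

# CrossEntropyLoss expects (N, C) inputs, so flatten the batch and time axes.
# ignore_index=0 skips positions whose gold tag is the padding id (assumed to be 0).
loss_fn = nn.CrossEntropyLoss(ignore_index=0)
loss = loss_fn(logits.reshape(-1, logits.size(-1)), gold.reshape(-1))
print(logits.shape, loss.item())
```

Because the loss is averaged over all non-padding tokens, sentences of different lengths can share a batch without the padding affecting the gradient.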

deep-learning part-of-speech pytorch recurrent-neural-network