How do I train a textual entailment model with my own training set?

Problem description

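I want to train the decomposable attention + ELMo; SNLI model from the demo on my own dataset. I am new to NLP. After going through the guide, I still don't know how to get started from my own training set, which consists of plain-text premises, hypotheses, and labels. The data format is shown below.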

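From the training command shown on the demo, I found that its training set is https://allennlp.s3.amazonaws.com/datasets/snli/snli_1.0_train.jsonl. How can I generate a training set like this from my own data?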

FYI, my dataset looks like this:

{ "premise":"sentences","hypothesis":"sentences","label":"x"}
{ "premise":"sentences","label":"y"}
...

The entries in snli_1.0_train.jsonl look like this:

{"annotator_labels": ["neutral"],"captionID": "3416050480.jpg#4","gold_label": "neutral","pairID": "3416050480.jpg#4r1n","sentence1": "A person on a horse jumps over a broken down airplane.","sentence1_binary_parse": "( ( ( A person ) ( on ( a horse ) ) ) ( ( jumps ( over ( a ( broken ( down airplane ) ) ) ) ) . ) )","sentence1_parse": "(ROOT (S (NP (NP (DT A) (NN person)) (PP (IN on) (NP (DT a) (NN horse)))) (VP (VBZ jumps) (PP (IN over) (NP (DT a) (JJ broken) (JJ down) (NN airplane)))) (. .)))","sentence2": "A person is training his horse for a competition.","sentence2_binary_parse": "( ( A person ) ( ( is ( ( training ( his horse ) ) ( for ( a competition ) ) ) ) . ) )","sentence2_parse": "(ROOT (S (NP (DT A) (NN person)) (VP (VBZ is) (VP (VBG training) (NP (PRP$ his) (NN horse)) (PP (IN for) (NP (DT a) (NN competition))))) (. .)))"}

I would really appreciate it if anyone could help. Thanks.

Solution

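When applying AllenNLP to a new dataset, you usually need to implement a new DatasetReader. In this case, you can either adapt the existing SnliReader to your dataset's format, or change your dataset's format so that it works with the existing SnliReader. As the SnliReader source shows, the reader only looks for 3 fields: "gold_label" (your "label"), "sentence1" (your "premise"), and "sentence2" (your "hypothesis").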
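
The second option (reformatting the data) could be sketched roughly as follows. This is a minimal illustration, not part of AllenNLP: it assumes the input is a JSON Lines file with "premise", "hypothesis", and "label" keys as in the question, and the helper names (to_snli_record, convert_lines, convert_file) and file names are hypothetical. It only renames the fields; the parse and annotator fields found in the original SNLI file are not generated.

```python
import json

# Assumed mapping from the question's field names to the three
# SNLI-style field names mentioned above.
FIELD_MAP = {
    "premise": "sentence1",
    "hypothesis": "sentence2",
    "label": "gold_label",
}


def to_snli_record(example):
    """Rename the fields of one example.

    Raises KeyError if the example lacks any of the three required keys,
    so incomplete rows are reported rather than silently written out.
    """
    return {snli_key: example[own_key] for own_key, snli_key in FIELD_MAP.items()}


def convert_lines(lines):
    """Yield one SNLI-style JSON string per non-blank input line."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        yield json.dumps(to_snli_record(json.loads(line)))


def convert_file(src_path, dst_path):
    """Convert a whole JSON Lines file (paths are supplied by the caller)."""
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for out_line in convert_lines(src):
            dst.write(out_line + "\n")
```

Calling, for example, convert_file("my_train.jsonl", "my_train_snli.jsonl") (both file names are placeholders) would produce a file whose path can then be used as the training data path in place of snli_1.0_train.jsonl. Rows that are missing a field, such as a row with no "hypothesis", raise a KeyError and need to be fixed or removed first.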