ValueError: Unable to set entity information for token 27 which is included in more than one span in entities

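Problem description

I am trying to convert a dataset to the .spacy format by first converting it to Doc objects and then to a DocBin. The whole dataset file is accessible via Google Docs.

I run the following function: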

def converter(data,outputFile):
    nlp = spacy.blank("en") # load a new spacy model
    doc_bin = DocBin() # create a DocBin object

    for text,annot in tqdm(data): # data in previous format
        doc = nlp.make_doc(text) # create doc object from text    
        ents = []
        
        for start,end,label in annot["entities"]: # add character indexes
            # supported modes: strict,contract,expand
            span = doc.char_span(start,end,label=label,alignment_mode="strict")
            # to avoid having the traceback; 
            # TypeError: object of type 'NoneType' has no len()
            if span is None:
                pass
            else:
                ents.append(span)
        doc.ents = ents # label the text with the ents
        doc_bin.add(doc)
        
    doc_bin.to_disk(f"./{outputFile}.spacy") # save the docbin object
    return f"Processed {len(doc_bin)}"

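After running the function on the dataset, I get this traceback: ValueError: [E1010] Unable to set entity information for token 27 which is included in more than one span in entities, blocked, missing or outside.

After looking closely through the dataset file for the text that triggers this traceback, I found the following: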

[('HereLongText..(abstract)', {'entities': [('0', '27', 'Specificdisease'), ('80', '93', 'Specificdisease'), ('260', '278', 'Specificdisease'), ('615', '628', 'Specificdisease'), ('673', '691', 'Specificdisease'), ('754', '772', 'Specificdisease')]})]

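I don't know how to fix this.

Solution

I think this should make your problem clear. Here is a slightly modified version of your code that produces the same error.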

import spacy
from spacy.tokens import DocBin
from tqdm import tqdm

def converter(data,outputFile):
    nlp = spacy.blank("en")  # load a new spacy model
    doc_bin = DocBin()  # create a DocBin object

    for text,annot in tqdm(data):  # data in previous format
        doc = nlp.make_doc(text)  # create doc object from text
        ents = []

        for start,end,label in annot["entities"]:  # add character indexes
            # supported modes: strict,contract,expand

            span = doc.char_span(start,end,label=label,alignment_mode="strict")
            # to avoid having the traceback;
            # TypeError: object of type 'NoneType' has no len()
            if span is None:
                pass
            else:
                ents.append(span)
        doc.ents = ents  # label the text with the ents
        doc_bin.add(doc)

    doc_bin.to_disk(f"./{outputFile}.spacy")  # save the docbin object
    return f"Processed {len(doc_bin)}"


data = [("I like cheese",{"entities": [
        (0,1,"Sample"),
        (0,1,"Sample"),  # Same thing twice
        ]})]

converter(data,"out.txt")

Note that in the example there are two annotations for the exact same span. If you remove one of them, the error does not occur.
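To locate annotations like this in a larger dataset, a small checker can help. The following is only a sketch: `find_overlaps` is a hypothetical helper, not part of spaCy, and it assumes each annotation is a `(start, end, label)` tuple of character offsets. It compares character ranges only, so it reports exact duplicates as well as partially overlapping ranges.

```python
def find_overlaps(entities):
    """Return pairs of (start, end, label) annotations whose character ranges overlap."""
    # Offsets are coerced to int because the dataset in the question stores them as strings.
    spans = sorted((int(start), int(end), label) for start, end, label in entities)
    overlaps = []
    for i, (start, end, label) in enumerate(spans):
        for other in spans[i + 1:]:
            if other[0] >= end:
                break  # spans are sorted by start, so no later span can overlap this one
            overlaps.append(((start, end, label), other))
    return overlaps


# The duplicated annotation from the example above is reported as one overlapping pair.
print(find_overlaps([(0, 1, "Sample"), (0, 1, "Sample"), (7, 13, "Food")]))
```

Running this over each record's `annot["entities"]` list should point to the annotations that need to be merged or removed before building the DocBin.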

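You are probably getting the error because some of your annotations overlap, which makes them unusable as entities: doc.ents allows each token to belong to at most one entity span.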
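If cleaning the annotations by hand is not practical, one possible workaround is to let spaCy drop the conflicting spans before assigning them. The sketch below uses `spacy.util.filter_spans`, which removes duplicate and overlapping spans and prefers the longest one. The name `build_doc_bin` and the `int()` conversion of offsets are assumptions made for this illustration, and silently discarding spans may not be what you want for training data.

```python
import spacy
from spacy.tokens import DocBin
from spacy.util import filter_spans


def build_doc_bin(data):
    """Build a DocBin, dropping duplicate or overlapping entity spans."""
    nlp = spacy.blank("en")
    doc_bin = DocBin()
    for text, annot in data:
        doc = nlp.make_doc(text)
        spans = []
        for start, end, label in annot["entities"]:
            # int() is an assumption: the dataset in the question stores offsets as strings
            span = doc.char_span(int(start), int(end), label=label, alignment_mode="strict")
            if span is not None:  # None means the offsets do not align with token boundaries
                spans.append(span)
        # filter_spans keeps the longest span where spans overlap and drops the rest
        doc.ents = filter_spans(spans)
        doc_bin.add(doc)
    return doc_bin


data = [("I like cheese", {"entities": [(0, 1, "Sample"), (0, 1, "Sample")]})]
doc_bin = build_doc_bin(data)
print(len(doc_bin))
```

With the duplicated annotation from the example, this no longer raises E1010, because only one of the two identical spans reaches `doc.ents`.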
