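Problem description
I am trying to convert a dataset to .spacy by first converting each text to a doc and then adding it to a DocBin. The whole dataset file is accessible via GoogleDocs.
I run the following function: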
def converter(data, outputFile):
    nlp = spacy.blank("en")  # load a new spacy model
    doc_bin = DocBin()  # create a DocBin object
    for text, annot in tqdm(data):  # data in previous format
        doc = nlp.make_doc(text)  # create doc object from text
        ents = []
        for start, end, label in annot["entities"]:  # add character indexes
            # supported modes: strict, contract, expand
            span = doc.char_span(start, end, label=label, alignment_mode="strict")
            # to avoid having the traceback;
            # TypeError: object of type 'NoneType' has no len()
            if span is None:
                pass
            else:
                ents.append(span)
        doc.ents = ents  # label the text with the ents
        doc_bin.add(doc)
    doc_bin.to_disk(f"./{outputFile}.spacy")  # save the docbin object
    return f"Processed {len(doc_bin)}"
After running the function on the dataset, I got this traceback:
ValueError: [E1010] Unable to set entity information for token 27 which is included in more than one span in entities, blocked, missing or outside.
After looking closely at the dataset file to find the text that raises this traceback, I found the following:
[('HereLongText..(abstract)',{'entities': [('0','27','Specificdisease'),('80','93',('260','278',('615','628',('673','691',('754','772','Specificdisease')]})]
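I don't know how to solve this problem.
Solution
I think this should make the problem clear. Here is a slightly modified version of your code that raises the same error.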
import spacy
from spacy.tokens import DocBin
from tqdm import tqdm

def converter(data, outputFile):
    nlp = spacy.blank("en")  # load a new spacy model
    doc_bin = DocBin()  # create a DocBin object
    for text, annot in tqdm(data):  # data in previous format
        doc = nlp.make_doc(text)  # create doc object from text
        ents = []
        for start, end, label in annot["entities"]:  # add character indexes
            # supported modes: strict, contract, expand
            span = doc.char_span(start, end, label=label, alignment_mode="strict")
            # to avoid having the traceback;
            # TypeError: object of type 'NoneType' has no len()
            if span is None:
                pass
            else:
                ents.append(span)
        doc.ents = ents  # label the text with the ents (raises E1010 here)
        doc_bin.add(doc)
    doc_bin.to_disk(f"./{outputFile}.spacy")  # save the docbin object
    return f"Processed {len(doc_bin)}"

data = [("I like cheese", {"entities": [
    (0, 1, "Sample"), (0, 1, "Sample"),  # Same thing twice
]})]
converter(data, "out.txt")
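Note that in the example the exact same span is annotated twice. If you remove one of the two annotations, the error goes away.
You are most likely getting the error because some of your annotations overlap. doc.ents cannot hold overlapping spans, since each token may belong to at most one entity.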
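To track down the offending annotations before building the DocBin, something along these lines may help. It is only a sketch working on the raw (start, end, label) tuples, and the helper names find_overlaps and drop_overlaps are made up for illustration. Offsets are passed through int() on the assumption that they may be stored as strings, as in the dataset excerpt in the question. Keeping the longest annotation of each overlapping group is just one possible policy.

```python
def find_overlaps(entities):
    """Return pairs of (start, end, label) annotations whose character ranges overlap."""
    # Coerce offsets to int: the dataset excerpt in the question stores them as strings.
    spans = sorted((int(s), int(e), label) for s, e, label in entities)
    overlaps = []
    for i, (s1, e1, l1) in enumerate(spans):
        for s2, e2, l2 in spans[i + 1:]:
            if s2 >= e1:
                break  # sorted by start, so no later span can overlap this one
            overlaps.append(((s1, e1, l1), (s2, e2, l2)))
    return overlaps


def drop_overlaps(entities):
    """Keep the longest annotation of each overlapping group (ties: earliest start)."""
    spans = [(int(s), int(e), label) for s, e, label in entities]
    spans.sort(key=lambda x: (-(x[1] - x[0]), x[0]))  # longest first
    kept = []
    for s, e, label in spans:
        # Keep the span only if it lies entirely before or after every kept span.
        if all(e <= ks or s >= ke for ks, ke, _ in kept):
            kept.append((s, e, label))
    return sorted(kept)


entities = [(0, 1, "Sample"), (0, 1, "Sample")]
print(find_overlaps(entities))
print(drop_overlaps(entities))
```

You could run find_overlaps on annot["entities"] for each text to see which annotations clash, then either fix the dataset or pass drop_overlaps(annot["entities"]) into the loop. If you would rather filter after creating the Span objects, spaCy also ships spacy.util.filter_spans, which removes overlapping spans from a list and prefers the longest ones.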