Remove the leading determiner in spaCy noun chunks/entities

Problem description

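I am trying to bootstrap a first set of training data with a new entity type, for use with spaCy's NER model. Most of my existing examples consist of single-word entities, but I am trying to merge them to get more specific concepts.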

Take the following sample of acceptable entities and a test string (see the full code at the bottom):

ent_list_sample = ['algorithm','data','engineering','software']
test_string = "We introduce a software-engineering inspired classification algorithm for dealing with bioinformatics data."

In this particular case, using the words in ent_list_sample with spaCy's EntityRuler and then merging the results with the doc.noun_chunks spans yields more acceptable entities.

print(doc.ents)
# (a software-engineering inspired classification algorithm,bioinformatics data)

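Question: How can I remove the determiner "a" from the first entity so that it becomes "software-engineering inspired classification algorithm"? How does spaCy handle leading determiners in noun chunks? And if most of my existing entities are single words, is EntityRuler appropriate for this bootstrapping task?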

MWE code

import spacy
from spacy.pipeline import EntityRuler
from spacy.util import filter_spans

ent_list_sample = ['algorithm','data','engineering','software']
test_string = "We introduce a software-engineering inspired classification algorithm for dealing with bioinformatics data."


print("test_string:\n\t",test_string,"\n")

print("Default:\n-----------")
nlp = spacy.load("en")
doc = nlp(test_string)
print("Noun chunks:")
print(list(doc.noun_chunks),"\n")
print("Entities:")
print(doc.ents,"\n-------------------------------------------------------\n\n")


print("Adding patterns to EntityRuler:\n-----------")
patterns = []
for concept in ent_list_sample:
    doc = nlp.make_doc(concept)
    if len(doc) > 1:
        patterns.append({"label": "SCI","pattern":[{"LOWER":term.text.lower()} for term in doc]})
    else:
        patterns.append({"label": "SCI","pattern":doc.text.lower()})
ruler = EntityRuler(nlp)
ruler.add_patterns(patterns)
nlp.add_pipe(ruler)

doc = nlp(test_string)
print("Entities:")
print(doc.ents)
print(list(ent.label_ for ent in doc.ents),"\n-------------------------------------------------------\n\n")


print("Merge entities with retokenizer:\n-----------")
spans = list(doc.ents) + list(doc.noun_chunks)
spans = filter_spans(spans)
list(doc.noun_chunks)
with doc.retokenize() as retokenizer:
    for span in spans:
        retokenizer.merge(span)
print("Entities:")
print(doc.ents)
print(list(ent.label_ for ent in doc.ents))

MWE output

test_string:
     We introduce a software-engineering inspired classification algorithm for dealing with bioinformatics data. 

Default:
-----------
Noun chunks:
[We,a software-engineering inspired classification algorithm,bioinformatics data] 

Entities:
() 
-------------------------------------------------------


Adding patterns to EntityRuler:
-----------
Entities:
(software,engineering,algorithm,data)
['SCI','SCI','SCI','SCI'] 
-------------------------------------------------------


Merge entities with retokenizer:
-----------
Entities:
(a software-engineering inspired classification algorithm,bioinformatics data)
['SCI','SCI'] 
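
One possible way to drop the leading "a", sketched here rather than taken from the question or from spaCy's documentation, is to move each span's start index past any leading tokens tagged DET before merging. The helper name trimmed_start is hypothetical, and the tokens below are hand-tagged stand-ins so the snippet runs without a trained model; on a real Doc the tags would come from the tagger and may differ.

```python
from types import SimpleNamespace


def trimmed_start(tokens, start, end, skip_pos=("DET",)):
    """Return the first index in [start, end) whose coarse POS tag is not in skip_pos.

    `tokens` is any indexable sequence of objects with a `pos_` attribute
    (a spaCy Doc fits this shape). At least one token is always kept.
    """
    while start < end - 1 and tokens[start].pos_ in skip_pos:
        start += 1
    return start


# Stand-in tokens with hand-assigned tags (an assumption, not real tagger output).
words = ["We", "introduce", "a", "software", "-", "engineering",
         "inspired", "classification", "algorithm"]
tags = ["PRON", "VERB", "DET", "NOUN", "PUNCT", "NOUN",
        "VERB", "NOUN", "NOUN"]
tokens = [SimpleNamespace(text=w, pos_=t) for w, t in zip(words, tags)]

# Pretend the noun chunk covers tokens 2..8, i.e. it starts at the determiner "a".
chunk_start, chunk_end = 2, 9
new_start = trimmed_start(tokens, chunk_start, chunk_end)
trimmed_words = [t.text for t in tokens[new_start:chunk_end]]
print(trimmed_words)
```

With a real Doc, the same index could be used to slice a trimmed span, for example doc[trimmed_start(doc, chunk.start, chunk.end):chunk.end] for each chunk in doc.noun_chunks, before the spans are passed to filter_spans and retokenizer.merge. This depends on the tagger labelling the determiner as DET, so it is worth checking on a few sentences from the actual corpus.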
