Problem description
I need the confidence scores for the labels predicted by the NER model 'de_core_news_lg'. In spaCy 2 there is a well-known solution:
import spacy
from collections import defaultdict

nlp = spacy.load('de_core_news_lg')
text = 'ich möchte mit frau Mustermann in der Musterbank sprechen'
doc = nlp.make_doc(text)
beams = nlp.entity.beam_parse([doc], beam_width=16, beam_density=0.0001)
# Accumulate, per entity, the score of every beam parse that contains it
entity_scores = defaultdict(float)
for score, ents in nlp.entity.moves.get_beam_parses(beams[0]):
    print(score, ents)
    for start, end, label in ents:
        entity_scores[(start, label)] += score
print('entity_scores', entity_scores)
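The accumulation step in that snippet does not depend on spaCy itself, so it can be looked at in isolation. The following is only a sketch: the parse list is made-up data in the shape the snippet iterates over (a score plus a list of (start, end, label) tuples), the function name aggregate_entity_scores is hypothetical, and the key includes the end token, which is a choice made here rather than something the snippet above does.

```python
from collections import defaultdict


def aggregate_entity_scores(parses):
    """Sum, per entity, the score of every candidate parse that contains it."""
    entity_scores = defaultdict(float)
    for score, ents in parses:
        for start, end, label in ents:
            entity_scores[(start, end, label)] += score
    return dict(entity_scores)


# Hypothetical beam output: (parse score, [(start_token, end_token, label), ...])
parses = [
    (0.5, [(4, 5, "PER"), (7, 8, "ORG")]),
    (0.25, [(4, 5, "PER")]),
    (0.25, [(7, 8, "LOC")]),
]

entity_scores = aggregate_entity_scores(parses)
for key, score in sorted(entity_scores.items()):
    print(key, score)
```

An entity that appears in several candidate parses ends up with a higher total than one that appears in a single low-scoring parse, which is why the summed value is read as a confidence.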
However, in spaCy 3 I get the following error:
AttributeError: 'German' object has no attribute 'entity'
Apparently the Language object no longer has an entity attribute.
Does anyone know how to get confidence scores in spaCy 3?
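Solution
The core of the answer is: use the pipeline component "beam_ner" and look at the EntityRecognizer.pyx code. Then there is the unit test test_beam_ner_scores() in test_ner.py, which pretty much shows how to do it. If you want to see how to modify your config.cfg, save the model (as done in make_nlp() below) and look at the saved model's config.cfg.
The problem is that it only works for the "model" generated by the unit test. For my real models (5000 documents of ~4k text each, training NER f-scores around 75%) it fails miserably. By "miserably" I mean that the "greedy" search finds my entities, but the "beam search" reports hundreds of tokens (even punctuation) with "scores" such as 0.013. And (based on offsets) those usually come from a small section of the document.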
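One way to cope with the flood of low-scoring spans is to drop everything below a cutoff before reporting. This is a minimal sketch, not a fix for the underlying behaviour: the function name filter_by_score, the cutoff of 0.4 and the sample scores are all assumptions made for illustration, and the dictionary only mimics the (start, end, label) -> score shape used in the code further down.

```python
def filter_by_score(scored_ents, min_score):
    """Keep spans whose score is at least min_score, highest score first."""
    kept = [(span, score) for span, score in scored_ents.items() if score >= min_score]
    return sorted(kept, key=lambda item: item[1], reverse=True)


# Hypothetical scores keyed by (start_token, end_token, label)
scored_ents = {
    (10, 12, "PERSON"): 0.81,
    (11, 12, "PERSON"): 0.013,
    (12, 13, "LOC"): 0.013,
    (40, 41, "ORG"): 0.42,
}

confident = filter_by_score(scored_ents, min_score=0.4)
for span, score in confident:
    print(span, score)
```

A sensible cutoff would have to be chosen per model, for example by checking scores against a held-out set; the value used here is arbitrary.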
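This is frustrating because I believe the spaCy training (for "beam_ner") uses the same code to "validate" training iterations, and the scores reported during training are almost decent (well, 10% below spaCy 2, but that happens for training both "ner" and "beam_ner").
So I am posting this in the hope that someone has better luck or can point out what I am doing wrong.
So far spaCy 3 has been a major disaster for me: I cannot get confidences, I can no longer use my GPU (I only have 6GB), Ray-based parallelization does not work (on Windows = experimental), and with "transformer"-based models my training NER scores are 10% worse than in spaCy 2.
Code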
import spacy
from spacy.lang.en import English
from spacy.training import Example

# Based upon test_ner.py test_beam_ner_scores()
TRAIN_DATA = [
    ("Who is Shaka Khan?", {"entities": [(7, 17, "PERSON")]}),
    ("I like London and Berlin.", {"entities": [(7, 13, "LOC"), (18, 24, "LOC")]}),
    ("You like Paris and Prague.", {"entities": [(9, 14, "LOC"), (19, 25, "LOC")]}),
]

def make_nlp(model_dir):
    # ORIGINALLY: Test that we can get confidence values out of the beam_ner pipe
    nlp = English()
    config = {"beam_width": 32, "beam_density": 0.001}
    ner = nlp.add_pipe("beam_ner", config=config)
    train_examples = []
    for text, annotations in TRAIN_DATA:
        train_examples.append(Example.from_dict(nlp.make_doc(text), annotations))
        for ent in annotations.get("entities"):
            ner.add_label(ent[2])
    optimizer = nlp.initialize()
    # update once
    losses = {}
    nlp.update(train_examples, sgd=optimizer, losses=losses)
    # save
    # if not model_dir.exists():
    #     model_dir.mkdir()
    nlp.to_disk(model_dir)
    print("Saved model to", model_dir)
    return nlp

def test_greedy(nlp, text):
    # Report predicted entities using the default 'greedy' search (no confidences)
    doc = nlp(text)
    print("GREEDY search")
    for ent in doc.ents:
        print("Greedy offset=", ent.start_char, "-", ent.end_char, ent.label_, "text=", ent.text)

def test_beam(nlp, text):
    # Report predicted entities using the beam search (beam_width 16 or higher)
    ner = nlp.get_pipe("beam_ner")
    # Get the prediction scores from the beam search
    doc = nlp.make_doc(text)
    docs = [doc]
    # beams = StateClass returned from ner.predict(docs)
    beams = ner.predict(docs)
    print("BEAM search,labels", ner.labels)
    # Show individual entities and their scores as reported
    scores = ner.scored_ents(beams)[0]
    for ent, sco in scores.items():
        # ent is a (start_token, end_token, label) tuple
        tok = doc[ent[0]]
        lbl = ent[2]
        spn = doc[ent[0]: ent[1]]
        print('Beam-search', ent[0], ent[1], 'offset=', tok.idx, lbl,
              'score=', sco, 'text=', spn.text.replace('\n', ' '))

MODEL_DIR = "./test_model"
TEST_TEXT = "I like London and Paris."

if __name__ == "__main__":
    # You may have to repeat make_nlp() several times to produce a semi-decent 'model'
    # nlp = make_nlp(MODEL_DIR)
    nlp = spacy.load(MODEL_DIR)
    test_greedy(nlp, TEST_TEXT)
    test_beam(nlp, TEST_TEXT)
The result should look like this (after repeating make_nlp() to produce a usable "model"):
GREEDY search
Greedy offset= 7 - 13 LOC text= London
Greedy offset= 18 - 23 LOC text= Paris
BEAM search,labels ('LOC','PERSON')
Beam-search 2 3 offset= 7 LOC score= 0.5315668466265199 text= London
Beam-search 4 5 offset= 18 LOC score= 0.7206478212662492 text= Paris
Beam-search 0 1 offset= 0 LOC score= 0.4679245513356703 text= I
Beam-search 3 4 offset= 14 LOC score= 0.4670399792743775 text= and
Beam-search 5 6 offset= 23 LOC score= 0.2799470367073933 text= .
Beam-search 1 2 offset= 2 LOC score= 0.21658368070744227 text= like
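Since the greedy search already picks the right spans in this output, one option is to report a beam score only for the spans that the greedy search found. The sketch below copies the token indices and scores from the output above into plain Python data; the function name attach_confidence and the default of 0.0 for spans missing from the beam results are assumptions made for illustration.

```python
def attach_confidence(greedy_ents, beam_scores, default=0.0):
    """Pair each greedy (start, end, label) span with its beam score, if any."""
    return [
        (start, end, label, beam_scores.get((start, end, label), default))
        for start, end, label in greedy_ents
    ]


# Token-level spans found by the greedy search (London, Paris)
greedy_ents = [(2, 3, "LOC"), (4, 5, "LOC")]

# Beam scores keyed by (start_token, end_token, label), taken from the output above
beam_scores = {
    (2, 3, "LOC"): 0.5315668466265199,
    (4, 5, "LOC"): 0.7206478212662492,
    (0, 1, "LOC"): 0.4679245513356703,
    (3, 4, "LOC"): 0.4670399792743775,
    (5, 6, "LOC"): 0.2799470367073933,
    (1, 2, "LOC"): 0.21658368070744227,
}

scored = attach_confidence(greedy_ents, beam_scores)
for start, end, label, score in scored:
    print(start, end, label, score)
```

With a real pipeline the greedy tuples would presumably be built from ent.start, ent.end and ent.label_ of each entity in doc.ents, which requires running both an "ner" and a "beam_ner" prediction on the same text.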
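Currently there is no good way to get confidence values for NER scores in spaCy v3. However, there is a SpanCategorizer component in development that will make this easy. It is not certain, but we hope to release it in the next minor version. You can follow development in the PR for the feature or read more about it here.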