Updating spaCy's built-in NER model instead of overwriting it

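Question

I am using one of spaCy's built-in models, en_core_web_lg, and want to train it further with my own custom entities. While doing so, I am running into two problems:

  1. The new training data overwrites what the model already knew, so it no longer recognizes other entities. For example, before training it could recognize PERSON and ORG, but after training it cannot.

  2. During training, it gives me the following warning: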

UserWarning: [W030] Some entities Could not be aligned in the text "('I work in Google.',)" with entities "[(9,15,'ORG')]". Use `spacy.training.offsets_to_biluo_tags(nlp.make_doc(text),entities)` to check the alignment. Misaligned entities ('-') will be ignored during training.

Here is my full code:

import spacy
import random
from spacy.util import minibatch,compounding
from pathlib import Path
from spacy.training.example import Example
sentence = ""
body1 = "James work in Facebook and love to have tuna fishes in the breafast."
nlp_lg = spacy.load("en_core_web_lg")
print(nlp_lg.pipe_names)
doc = nlp_lg(body1)
for ent in doc.ents:
    print(ent.text,ent.start_char,ent.end_char,ent.label_)


train = [
    ('I had tuna fish in breakfast',{'entities': [(6,14,'FOOD')]}),
    ('I love prawns the most',{'entities': [(6,12,'FOOD')]}),
    ('fish is the rich source of protein',{'entities': [(0,4,'FOOD')]}),
    ('I work in Google.',{'entities': [(9,15,'ORG')]})
    ]


ner = nlp_lg.get_pipe("ner")

for _,annotations in train:
    for ent in annotations.get("entities"):
        ner.add_label(ent[2])

disable_pipes = [pipe for pipe in nlp_lg.pipe_names if pipe != 'ner']

with nlp_lg.disable_pipes(*disable_pipes):
    optimizer = nlp_lg.resume_training()
    for interation in range(30):
        random.shuffle(train)
        losses = {}

        batches = minibatch(train,size=compounding(1.0,4.0,1.001))
        for batch in batches:
            text,annotation = zip(*batch)
            doc1 = nlp_lg.make_doc(str(text))
            example = Example.from_dict(doc1,annotations)
            nlp_lg.update(
                [example],drop = 0.5,losses = losses,sgd = optimizer
                )
            print("Losses",losses)

doc = nlp_lg(body1)
for ent in doc.ents:
    print(ent.text,ent.label_)

Expected output

James 0 5 PERSON
Facebook 14 22 ORG
tuna fishes 40 51 FOOD

Currently it does not recognize any entities at all.

Please tell me where I am going wrong. Thanks!

Answer

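The "overwriting" you describe is called "catastrophic forgetting"; there's a post on the spaCy blog that describes it. There is no perfect workaround, but we recently made a fix related to this, linked here.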
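
One commonly suggested way to reduce catastrophic forgetting is to mix "revision" examples into the new training data: run the unmodified model over some text, keep the entities it already predicts, and train on those alongside the custom annotations. The sketch below covers only the merging step, in plain Python. `merge_entities` is a hypothetical helper, not part of spaCy, and the predicted spans are hard-coded from the question's expected output; in practice they would presumably come from something like `[(e.start_char, e.end_char, e.label_) for e in nlp(text).ents]` on the original model.

```python
def merge_entities(predicted, custom):
    """Combine spans predicted by the original model with custom spans.

    NER annotations must not overlap, so a predicted span is dropped
    whenever it overlaps one of the custom spans.
    """
    merged = list(custom)
    for p_start, p_end, p_label in predicted:
        overlaps = any(
            p_start < c_end and c_start < p_end
            for c_start, c_end, _ in custom
        )
        if not overlaps:
            merged.append((p_start, p_end, p_label))
    return sorted(merged)


text = "James work in Facebook and love to have tuna fishes in the breafast."

# Assumption: what the original model predicts before any custom training
# (taken from the expected output in the question).
predicted = [(0, 5, "PERSON"), (14, 22, "ORG")]

# The new custom annotation for the same sentence.
custom = [(40, 51, "FOOD")]

# A training example that teaches FOOD while reminding the model of
# the PERSON and ORG entities it already knew.
revision_example = (text, {"entities": merge_entities(predicted, custom)})
print(revision_example)
```

Examples built this way can be appended to the `train` list, so each update sees old labels as well as new ones. How many revision examples are needed is something to tune on your own data.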

As for your alignment error...

"('I work in Google.',)" with entities "[(9,15,'ORG')]"

Your character offsets are off.

"I work in Google."[9:15]
# => " Googl"

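Perhaps they are all off by a constant value, in which case you could fix this by adding one to every offset, but you will need to look at your data to figure that out.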
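
A quick way to look at the data is to slice each text with its offsets and check whether the span starts and ends on word boundaries. The sketch below does this in plain Python. `is_word_aligned`, `find_misaligned` and `shift_entities` are hypothetical helpers, and the word-boundary test is only a rough stand-in for spaCy's tokenizer; the `offsets_to_biluo_tags` call named in the warning is the authoritative check. Note also that the warning quotes the text as `('I work in Google.',)`, which suggests the loop passed the string form of a tuple to `make_doc`, so the offsets should be checked against each raw example string, as done here.

```python
def is_word_aligned(text, start, end):
    """Rough check that text[start:end] covers whole words only."""
    if not (0 <= start < end <= len(text)):
        return False
    covered = text[start:end]
    if covered != covered.strip():
        return False  # span begins or ends on whitespace
    starts_inside_word = start > 0 and text[start - 1].isalnum() and text[start].isalnum()
    ends_inside_word = end < len(text) and text[end - 1].isalnum() and text[end].isalnum()
    return not (starts_inside_word or ends_inside_word)


def find_misaligned(train):
    """Return (text, span, covered_text) for every suspicious span."""
    problems = []
    for text, annotation in train:
        for start, end, label in annotation["entities"]:
            if not is_word_aligned(text, start, end):
                problems.append((text, (start, end, label), text[start:end]))
    return problems


def shift_entities(entities, delta):
    """Move every span by a constant number of characters."""
    return [(start + delta, end + delta, label) for start, end, label in entities]


# Two examples taken from the question's training data.
train = [
    ("I had tuna fish in breakfast", {"entities": [(6, 14, "FOOD")]}),
    ("I work in Google.", {"entities": [(9, 15, "ORG")]}),
]

for text, span, covered in find_misaligned(train):
    print(repr(text), span, "covers", repr(covered))

# Adding one to every offset repairs the second example but not the first,
# so a constant shift is not a safe blanket fix for this data.
shifted = [(t, {"entities": shift_entities(a["entities"], 1)}) for t, a in train]
for text, span, covered in find_misaligned(shifted):
    print("still misaligned after +1:", repr(text), span, "covers", repr(covered))
```

On these two examples, the first span covers "tuna fis" and needs its end moved, while the second covers " Googl" and needs both ends shifted by one, which is why inspecting the data matters before applying a single correction.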
