Problem description
I am using the built-in spaCy model en_core_web_lg
and want to train it with my custom entities. When I do, I run into two problems:
-
The new training overwrites what the model learned before, so other entities are no longer recognized. For example, before training the model recognizes persons and organizations, but after training it cannot recognize them.
-
During training it gives me the following warning:
UserWarning: [W030] Some entities could not be aligned in the text "('I work in Google.',)" with entities "[(9,15,'ORG')]". Use `spacy.training.offsets_to_biluo_tags(nlp.make_doc(text), entities)` to check the alignment. Misaligned entities ('-') will be ignored during training.
Here is my full code:
import spacy
import random
from spacy.util import minibatch,compounding
from pathlib import Path
from spacy.training.example import Example
sentence = ""
body1 = "James work in Facebook and love to have tuna fishes in the breafast."
nlp_lg = spacy.load("en_core_web_lg")
print(nlp_lg.pipe_names)
doc = nlp_lg(body1)
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)
train = [
    ('I had tuna fish in breakfast', {'entities': [(6,14,'FOOD')]}),
    ('I love prawns the most', {'entities': [(7,13,'FOOD')]}),
    ('fish is the rich source of protein', {'entities': [(0,4,'FOOD')]}),
    ('I work in Google.', {'entities': [(9,15,'ORG')]})
]
ner = nlp_lg.get_pipe("ner")
for _, annotations in train:
    for ent in annotations.get("entities"):
        ner.add_label(ent[2])
disable_pipes = [pipe for pipe in nlp_lg.pipe_names if pipe != 'ner']
with nlp_lg.disable_pipes(*disable_pipes):
    optimizer = nlp_lg.resume_training()
    for interation in range(30):
        random.shuffle(train)
        losses = {}
        batches = minibatch(train, size=compounding(1.0, 4.0, 1.001))
        for batch in batches:
            text, annotation = zip(*batch)
            doc1 = nlp_lg.make_doc(str(text))
            example = Example.from_dict(doc1, annotations)
            nlp_lg.update(
                [example], drop=0.5, losses=losses, sgd=optimizer
            )
            print("Losses", losses)
doc = nlp_lg(body1)
for ent in doc.ents:
    print(ent.text, ent.label_)
Expected output:
James 0 5 PERSON
Facebook 14 22 ORG
tuna fishes 40 51 FOOD
Currently it does not recognize any entities at all.
Please tell me where I am going wrong. Thanks!
Solution
The "overwriting" you describe is known as "catastrophic forgetting"; there's a post on the spaCy blog describing it. There is no perfect answer to it, but we fixed this recently; see here.
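One common mitigation for forgetting (a sketch of the general "rehearsal" idea, not the specific fix linked above) is to mix examples of the labels the model already knows into the new training data, so the weights keep receiving positive evidence for PERSON and ORG alongside the new FOOD label. A minimal illustration in plain Python, where the rehearsal sentence and its offsets are invented for the example; in practice you could generate such examples by running the original en_core_web_lg over raw text and keeping its predictions:

```python
import random

# New examples for the custom FOOD label.
new_data = [
    ('I had tuna fish in breakfast', {'entities': [(6, 14, 'FOOD')]}),
]

# "Rehearsal" examples that re-state labels the pretrained model already
# knows (PERSON, ORG), so they are not forgotten during the update.
rehearsal_data = [
    ('James works in Facebook', {'entities': [(0, 5, 'PERSON'), (15, 23, 'ORG')]}),
]

# Train on the mixture so the old labels keep getting positive examples.
train = new_data + rehearsal_data
random.shuffle(train)
```

The exact ratio of new to rehearsal examples is something you would tune on your own data.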
Regarding your alignment error:
"('I work in Google.',)" with entities "[(9,15,'ORG')]"
Your character offsets are off.
"I work in Google."[9:15]
# => " Googl"
Maybe they are all off by a constant amount, in which case you could fix it by adding one to everything, but you will need to look at your data to figure that out.
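An easy way to sanity-check offsets before training is to slice the text yourself, or to compute the span with `str.find`. The `entity_offsets` helper below is hypothetical, not part of spaCy (spaCy's own check is the `offsets_to_biluo_tags` call the W030 warning recommends):

```python
def entity_offsets(text, span_text, label):
    """Return a (start, end, label) triple for the first occurrence of span_text."""
    start = text.find(span_text)
    if start == -1:
        raise ValueError(f"{span_text!r} not found in {text!r}")
    return (start, start + len(span_text), label)

text = "I work in Google."
print(text[9:15])                             # the misaligned span: " Googl"
print(entity_offsets(text, "Google", "ORG"))  # the aligned triple: (10, 16, 'ORG')
```

Slicing the text with the triple you are about to train on, and eyeballing the result, catches this class of bug immediately.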