Problem description
I trained a custom spaCy NER model on a sample test-case dataset, following https://towardsdatascience.com/train-ner-with-custom-training-data-using-spacy-525ce748fab7 and https://spacy.io/usage/processing-pipelines, to accurately find currencies in a given text.
Sample dataset:
TRAIN_DATA = [
    ('This is AFN currency', {'entities': [(8, 11, 'CUR')]}),
    ('I have EUR european currency', {'entities': [(7, 10, 'CUR')]}),
    ('let as have ALL money', {'entities': [(12, 15, 'CUR')]}),
    ('DZD is a dollar', {'entities': [(0, 3, 'CUR')]}),
    ('money USD united states', {'entities': [(6, 9, 'CUR')]}),
]
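Each entity is a (start, end, label) triple of character offsets into the text. A minimal sketch for checking hand-written offsets, using two of the examples; the helper name entity_spans is hypothetical and not part of spaCy:

```python
TRAIN_DATA = [
    ('This is AFN currency', {'entities': [(8, 11, 'CUR')]}),
    ('money USD united states', {'entities': [(6, 9, 'CUR')]}),
]


def entity_spans(text, annotations):
    """Return the substring and label selected by each (start, end, label) offset."""
    return [(text[start:end], label) for start, end, label in annotations['entities']]


for text, annotations in TRAIN_DATA:
    # Each printed substring should be exactly the currency code
    print(entity_spans(text, annotations))
```

If a printed substring is not the intended currency code, the offsets for that example are off.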
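The model trained successfully and was saved under the name "currency". It predicts well on the training data, where the labels are correct, but on unseen text it mostly predicts wrong labels.
Input test line: 'I have LOT of AZWSQTS Indian MZW currency USD INR'
Output:
AZWSQTS - CUR, LOT - CUR, MZW - CUR, USD - CUR, INR - CUR
Here, "AZWSQTS" and "LOT" are not currencies, but the model predicts them as such. This is the problem I am facing.
Full code: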
import random
from pathlib import Path

import spacy
from spacy.training import Example
from tqdm import tqdm


def spacy_train_model():
    """Train a spaCy NER model on the sample training data."""
    # List of currency codes (not used by the training loop below)
    currency_list = ['AFN','EUR','ALL','DZD','USD','AOA','XCD','ARS','AMD','AWG','SHP','AUD','AZN','','BSD','BHD','BDT','BBD','BYN','BZD','XOF','BMD','BTN','BOB','BAM','BWP','BRL','BND','BGN','BIF','CVE','KHR','XAF','CAD','KYD','NZD','CLP','CNY','cop','KMF','CDF','none','CRC','HRK','CUP','ANG','CZK','DKK','DJF','DOP','EGP','ERN','SZL','ETB','FKP','FJD','XPF','GMD','GEL','GHS','GIP','GTQ','GGP','GNF','GYD','HTG','HNL','HKD','HUF','ISK','INR','IDR','XDR','IRR','IQD','IMP','ILS','JMD','JPY','JEP','JOD','KZT','KES','KWD','KGS','LAK','LBP','LSL','LRD','LYD','CHF','MOP','MGA','MWK','MYR','MVR','MRU','MUR','MXN','MDL','MNT','MAD','MZN','MMK','NAD','NPR','NIO','NGN','KPW','MKD','NOK','omr','PKR','PGK','PYG','PEN','PHP','PLN','QAR','RON','RUB','RWF','WST','STN','SAR','RSD','SCR','sll','SGD','SBD','SOS','ZAR','GBP','KRW','ssp','LKR','SDG','SRD','SEK','SYP','TWD','TJS','TZS','THB','TOP','TTD','TND','TRY','TMT','UGX','UAH','AED','UYU','UZS','VUV','VES','VND','YER','ZMW','USD']
    # Sample training dataset format
    TRAIN_DATA = [
        ('This is AFN currency', {'entities': [(8, 11, 'CUR')]}),
        ('I have EUR europen currency', {'entities': [(7, 10, 'CUR')]}),
    ]
    # model = "en_core_web_lg"
    model = None
    output_dir = Path(r"D:\currency")  # Path to save training model - create new empty directory
    n_iter = 100

    # Load the model
    if model is not None:
        nlp = spacy.load(model)
        print("Loaded model '%s'" % model)
    else:
        nlp = spacy.blank('en')
        print("Created blank 'en' model")

    # Set up the pipeline. add_pipe returns the component that is actually in
    # the pipeline, so labels must be added to that object.
    if 'ner' not in nlp.pipe_names:
        ner = nlp.add_pipe('ner', last=True)
    else:
        ner = nlp.get_pipe('ner')
    for _, annotations in TRAIN_DATA:
        for ent in annotations.get('entities'):
            ner.add_label(ent[2])

    other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'ner']
    with nlp.select_pipes(disable=other_pipes):  # only train NER
        # initialize() resets the weights, so call it only for a blank model
        optimizer = nlp.initialize() if model is None else nlp.resume_training()
        for itn in range(n_iter):
            random.shuffle(TRAIN_DATA)
            losses = {}
            for text, annotations in tqdm(TRAIN_DATA):
                doc = nlp.make_doc(text)
                example = Example.from_dict(doc, annotations)
                nlp.update([example], drop=0.5, sgd=optimizer, losses=losses)
            print(losses)

    # Check the predictions on the training texts
    for text, _ in TRAIN_DATA:
        doc = nlp(text)
        print('Entities', [(ent.text, ent.label_) for ent in doc.ents])

    if output_dir is not None:
        output_dir = Path(output_dir)
        if not output_dir.exists():
            output_dir.mkdir()
        nlp.to_disk(output_dir)
        print("Saved model to", output_dir)


def test_model(text):
    nlp = spacy.load(r'D:\currency')
    for tex in text.split('\n'):
        doc = nlp(tex)
        for ent in doc.ents:
            print(ent.text, ent.label_)


spacy_train_model()  # Training the model
test_model('text')  # Testing the model
Solution
Here are a few thoughts...
You can't train a model with only five examples. Maybe this is just sample code and you have more examples, but you generally need several hundred.
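If you only need to recognize currency names like USD or GBP, use spaCy's rule-based matchers. You only need an NER model if the names are ambiguous in some way. For example, if ALL is a currency but you don't want to recognize it in "I ate ALL the donuts", an NER model can help, but that is a very hard distinction to learn, so you'll need hundreds of examples.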
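The rule-based approach could look roughly like the sketch below, which uses spaCy's entity_ruler component on a blank English pipeline. The short CURRENCY_CODES list and the find_currencies helper are assumptions for illustration; in practice you would load the full list of codes from the question.

```python
import spacy

# Hypothetical subset of the currency codes from the question
CURRENCY_CODES = ["USD", "INR", "EUR", "AFN"]

nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")
# String patterns are matched on the exact token text, so they are case-sensitive
ruler.add_patterns([{"label": "CUR", "pattern": code} for code in CURRENCY_CODES])


def find_currencies(text):
    """Return (text, label) pairs for every currency code the ruler matched."""
    doc = nlp(text)
    return [(ent.text, ent.label_) for ent in doc.ents]


print(find_currencies("I have LOT of AZWSQTS Indian MZW currency USD INR"))
```

Because only listed codes can match, unknown all-caps tokens such as AZWSQTS or LOT are never labeled. A code like ALL would still match whenever it is written in capitals, which is the ambiguous case where a trained model becomes relevant.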
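What is probably happening in your example is that the NER model has learned that any all-caps token is a currency. If you want to fix this with an NER model, you need to give it examples in which all-caps tokens are not currencies, so that it can learn the difference.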
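As a rough illustration of what such training data might look like, the sketch below builds examples in the same (text, {'entities': [...]}) format and includes sentences whose all-caps tokens carry no entity annotation. The annotate helper, the short code list, and the sentences are all made up for illustration, and a real dataset would need hundreds of examples.

```python
import re

# Hypothetical subset of the currency codes from the question
CURRENCY_CODES = {"USD", "EUR", "INR", "AFN"}


def annotate(text, codes=CURRENCY_CODES, label="CUR"):
    """Build a (text, {'entities': [...]}) training tuple.

    Only three-letter all-caps words found in `codes` are labeled. Every
    other token, including other all-caps tokens, stays unlabeled and so
    serves as a negative example.
    """
    entities = [
        (m.start(), m.end(), label)
        for m in re.finditer(r"\b[A-Z]{3}\b", text)
        if m.group() in codes
    ]
    return (text, {"entities": entities})


TRAIN_DATA = [
    annotate("I paid 20 USD for the ticket"),
    annotate("The NASA launch was delayed"),    # all caps, not a currency
    annotate("Send the PDF to HR by Friday"),   # all caps, not a currency
    annotate("Convert 500 INR to EUR at the NASA gift shop"),
]

for example in TRAIN_DATA:
    print(example)
```

Annotations produced this way are only a draft: an ambiguous word such as ALL would be labeled wherever it appears in capitals, so those cases still need manual review.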