spaCy NER custom-trained model does not predict labels correctly

Problem description

I trained a custom spaCy NER model on a sample test-case dataset, following https://towardsdatascience.com/train-ner-with-custom-training-data-using-spacy-525ce748fab7 and https://spacy.io/usage/processing-pipelines, to accurately find currencies in a given text.

Sample dataset:

TRAIN_DATA = [('This is AFN currency',{'entities': [(8,11,'CUR')]}),
              ('I have EUR european currency',{'entities': [(7,10,'CUR')]}),
              ('let as have ALL money',{'entities': [(12,15,'CUR')]}),
              ('DZD is a dollar',{'entities': [(0,3,'CUR')]}),
              ('money USD united states',{'entities': [(6,9,'CUR')]})
              ]

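The model trained successfully and was saved under the name "currency". It predicts the correct labels on the training data, but on text it was not trained on it mostly predicts wrong labels.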

Input test line: 'I have AZWSQTS LOT of Indian MZW currency USD INR'

Output:

AZWSQTS - CUR,LOT - CUR,MZW - CUR,USD - CUR,INR - CUR

Here "AZWSQTS" and "LOT" are not currencies, but the model predicts them as such. That is the problem I am facing.

Full code:

from __future__ import unicode_literals,print_function
import random
from pathlib import Path
import spacy
from tqdm import tqdm
from spacy.training import Example

def spacy_train_model():
    '''Sample training dataset format'''
    '''list of currency'''
    currency_list = ['AFN','EUR','ALL','DZD','USD','AOA','XCD','ARS','AMD','AWG','SHP','AUD','AZN','','BSD','BHD','BDT','BBD','BYN','BZD','XOF','BMD','BTN','BOB','BAM','BWP','BRL','BND','BGN','BIF','CVE','KHR','XAF','CAD','KYD','NZD','CLP','CNY','cop','KMF','CDF','none','CRC','HRK','CUP','ANG','CZK','DKK','DJF','DOP','EGP','ERN','SZL','ETB','FKP','FJD','XPF','GMD','GEL','GHS','GIP','GTQ','GGP','GNF','GYD','HTG','HNL','HKD','HUF','ISK','INR','IDR','XDR','IRR','IQD','IMP','ILS','JMD','JPY','JEP','JOD','KZT','KES','KWD','KGS','LAK','LBP','LSL','LRD','LYD','CHF','MOP','MGA','MWK','MYR','MVR','MRU','MUR','MXN','MDL','MNT','MAD','MZN','MMK','NAD','NPR','NIO','NGN','KPW','MKD','NOK','omr','PKR','PGK','PYG','PEN','PHP','PLN','QAR','RON','RUB','RWF','WST','STN','SAR','RSD','SCR','sll','SGD','SBD','SOS','ZAR','GBP','KRW','ssp','LKR','SDG','SRD','SEK','SYP','TWD','TJS','TZS','THB','TOP','TTD','TND','TRY','TMT','UGX','UAH','AED','UYU','UZS','VUV','VES','VND','YER','ZMW','USD']


    TRAIN_DATA = [('This is AFN currency',{'entities': [(8,11,'CUR')]}),
                  ('I have EUR european currency',{'entities': [(7,10,'CUR')]})
                  ]

    # model = "en_core_web_lg"
    model = None
    output_dir=Path(r"D:\currency") # Path to save training model - create new empty directory
    n_iter=100

    #load the model

    if model is not None:
        nlp = spacy.load(model)
        optimise = nlp.create_optimizer()
        print("Loaded model '%s'" % model)
    else:
        nlp = spacy.blank('en')
        optimise = nlp.begin_training()
        print("Created blank 'en' model")

    #set up the pipeline

    if 'ner' not in nlp.pipe_names:
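        # In spaCy v3, add_pipe takes the component name and returns the
        # instance that is actually in the pipeline, so labels must be added to it.
        ner = nlp.add_pipe('ner',last=True)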
    else:
        ner = nlp.get_pipe('ner')


    for _,annotations in TRAIN_DATA:
        for ent in annotations.get('entities'):
            ner.add_label(ent[2])

    other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'ner']
    with nlp.disable_pipes(*other_pipes):  # only train NER
        optimizer = nlp.initialize()
        # optimizer = optimise
        for itn in range(n_iter):
            random.shuffle(TRAIN_DATA)
            losses = {}

            for text,annotations in tqdm(TRAIN_DATA):
                doc = nlp.make_doc(text)
                example = Example.from_dict(doc,annotations)
                nlp.update(
                    [example],drop=0.5,sgd=optimizer,losses=losses)
            print(losses)

    for text,_ in TRAIN_DATA:
        doc = nlp(text)
        print('Entities',[(ent.text,ent.label_) for ent in doc.ents])


    if output_dir is not None:
        output_dir = Path(output_dir)
        if not output_dir.exists():
            output_dir.mkdir()
        nlp.to_disk(output_dir)
        print("Saved model to",output_dir)
    
    

def test_model(text):
    nlp = spacy.load(r'D:\currency')
    for tex in text.split('\n'):
        doc = nlp(tex)
        for ent in doc.ents:
            print(ent.text,ent.label_)
        
        
spacy_train_model()     #Training the model
test_model('text')      #Testing the model

Solution

Here are a few thoughts...

You cannot train a model with only five examples. Maybe this is just sample code and you have more, but you generally need a few hundred examples.

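If you only need to recognize currency names like USD or GBP, use spaCy's rule-based matchers. You only need an NER model if the names are ambiguous in some way. For example, if ALL is a currency but you don't want to recognize it in "I ate ALL the donuts", an NER model could help, but that is a very hard distinction to learn, so you would need hundreds of examples.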
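As a rough illustration of the rule-based route, here is a minimal sketch using spaCy v3's built-in `entity_ruler` component on a blank English pipeline. The short `CURRENCY_CODES` list and the `build_currency_matcher` helper are made up for this example; in practice you would load the full list of codes you care about. String patterns are matched on the exact token text, so the match is case-sensitive.

```python
import spacy

# Hypothetical short list for illustration; use your full list of codes in practice.
CURRENCY_CODES = ["AFN", "EUR", "ALL", "DZD", "USD", "INR", "GBP"]

def build_currency_matcher(codes):
    """Return a blank English pipeline that tags exact currency codes as CUR."""
    nlp = spacy.blank("en")
    # entity_ruler is a built-in spaCy v3 component; no training is involved.
    ruler = nlp.add_pipe("entity_ruler")
    # A string pattern matches the exact token text (case-sensitive by default).
    ruler.add_patterns([{"label": "CUR", "pattern": code} for code in codes])
    return nlp

nlp = build_currency_matcher(CURRENCY_CODES)
doc = nlp("I have AZWSQTS LOT of Indian MZW currency USD INR")
print([(ent.text, ent.label_) for ent in doc.ents])
```

With this list only USD and INR are tagged, because AZWSQTS, LOT and MZW are not in it. The limitation is the one described above: an upper-case "ALL" would be tagged regardless of context, and that is the case where a trained NER model becomes worth the effort.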

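What is probably happening in your example is that the NER model has learned that any all-caps token is a currency. If you want to fix that with an NER model, you need to give it examples to learn from in which all-caps tokens are not currencies.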
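To make that concrete, here is a small sketch of what such training data could look like. The sentences are invented for illustration, and four of them are nowhere near enough to train on. The point is only the shape: all-caps tokens such as NASA, FBI or ALL appear in the text without a CUR annotation, so the model sees upper-case words that are not currencies. As far as I know, passing an empty `entities` list makes every token in that sentence count as "not an entity" rather than as unknown.

```python
import spacy
from spacy.training import Example

# Invented sentences for illustration only; real training needs hundreds.
TRAIN_DATA = [
    ("I paid 20 USD for the NASA mug", {"entities": [(10, 13, "CUR")]}),
    ("The FBI report is due ASAP", {"entities": []}),
    ("Send 500 INR to the IBM office", {"entities": [(9, 12, "CUR")]}),
    ("We ate ALL the donuts", {"entities": []}),
]

nlp = spacy.blank("en")

# Build Example objects the same way the training loop in the question does.
examples = [
    Example.from_dict(nlp.make_doc(text), annotations)
    for text, annotations in TRAIN_DATA
]

# Check that each character offset lands on the intended token.
for example in examples:
    gold = [(ent.text, ent.label_) for ent in example.reference.ents]
    print(example.reference.text, "->", gold)
```

Printing the gold entities before training is a cheap way to catch offset mistakes, which are easy to make when the spans are counted by hand.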

custom-training named-entity-recognition spacy
