How do I reference the text after extracting it with the spaCy Matcher?

Problem description

I am using the spaCy Matcher to extract keywords.

import spacy
from spacy.matcher import Matcher
nlp = spacy.load("en_core_web_sm")

matcher = Matcher(nlp.vocab, validate=True)

patterns = [{"LOWER": "self"}, {"IS_PUNCT": True, "OP": "*"}, {"LOWER": "employed"}]
patterns1 = [{"LOWER": "finance"}]
patterns2 = [{"LOWER": "accounting"}]

# spaCy v3 signature: matcher.add(key, list_of_patterns)
matcher.add("Experience", [patterns])
matcher.add("CFA", [patterns1])
matcher.add("CPA", [patterns2])

text = """ I am a self employed working in a remote factory. However,I study finance and accounting by myself in
my spare time."""

doc = nlp(text)
matches = matcher(doc)

Later, I build a DataFrame containing all the keywords:

L = []
M = []
for match_id, start, end in matches:
    rule_id = nlp.vocab.strings[match_id]  # string name of the rule, e.g. "Experience"
    span = doc[start:end]  # the matched slice of the doc
    L.append(rule_id)
    M.append(span.text)

import pandas as pd

df = pd.DataFrame({"Keywords": L, "Profession": M})
print(df)

#Output
     Keywords     Profession
0  Experience  self employed
1         CFA        finance
2         CPA     accounting

Then I want to build a subset DataFrame containing only the rows where the profession is self-employed.

#Output
     Keywords     Profession
0  Experience  self employed

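If I hard-code the comparison, I have to adjust it every time depending on the extracted text. For example, the text could be "self employed", "self-employed", "Self Employed", and so on.

I would appreciate any ideas. Thanks.

Solution

In your case, making the IS_PUNCT token optional should do it: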

patterns = [{"LOWER": "self"},{'IS_PUNCT': True,'OP':'?'},{"LOWER": "employed"}]
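
As a quick check of that pattern, here is a minimal sketch. It assumes spaCy v3 and uses `spacy.blank("en")` (tokenizer only, no model download) in place of `en_core_web_sm`. The sample sentences and the `matched_texts` helper are made up for illustration and are not part of the original code.

```python
import spacy
from spacy.matcher import Matcher

# Assumption: a blank English pipeline is enough here, because the pattern
# only uses lexical attributes (LOWER, IS_PUNCT) that the tokenizer provides.
nlp = spacy.blank("en")

matcher = Matcher(nlp.vocab, validate=True)
pattern = [{"LOWER": "self"}, {"IS_PUNCT": True, "OP": "?"}, {"LOWER": "employed"}]
matcher.add("Experience", [pattern])  # spaCy v3 signature: list of patterns


def matched_texts(text):
    """Return the text of every span matched in `text` (hypothetical helper)."""
    doc = nlp(text)
    return [doc[start:end].text for _, start, end in matcher(doc)]


samples = ["I am self employed.", "I am self-employed.", "I am Self Employed."]
for sample in samples:
    print(sample, "->", matched_texts(sample))
```

Each variant is caught by the same rule, so the rule name stays stable even though the matched text differs.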

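I'm still not sure I understand what you are trying to achieve. When your pattern matches, do you always want to save "self employed"? If so, here is a possible solution: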

for match_id, start, end in matches:
    rule_id = nlp.vocab.strings[match_id]  # string name of the rule, e.g. "Experience"
    span = doc[start:end]  # the matched slice of the doc
    exp_span = span.text
    if rule_id == "Experience":
        # store one canonical label regardless of how the text was written
        exp_span = "self employed"
    L.append(rule_id)
    M.append(exp_span)
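
If the goal is the subset DataFrame from the question, one option is to filter on the rule name in `Keywords` rather than on the matched text, so spelling variants stop mattering. This is a minimal sketch, assuming a DataFrame shaped like the one printed in the question. The rows below are hand-written sample data, not matcher output, and `experience_df` is just an illustrative name.

```python
import pandas as pd

# Hand-written sample rows that mirror the DataFrame from the question.
df = pd.DataFrame(
    {
        "Keywords": ["Experience", "CFA", "CPA"],
        "Profession": ["self-employed", "finance", "accounting"],
    }
)

# Select by rule name: every span matched by the "Experience" rule is kept,
# whether the text was "self employed", "self-employed", or another variant.
experience_df = df[df["Keywords"] == "Experience"]
print(experience_df)
```

Filtering this way avoids hard-coding the matched string, and it also works together with the normalization in the loop above.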