Problem description
I am using the spaCy Matcher to extract keywords.
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab, validate=True)

patterns = [{"LOWER": "self"}, {"IS_PUNCT": True, "OP": "*"}, {"LOWER": "employed"}]
patterns1 = [{"LOWER": "finance"}]
patterns2 = [{"LOWER": "accounting"}]

# spaCy v3 signature: matcher.add(key, [pattern, ...])
matcher.add("Experience", [patterns])
matcher.add("CFA", [patterns1])
matcher.add("CPA", [patterns2])

text = """ I am a self employed working in a remote factory. However,I study finance and accounting by myself in
my spare time."""
doc = nlp(text)
matches = matcher(doc)
Later, I build a DataFrame containing all the keywords:
L = []
M = []
for match_id, start, end in matches:
    rule_id = nlp.vocab.strings[match_id]  # look up the rule name, e.g. "Experience"
    span = doc[start:end]  # get the matched slice of the doc
    L.append(rule_id)
    M.append(span.text)

import pandas as pd

df = pd.DataFrame({"Keywords": L, "Profession": M})
print(df)
#Output
     Keywords     Profession
0  Experience  self employed
1         CFA        finance
2         CPA     accounting
Next, I want to build a subset DataFrame that contains only the rows where the profession is self employed.
#Output
     Keywords     Profession
0  Experience  self employed
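If I hard-code the filter, I have to adjust it every time depending on the extracted text. For example, the text may appear in several variants of "self employed" (with or without a hyphen, different capitalization, and so on).
I would appreciate any ideas. Thanks.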
Solution
In your case, making the IS_PUNCT token optional should do it:
patterns = [{"LOWER": "self"}, {"IS_PUNCT": True, "OP": "?"}, {"LOWER": "employed"}]
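To see how the optional punctuation token behaves, here is a minimal sketch. It assumes spaCy v3 and uses a blank English pipeline instead of en_core_web_sm, since LOWER and IS_PUNCT are lexical attributes that do not need a trained model. The helper name find_experience and the sample sentences are made up for illustration.

```python
import spacy
from spacy.matcher import Matcher

# Assumption: spaCy v3. A blank pipeline only tokenizes, which is all
# that the LOWER and IS_PUNCT attributes require.
nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab, validate=True)

pattern = [{"LOWER": "self"}, {"IS_PUNCT": True, "OP": "?"}, {"LOWER": "employed"}]
matcher.add("Experience", [pattern])  # v3 signature: a list of patterns


def find_experience(text):
    """Return the text of every span matched by the "Experience" rule."""
    doc = nlp(text)
    return [doc[start:end].text for _, start, end in matcher(doc)]


# Hypothetical sample sentences covering a few spelling variants.
variants = ["I am self employed.", "I am self-employed.", "I am Self Employed."]
results = {text: find_experience(text) for text in variants}
for text, spans in results.items():
    print(text, "->", spans)
```

Because the rule is keyed on LOWER, capitalization does not matter, and the "?" operator lets the hyphen be present or absent.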
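I am still not sure I understand what you want to achieve. Do you want to always store "self employed" whenever your pattern matches? If so, here is a possible solution: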
L, M = [], []
for match_id, start, end in matches:
    rule_id = nlp.vocab.strings[match_id]  # look up the rule name, e.g. "Experience"
    span = doc[start:end]  # get the matched slice of the doc
    exp_span = span.text
    if rule_id == "Experience":
        exp_span = "self employed"  # always store the canonical form
    L.append(rule_id)
    M.append(exp_span)
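For the subset itself, one option is to filter on the rule label in the Keywords column rather than on the matched text, so spelling variants of the match no longer matter. This is only a sketch: the DataFrame below is a hard-coded copy of the output shown in the question, and the name experience_df is made up for illustration.

```python
import pandas as pd

# Hard-coded copy of the DataFrame printed in the question (illustrative data).
df = pd.DataFrame(
    {
        "Keywords": ["Experience", "CFA", "CPA"],
        "Profession": ["self employed", "finance", "accounting"],
    }
)

# Select rows by rule label, not by the matched text, then renumber the index.
experience_df = df[df["Keywords"] == "Experience"].reset_index(drop=True)
print(experience_df)
```

Filtering by label keeps the hard-coded value in one place (the rule name passed to matcher.add), whatever form the matched text takes.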