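Problem description
I want to use rule-based matching. I have text in which every word carries its POS tag, like this:
text1 = "it_PRON is_AUX a_DET beautiful_ADJ apple_NOUN"
text2 = "it_PRON is_AUX a_DET beautiful_ADJ and_CCONJ big_ADJ apple_NOUN"
So I want to create a rule that matches either an ADJ followed by a NOUN, or an ADJ followed by a PUNCT or CCONJ, followed by another ADJ, followed by a NOUN.
The output I want is:
text1 = [beautiful_ADJ apple_NOUN]
text2 = [beautiful_ADJ and_CCONJ big_ADJ apple_NOUN]
I tried the following, but I could not find a pattern that does this: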
import spacy
from spacy.matcher import Matcher

# Load the model first: the Matcher needs nlp.vocab
nlp = spacy.load("en_core_web_sm")
matchers = {"first_processing": Matcher(nlp.vocab, validate=True)}

pattern = [{}, {}, {}]  # placeholder: the right pattern still has to be found
# spaCy v2 signature; in spaCy v3 the call is add("process_1", [pattern])
matchers["first_processing"].add("process_1", None, pattern)

doc = nlp("it_PRON is_AUX a_DET beautiful_ADJ and_CCONJ big_ADJ apple_NOUN")
matches = matchers["first_processing"](doc)
for match_id, start, end in matches:
    text = doc[start:end].text
    print(text)
Solution
I don't know spaCy, but here is a solution using re (a standard-library module):
import re

REGEX = re.compile(r"\w+_ADJ +(?:\w+(?:_CCONJ|_PUNCT) +\w+_ADJ +)*\w+_NOUN")

def extract(s):
    try:
        # Unpacking raises ValueError unless there is exactly one match
        [extracted] = re.findall(REGEX, s)
    except ValueError:
        return []
    else:
        return extracted.split()

>>> extract("it_PRON is_AUX a_DET beautiful_ADJ and_CCONJ big_ADJ apple_NOUN")
['beautiful_ADJ', 'and_CCONJ', 'big_ADJ', 'apple_NOUN']
>>> extract("it_PRON is_AUX a_DET beautiful_ADJ apple_NOUN")
['beautiful_ADJ', 'apple_NOUN']
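
Note that extract returns [] not only when nothing matches but also when a sentence contains more than one match, because the single-element unpacking fails in both cases. If several matches per sentence are possible, a variant along these lines might help. It is only a sketch: extract_all and TAGGED_NP are names made up for this example, and the pattern uses \S+ instead of \w+ (an assumption on my part) so that punctuation tokens such as ,_PUNCT can match as well.

```python
import re

# ADJ, then zero or more (CCONJ-or-PUNCT, ADJ) pairs, then NOUN.
# \S+ (any non-space run) is used so that a token like ",_PUNCT" matches;
# this differs from the \w+ used in the answer above.
TAGGED_NP = re.compile(r"\S+_ADJ +(?:\S+(?:_CCONJ|_PUNCT) +\S+_ADJ +)*\S+_NOUN")

def extract_all(s):
    """Return one list of tagged tokens per match, in order of appearance."""
    return [m.group(0).split() for m in TAGGED_NP.finditer(s)]
```

For a sentence with no match this returns an empty list, and for a sentence with two matches it returns two token lists instead of discarding both.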
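I understand you have texts = ["it is a beautiful apple", "it is a beautiful and big apple"] and plan to define several Matcher patterns to extract certain POS patterns from those texts.
You can define a list of lists holding the patterns you need and pass them as the third and subsequent arguments to matcher.add: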
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab, validate=True)

# One pattern per accepted POS sequence
patterns = [
    [{'POS': 'ADJ'}, {'POS': 'NOUN'}],
    [{'POS': 'ADJ'}, {'POS': 'CCONJ'}, {'POS': 'ADJ'}, {'POS': 'NOUN'}],
    [{'POS': 'ADJ'}, {'POS': 'PUNCT'}, {'POS': 'ADJ'}, {'POS': 'NOUN'}],
]
# spaCy v2 signature; in spaCy v3 the call is matcher.add("process_1", patterns)
matcher.add("process_1", None, *patterns)

texts = ["it is a beautiful apple", "it is a beautiful and big apple"]
for text in texts:
    doc = nlp(text)
    matches = matcher(doc)
    for _, start, end in matches:
        print(doc[start:end].text)
# => beautiful apple
#    beautiful and big apple
#    big apple
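
The last output line, "big apple", appears because the two-token ADJ NOUN pattern also matches inside the longer span. If only the longest match is wanted, the overlapping shorter spans can be filtered out afterwards. Below is a rough, library-free sketch working on plain (start, end) token offsets; keep_longest is a hypothetical helper name, not part of spaCy. spaCy itself ships spacy.util.filter_spans, which as far as I know does much the same for Span objects.

```python
def keep_longest(spans):
    """Drop any (start, end) span that overlaps a longer span already kept."""
    kept = []
    taken = set()  # token indices already covered by a kept span
    # Visit the longest spans first; ties go to the earliest start.
    for start, end in sorted(spans, key=lambda s: (-(s[1] - s[0]), s[0])):
        if not taken.intersection(range(start, end)):
            kept.append((start, end))
            taken.update(range(start, end))
    return sorted(kept)

# Token offsets for "it is a beautiful and big apple":
# (3, 7) is "beautiful and big apple", (5, 7) is "big apple"
print(keep_longest([(3, 7), (5, 7)]))  # [(3, 7)]
```

With the matcher output above, the offsets would come from the (start, end) part of each match tuple, and the surviving pairs could be turned back into text with doc[start:end].text.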