Problem description
I am using (?P<name>) named capture groups that contain lists of verbs and word stems related to the coronavirus pandemic.
import regex
import pandas as pd

data = {'id': [1, 2, 3, 4, 5],
        'text': ['The pandemy is spreading',
                 'He is fighting Covid-19',
                 'The pandemic virus spreads',
                 'This sentence is about a different topic',
                 'How do we stop the virus ?']}
df = pd.DataFrame(data)

def covid_lang(text):
    predicates = ['avoid', 'contain', 'track', 'spread', 'contact', 'stop', 'combat', 'fight']
    subjects = ['Corona', 'corona', 'Covid-19', 'epidem', 'infect', 'virus', 'pandem', 'disease', 'outbreak']
    # Predicate first, subject later in the same sentence
    p1 = fr'(?<=\b(?P<predicate>{"|".join(predicates)}))[^\.]*(?P<subject>{"|".join(subjects)}[a-z]*)'
    result = []
    for m in regex.finditer(p1, text, regex.S):
        result.append([m.group('predicate'), m.group('subject')])
    # Subject first, predicate later in the same sentence
    p2 = fr'\b(?P<subject>{"|".join(subjects)})[^\.]*(?<=\b(?P<predicate>{"|".join(predicates)}))'
    for m in regex.finditer(p2, text, regex.S):
        result.append([m.group('subject'), m.group('predicate')])
    return result

df['result'] = df['text'].apply(covid_lang)
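When there is a match, I would like the returned subject to be the whole word rather than just the stem (i.e. "pandemic" and "pandemy" instead of "pandem"). I tried adding [a-z]* after the word list so that the capture group stops at the end of the word, but it does not change anything.
Also, is it possible to combine the two queries (predicate before subject, and subject before predicate) into a single query? I tried (p1)|(p2), but that does not work with the named capture groups.
Finally, is it possible to cover both upper- and lower-case spellings, such as Corona and corona, with a single list entry?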
Solution
This should take care of all three points:
import regex
import pandas as pd

data = {'id': [1, 2, 3, 4, 5],
        'text': ['The pandemy is spreading',
                 'He is fighting Covid-19',
                 'The pandemic virus spreads',
                 'This sentence is about a different topic',
                 'How do we stop the virus ?']}
df = pd.DataFrame(data)

def expand_word(word):
    # Let each stem absorb any trailing letters, so the whole word is captured
    return f'({word}[a-z]*)'

def construct_named_group_from_list_of_words(word_type, word_list):
    expanded_word_regex_list = [expand_word(stem) for stem in word_list]
    word_in_named_group = fr'(?P<{word_type}>{"|".join(expanded_word_regex_list)})'
    return word_in_named_group

def covid_lang(text):
    predicates = ['avoid', 'contain', 'track', 'spread', 'contact', 'stop', 'combat', 'fight']
    # Lower case only: regex.IGNORECASE below takes care of Corona vs. corona
    subjects = ['corona', 'covid-19', 'epidem', 'infect', 'virus', 'pandem', 'disease', 'outbreak']
    predicate_in_named_group = construct_named_group_from_list_of_words("predicate", predicates)
    subject_in_named_group = construct_named_group_from_list_of_words("subject", subjects)
    result = []
    p1 = fr'(?<=\b{predicate_in_named_group})[^\.]*{subject_in_named_group}'
    p2 = fr'\b{subject_in_named_group}[^\.]*(?<=\b{predicate_in_named_group})'
    # Both orders in one pattern; the regex module allows a group name to be reused
    p = fr'({p1})|({p2})'
    for m in regex.finditer(p, text, regex.S | regex.IGNORECASE):
        result.append([m.group('predicate'), m.group('subject')])
    return result

df['result'] = df['text'].apply(covid_lang)
print(df)
Output:
   id                                      text                  result
0   1                  The pandemy is spreading  [[spreading, pandemy]]
1   2                   He is fighting Covid-19     [[fight, Covid-19]]
2   3                The pandemic virus spreads   [[spreads, pandemic]]
3   4  This sentence is about a different topic                      []
4   5                How do we stop the virus ?         [[stop, virus]]
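As for why adding [a-z]* after the joined word list had no effect in the question's pattern: alternation binds more loosely than concatenation, so the suffix attaches only to the last alternative (outbreak). Below is a minimal sketch of that behaviour, using the standard-library re module and a shortened stem list purely for illustration; the names loose and grouped are my own and do not appear in the question.

```python
import re

stems = ['virus', 'pandem', 'outbreak']  # shortened list, for illustration only

# Suffix after the joined list: parsed as  virus | pandem | outbreak[a-z]*
loose = re.compile(r'(?P<subject>' + '|'.join(stems) + r'[a-z]*)')

# Suffix after a non-capturing group: applies to every stem
grouped = re.compile(r'(?P<subject>(?:' + '|'.join(stems) + r')[a-z]*)')

text = 'The pandemic virus spreads'
loose_subject = loose.search(text).group('subject')
grouped_subject = grouped.search(text).group('subject')
print(loose_subject)    # pandem
print(grouped_subject)  # pandemic
```

The expand_word helper in the listing above gets the same effect by appending [a-z]* to each stem individually.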
However, I am not sure whether you always want the predicate to come first in the output. If not, this should do it:
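import regex
import pandas as pd

# data, df, expand_word and construct_named_group_from_list_of_words
# are unchanged from the first listing.

def covid_lang(text):
    predicates = ['avoid', 'contain', 'track', 'spread', 'contact', 'stop', 'combat', 'fight']
    subjects = ['corona', 'covid-19', 'epidem', 'infect', 'virus', 'pandem', 'disease', 'outbreak']
    predicate_in_named_group = construct_named_group_from_list_of_words("predicate", predicates)
    subject_in_named_group = construct_named_group_from_list_of_words("subject", subjects)
    result = []
    p1 = fr'(?<=\b{predicate_in_named_group})[^\.]*{subject_in_named_group}'
    p2 = fr'\b{subject_in_named_group}[^\.]*(?<=\b{predicate_in_named_group})'
    # Run the two patterns separately so each pair keeps its order in the text
    for m in regex.finditer(p1, text, regex.S | regex.IGNORECASE):
        result.append([m.group('predicate'), m.group('subject')])
    for m in regex.finditer(p2, text, regex.S | regex.IGNORECASE):
        result.append([m.group('subject'), m.group('predicate')])
    return result

df['result'] = df['text'].apply(covid_lang)
print(df)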
Output:
   id                                      text                  result
0   1                  The pandemy is spreading  [[pandemy, spreading]]
1   2                   He is fighting Covid-19     [[fight, Covid-19]]
2   3                The pandemic virus spreads   [[pandemic, spreads]]
3   4  This sentence is about a different topic                      []
4   5                How do we stop the virus ?         [[stop, virus]]
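If you would rather not depend on the third-party regex module, something similar can probably be done with the standard-library re module. This is only a sketch under a few assumptions, not a drop-in replacement: re does not accept variable-length lookbehinds and does not let a group name be reused within one pattern, so the version below drops the lookbehind and uses four distinct group names (pred_first, subj_second, subj_first, pred_second) that I made up for illustration. The helpers stem_group and covid_pairs are likewise hypothetical. One visible difference from the listings above: the predicate is captured as the whole word (fighting rather than fight).

```python
import re

PREDICATES = ['avoid', 'contain', 'track', 'spread', 'contact', 'stop', 'combat', 'fight']
SUBJECTS = ['corona', 'covid-19', 'epidem', 'infect', 'virus', 'pandem', 'disease', 'outbreak']

def stem_group(name, stems):
    """Named group: a word boundary, any stem, then trailing letters (hypothetical helper)."""
    alternatives = '|'.join(re.escape(stem) for stem in stems)
    return rf'\b(?P<{name}>(?:{alternatives})[a-z]*)'

# Two orderings in one pattern; each branch needs its own group names in `re`.
PATTERN = re.compile(
    stem_group('pred_first', PREDICATES) + r'[^.]*?' + stem_group('subj_second', SUBJECTS)
    + '|'
    + stem_group('subj_first', SUBJECTS) + r'[^.]*?' + stem_group('pred_second', PREDICATES),
    re.IGNORECASE,
)

def covid_pairs(text):
    """Return [first_word, second_word] pairs in the order they occur in the text."""
    pairs = []
    for m in PATTERN.finditer(text):
        if m.group('pred_first') is not None:
            pairs.append([m.group('pred_first'), m.group('subj_second')])
        else:
            pairs.append([m.group('subj_first'), m.group('pred_second')])
    return pairs

print(covid_pairs('He is fighting Covid-19'))     # [['fighting', 'Covid-19']]
print(covid_pairs('The pandemic virus spreads'))  # [['pandemic', 'spreads']]
```

It can be applied to the DataFrame in the same way as before, with df['text'].apply(covid_pairs).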