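Problem description
I have multiple text files. When a specific pattern matches, I want to extract the matching string and append it, together with the file name, to a data frame. In my case, the same pattern occurs several times within each text file.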
sample.txt:
"government high school
Govt high school physics department
Employee Designation School Assistant"
What I am getting:
file | Org | Org2
sample.txt | government high school | Govt high school physics department
sample.txt | government high school | Employee Designation School Assistant
What I am looking for:
file | Org | Org2
sample.txt | government high school | Govt high school physics department
This is the code I am using:
import os
import re

prs_path = "C://Users//subhr//scope_txt//"
df3 = []  # collected rows; each row starts with the file name
for file in os.listdir(prs_path):
    Name = None
    with open(prs_path + file) as fd:
        for line in fd:
            line = line.lower()
            # First choice: a line that mentions "government"
            match = re.search(r'(^.*government.*$)', line, re.I)
            Org = ""
            if match:
                Org = match.group()
                df3.append([file, Org])
            Org2 = ""
            Org3 = ""
            Org = ""
            if match is None:
                # Otherwise: a line that mentions "school" or "college"
                match2 = re.search(r'(^.*school.*$)|(^.*college.*$)', line, re.I)
                if match2:
                    Org2 = match2.group()
                    df3.append([file, Org, Org2])
                if match2 is None:
                    # Last choice: a line that mentions "power"
                    match3 = re.search(r'(^.*power.*$)', line, re.I)
                    if match3:
                        Org3 = match3.group()
                        df3.append([file, Org2, Org3])
                    if match3 is None:
                        continue
Where am I going wrong?
Solution
Try this pattern: r"^(.*?):$\n\"(.*?) (.*?)$\n(.*?) (.*? .*?) (.*?)$"
It splits your input into 6 groups; run it against your data to check.
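One way to check the six groups is the minimal sketch below. It assumes the text being searched is the sample block exactly as shown in the question, including the `sample.txt:` line and the opening quote, and that the pattern is compiled with `re.MULTILINE` so that `^` and `$` match at each line boundary (the answer does not name any flags, so that part is an assumption). The names `PATTERN`, `SAMPLE` and `split_groups` are made up for illustration.

```python
import re

# Pattern quoted from the answer. re.MULTILINE is an assumption:
# without it, "$" would not match at the end of the first lines.
PATTERN = re.compile(
    r"^(.*?):$\n\"(.*?) (.*?)$\n(.*?) (.*? .*?) (.*?)$",
    re.MULTILINE,
)

# The sample block as it appears in the question.
SAMPLE = (
    'sample.txt:\n'
    '"government high school\n'
    'Govt high school physics department\n'
    'Employee Designation School Assistant"'
)


def split_groups(text):
    """Return the six captured groups, or None if the layout does not match."""
    match = PATTERN.search(text)
    return match.groups() if match else None


print(split_groups(SAMPLE))
# ('sample.txt', 'government', 'high school', 'Govt', 'high school', 'physics department')
```

Note that the pattern depends on the exact layout: it only matches when the file name line, the quoted first line and the second line appear in this order, so text laid out differently returns None here.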