How do I extract part of partially patterned lines from a text file in Python?

Problem description

I have a text file with the following content:

0:00 txt txt e-mail1_to_extract txt_to_extract1 txt txt /data
0:00 txt txt e-mail2_to_extract txt_to_extract2 txt txt /data
0:00 txt txt txt e-mail3_to_extract txt_to_extract3 txt txt /var
0:00 txt txt txt txt e-mail4_to_extract txt_to_extract4 txt txt /var
0:00 txt txt e-mail5_to_extract txt_to_extract5 txt txt /data

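First, I want to extract everything on each of these lines between "0:00" and "/data" or "/var". Second, I want to process that data so that only two parts of it are extracted. The text inside the extracted range is not standardized, so I cannot use something like "startswith"/"endswith". However, each piece is a whole word, and the part I need always sits in the same relative position, right after the e-mail address. Is there a way to target that part specifically and extract the e-mail and the string that follows it?

txt = extra text that I do not want to extract.

I tried starting with the code below, but I did not get any results: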

with open('content.txt') as infile, open('extraction.txt', 'w') as outfile:
    copy = False
    for line in infile:
        if line.strip() == "0:00":
            copy = True
            continue
        elif line.strip() == "/":
            copy = False
            continue
        elif copy:
            outfile.write(line)

Desired output

e-mail1_to_extract txt_to_extract1
e-mail2_to_extract txt_to_extract2
e-mail3_to_extract txt_to_extract3
e-mail4_to_extract txt_to_extract4
e-mail5_to_extract txt_to_extract5

Thanks!

Solution

I used a sample file in the format you provided:

0:00 txt txt123 [email protected] txt_to_extract1 txt6456 txtssss /data
0:00 txt11 txt111 [email protected] txt_to_extract2 txtssss txtffff /data
0:00 txt111 txt123 txt [email protected] txt_to_extract3 txtosvbsvs txtkkkk /var
0:00 txt456 txt3663 [email protected] txt e-mail4_to_extract txt_to_extract4 txabjahsjat txtasba /var
0:00 txtGJK txtfggg [email protected] txt_to_extract5 txtbxajla txtzbaza /data

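I used the following code (the check function decides whether a token is an e-mail address; change the regular expression to suit your data):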

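import re

# Simple e-mail pattern (raw string so the backslashes reach the regex engine).
# Adjust it to match the addresses in your data.
regex = r'^[a-z0-9]+[._]?[a-z0-9]+@\w+\.\w{2,3}$'

def check(email):
    """Return True if the token looks like an e-mail address."""
    return re.search(regex, email) is not None

def getcols(row):
    """Helper (not used by the loop below): return 'email next_token'
    for the first e-mail found in a list of tokens, or '' if none."""
    for i in range(len(row) - 1):
        if check(row[i]):
            return str(row[i]) + " " + str(row[i + 1])
    return ""

with open('TestData.txt') as infile, open('extraction.txt', 'w') as outfile:
    for line in infile:
        ls = line.split()
        # Stop one token early so that ls[i + 1] always exists.
        for i in range(len(ls) - 1):
            if check(ls[i]):
                outfile.write(ls[i] + " " + ls[i + 1] + "\n")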

I get the following output:

[email protected] txt_to_extract1
[email protected] txt_to_extract2
[email protected] txt_to_extract3
[email protected] txt
[email protected] txt_to_extract5
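
As a side note, the attempt in the question probably writes nothing because line.strip() == "0:00" compares the whole line with "0:00", and no line consists of only that. The marker and the two wanted tokens can also be matched with a single regular expression per line. This is only a minimal sketch: the name LINE_RE, the function extract, the loose "anything@anything.anything" e-mail pattern and the example.com addresses are my own assumptions, not part of your data.

```python
import re

# Hypothetical pattern: "0:00", any filler, an e-mail-like token, the token
# right after it, more filler, then a trailing /data or /var.
LINE_RE = re.compile(r'^0:00\s.*?(\S+@\S+\.\S+)\s+(\S+).*\s/(?:data|var)\s*$')

def extract(line):
    """Return 'email next_token' for a matching line, or None."""
    m = LINE_RE.match(line)
    return f"{m.group(1)} {m.group(2)}" if m else None

# Made-up sample lines; the addresses are placeholders, not real data.
sample = [
    "0:00 txt txt alice@example.com txt_to_extract1 txt txt /data",
    "0:00 txt txt txt bob@example.org txt_to_extract2 txt txt /var",
    "no timestamp here alice@example.com txt /data",
]

# Why the original comparison never fires: strip() keeps the whole line.
print(sample[0].strip() == "0:00")    # False
print(sample[0].startswith("0:00"))   # True

results = [extract(line) for line in sample]
print(results)
```

Like the loop above, this takes the first e-mail-like token on a line together with whatever token follows it, so a line where some other word sits directly after the address would still return that word.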