How can I efficiently split a text file on certain marker strings?

Problem description

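I recently started learning python3, purely to be more productive at work. This is probably a very basic question.

I know that for a string, we can use str.split to break it into parts at a given character.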

But how should I do this for a file?

Take a file bigfile.txt, where some of the lines look like this:

some intro lines xxxxxx
sdafiefisfhsaifdijsdjsia
dsafdsifdsiod

\item 12478621376321748324
sdfasfsdfafda

\item 23847328412834723
uduhfavfduhfu
sduhfhaiuesfhseuif
lots and other lines


\item 328347848732
pewprpewposdp
everthing up to and inclued this line
and the blank line too

some end lines dsahudfuha
dsfdsfdsf

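The lines of interest are those starting at an \item xxxxx line and running up to, but not including, the next \item xxxxx line.

How can I split bigfile.txt efficiently so that I end up with the following?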

bigfile_part1.txt, which contains

\item 12478621376321748324
sdfasfsdfafda

bigfile_part2.txt, which contains

\item 23847328412834723
uduhfavfduhfu
sduhfhaiuesfhseuif
lots and other lines

bigfile_part3.txt, which contains

\item 328347848732
pewprpewposdp
everthing up to and inclued this line
and the blank line too

The intro lines and the end lines should be ignored.

Also, how can I apply this to split a batch of files, for example

bigfile2.txt
bigfile3.txt
bigfile4.txt

in exactly the same way?

Solution

You can use itertools.groupby to split the file. groupby starts a new sub-iterator whenever the key changes. In your case, the key is whether a line starts with "\item ".

import itertools

records = []
record = None

# groupby starts a new sub-iterator each time the key changes;
# the key here is "does this line start with \item ?"
with open('thefile') as f:
    for key, subiter in itertools.groupby(f, lambda line: line.startswith(r"\item ")):
        if key:
            # in an \item group, which normally has 1 line
            item_id = next(subiter).split()[1]
            record = {"item_id": item_id}
        elif record:
            # in the value subgroup that follows an \item line
            record["values"] = [line.strip() for line in subiter]
            records.append(record)

for record in records:
    print(record)

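To process multiple files, you can put this into a function and call it once per file. After that, it is just a matter of getting the list of files, perhaps with glob.glob("some/path/big*.txt").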
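As a rough sketch of that suggestion, the groupby logic can be wrapped in a function that writes one output file per \item block and is then called for every path returned by glob.glob. The names split_file and split_all, the "_part<N>" naming scheme, and the rule for skipping previously generated files are assumptions made for illustration, not part of the answer above. Like the original snippet, this does not strip the end lines from the last part.

```python
import glob
import itertools
import os

MARKER = "\\item "  # a literal backslash followed by "item "


def split_file(path):
    """Write each item block of `path` to <stem>_part<N><ext>; return the new paths."""
    stem, ext = os.path.splitext(path)
    parts = []  # each part is a list of lines that begins with a marker line
    with open(path) as src:
        groups = itertools.groupby(src, lambda line: line.startswith(MARKER))
        for is_marker, lines in groups:
            if is_marker:
                # every marker line opens a new part, even if several are adjacent
                parts.extend([line] for line in lines)
            elif parts:
                # body lines belong to the most recent marker; intro lines are skipped
                parts[-1].extend(lines)

    written = []
    for n, part in enumerate(parts, start=1):
        out_path = f"{stem}_part{n}{ext}"
        with open(out_path, "w") as out:
            out.writelines(part)
        written.append(out_path)
    return written


def split_all(pattern):
    """Call split_file on every file matching `pattern`."""
    results = {}
    for path in sorted(glob.glob(pattern)):
        # assumption: files named *_part* are earlier output, so do not split them again
        if "_part" in os.path.basename(path):
            continue
        results[path] = split_file(path)
    return results
```

Calling split_all("some/path/big*.txt") would then handle bigfile.txt, bigfile2.txt and so on in one pass. The "_part" check matters because the generated files also match big*.txt.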


Another approach, based on splitting the text on newline characters with re.split:

import re

text = r"""some intro lines xxxxxx
sdafiefisfhsaifdijsdjsia
dsafdsifdsiod

\item 12478621376321748324
sdfasfsdfafda
...
"""

# split on runs of two or more newline characters, i.e. on blank lines
for i, j in enumerate(re.split(r'\n{2,}', text)):
    if j.startswith(r"\item"):
        print(f"bigfile{i}.txt", j, sep="\n")  # dump to file here

Since this is a large file, we avoid reading the whole file into one string and instead read it line by line.

import sys

ITEM = r"\item"

def parseFromFile(filepath):
    parsedListFromFile = []
    with open(filepath) as fp:
        for line in fp:
            if ITEM in line:
                # a new item starts on this line; drop anything before the marker
                parsedListFromFile.append(ITEM + line.split(ITEM)[-1])
            elif parsedListFromFile:
                # no marker on this line: it continues the most recent item
                parsedListFromFile[-1] += line
    # write each item of parsedListFromFile to its own file
    for index, item in enumerate(parsedListFromFile):
        with open(filepath + str(index) + ".txt", 'w') as out:
            out.write(item + '\n')

def main():
    # assuming you run the script like this: python split.py myfile1.txt myfile2.txt ...
    paths = sys.argv[1:]  # all command-line arguments after the script name
    for path in paths:
        parseFromFile(path)  # call the function for each file

if __name__ == "__main__":
    main()

 

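* This assumes a line contains at most one \item.
* This does not ignore the end lines. You can add an if to handle them, or simply delete them by hand from the last file.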
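One possible form of that extra if is sketched below. It assumes that an item ends at the first blank line after its \item line, with that blank line included. This rule is inferred from the expected output in the question and is not stated by the answer. The generator name iter_items is hypothetical.

```python
MARKER = "\\item"  # a literal backslash followed by "item"


def iter_items(lines):
    """Yield each item as a list of lines, closed by its first blank line (inclusive)."""
    current = None  # lines of the item being collected, or None when outside an item
    for line in lines:
        if line.startswith(MARKER):
            if current is not None:
                # a new marker arrived before any blank line: emit the open item
                yield current
            current = [line]
        elif current is not None:
            current.append(line)
            if not line.strip():
                # assumption: the first blank line ends the item
                yield current
                current = None
        # lines seen while current is None (intro, end lines, extra blanks) are skipped
    if current is not None:
        # the file ended while an item was still open
        yield current
```

Because iterating over an open file yields its lines one at a time, passing the file object directly to iter_items keeps memory use low. Each yielded list can then be written out with writelines. If an item can legitimately contain blank lines, this rule will cut it short and a different end condition is needed.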