Problem description
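I recently started learning Python 3, purely to be more productive at work, so this may be a very basic question.
I know that a string can be split into parts on a given character with str.split. But how do I do the same kind of thing for a file?
Suppose the file bigfile.txt contains lines like these: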
some intro lines xxxxxx
sdafiefisfhsaifdijsdjsia
dsafdsifdsiod
\item 12478621376321748324
sdfasfsdfafda
\item 23847328412834723
uduhfavfduhfu
sduhfhaiuesfhseuif
lots and other lines
\item 328347848732
pewprpewposdp
everthing up to and inclued this line
and the blank line too
some end lines dsahudfuha
dsfdsfdsf
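The parts I care about are the lines from each \item xxxxx line up to, but not including, the next \item xxxxx line. Each part should go into its own file:
bigfile_part1.txt, which contains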
\item 12478621376321748324
sdfasfsdfafda
bigfile_part2.txt, which contains
\item 23847328412834723
uduhfavfduhfu
sduhfhaiuesfhseuif
lots and other lines
bigfile_part3.txt, which contains
\item 328347848732
pewprpewposdp
everthing up to and inclued this line
and the blank line too
The same should be done for bigfile2.txt, bigfile3.txt, and bigfile4.txt.
Solution
You can use itertools.groupby to split the file. groupby starts a new sub-iterator every time the key changes; in your case the key is whether a line starts with "\item ".
import itertools

records = []
record = None
with open('thefile') as f:
    for key, subiter in itertools.groupby(f, lambda line: line.startswith(r"\item ")):
        if key:
            # in an \item group, which normally has 1 line
            item_id = next(subiter).split()[1]
            record = {"item_id": item_id}
        else:
            # in the value subgroup that follows an \item line
            if record:
                record["values"] = [line.strip() for line in subiter]
                records.append(record)

for record in records:
    print(record)
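To see what groupby actually hands back, here is a minimal sketch on a small in-memory list. The sample lines and the helper names is_item and group_runs are made up for illustration; they are not part of the question's file.

```python
import itertools

# Hypothetical sample lines standing in for the contents of a file.
lines = [
    "intro\n",
    "\\item 111\n",
    "aaa\n",
    "bbb\n",
    "\\item 222\n",
    "ccc\n",
]

def is_item(line):
    # Key function: True for \item lines, False for everything else.
    return line.startswith("\\item ")

def group_runs(lines):
    # One (key, run) pair per stretch of consecutive lines sharing a key.
    return [
        (key, [line.rstrip("\n") for line in run])
        for key, run in itertools.groupby(lines, is_item)
    ]

for key, run in group_runs(lines):
    print(key, run)
```

Each True run is an \item line and the False run right after it holds that item's lines. The very first False run (the intro) has no \item in front of it, which is why the loop above skips groups until a record exists.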
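To process several files, you can put this in a function and call it once per file. Then it is just a matter of getting the list of files, perhaps with glob.glob("some/path/big*.txt").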
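A rough sketch of that idea, assuming you want output names of the form bigfile_part1.txt next to the input file. The function name split_items and the naming scheme are my own choices, and some/path is only a placeholder.

```python
import glob
import itertools
import os

def split_items(path):
    """Write each \\item block of path to <stem>_partN<ext>; return the new paths."""
    stem, ext = os.path.splitext(path)
    blocks = []
    with open(path) as src:
        groups = itertools.groupby(src, lambda line: line.startswith("\\item "))
        for is_item, run in groups:
            if is_item:
                # Consecutive \item lines each start their own block.
                blocks.extend([line] for line in run)
            elif blocks:
                # Lines following an \item belong to the most recent block.
                blocks[-1].extend(run)
    written = []
    for number, block in enumerate(blocks, start=1):
        out_path = f"{stem}_part{number}{ext}"
        with open(out_path, "w") as out:
            out.writelines(block)
        written.append(out_path)
    return written

if __name__ == "__main__":
    # Placeholder pattern; adjust to wherever the big files live.
    for path in glob.glob("some/path/big*.txt"):
        print(path, "->", split_items(path))
```

Two caveats: the last block keeps everything up to the end of the file, and on a second run the pattern big*.txt would also match the part files written the first time, so you may want to write the output to a different directory.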
Another split-based approach, using a regex to split on runs of two or more newline characters:
import re

text = r"""some intro lines xxxxxx
sdafiefisfhsaifdijsdjsia
dsafdsifdsiod
\item 12478621376321748324
sdfasfsdfafda
...
"""

# split on runs of two or more newline characters, i.e. on blank lines;
# this assumes the \item blocks are separated from each other by blank lines
for i, j in enumerate(re.split(r'\n{2,}', text)):
    if j.startswith(r"\item"):
        print(f"bigfile{i}.txt", j, sep="\n")  # dump to file here
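If the \item blocks are not separated by blank lines, as in the sample in the question, a variant is to split right before every line that starts with \item, using a zero-width lookahead. This is a different pattern from the one above, not the same method, and it needs Python 3.7 or later, where re.split accepts patterns that match the empty string. The name item_chunks is only for illustration.

```python
import re

# Zero-width match at the start of every line that begins with "\item ".
ITEM_START = re.compile(r"^(?=\\item )", flags=re.MULTILINE)

def item_chunks(text):
    # Split before each \item line and drop whatever precedes the first one.
    return [chunk for chunk in ITEM_START.split(text) if chunk.startswith("\\item ")]

sample = "some intro\n\\item 1\naaa\n\\item 2\nbbb\nccc\n"
for number, chunk in enumerate(item_chunks(sample), start=1):
    print(f"part {number}:")
    print(chunk, end="")
```

Like the version above, this reads the whole text into memory first, so it fits small and medium files better than very large ones.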
Since it is a big file, rather than reading the whole file into one string, we read it line by line.
import sys

def parseFromFile(filepath):
    parsedListFromFile = []
    unended_item = False
    with open(filepath) as fp:
        line = fp.readline()
        while line:
            if line.find(r"\item") != -1:  # there is an \item in this line
                # start a new item, dropping anything before the \item
                parsedListFromFile.append(r"\item" + line.split(r"\item")[-1])
                unended_item = True
            elif unended_item:
                # continuation line of the current item
                parsedListFromFile[-1] += line
            line = fp.readline()

    # write each item of parsedListFromFile to its own file
    for index, item in enumerate(parsedListFromFile):
        with open(filepath + str(index) + ".txt", 'w') as out:
            out.write(item + '\n')

def main():
    # assuming you run the script like this: python split.py myfile1.txt myfile2.txt ...
    paths = sys.argv[1:]  # all CLI args after split.py
    for path in paths:
        parseFromFile(path)  # call the function for each file

if __name__ == "__main__":
    main()
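* This assumes a line contains at most one \item.
* This does not ignore the end lines. You can add an if check for that, or simply remove them from the last file by hand.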
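One possible shape for that extra check, as a sketch only. It assumes that a blank line closes the current item (the blank line itself is kept, as the sample asks) and that \item always sits at the start of a line. It also writes each output file as it goes instead of collecting the items in a list first. The function name split_streaming is made up; the output naming follows the code above.

```python
import sys

def split_streaming(filepath):
    """Copy each \\item block to filepath0.txt, filepath1.txt, ... line by line."""
    written = []
    out = None
    with open(filepath) as src:
        try:
            for line in src:
                if line.startswith("\\item"):
                    # A new item begins: finish the previous file, open the next.
                    if out is not None:
                        out.close()
                    name = f"{filepath}{len(written)}.txt"
                    out = open(name, "w")
                    written.append(name)
                if out is not None:
                    out.write(line)
                    if not line.strip():
                        # Assumption: a blank line ends the item, so later
                        # lines are skipped until the next \item.
                        out.close()
                        out = None
        finally:
            if out is not None:
                out.close()
    return written

if __name__ == "__main__":
    for path in sys.argv[1:]:
        print(path, "->", split_streaming(path))
```

If blank lines can also occur in the middle of an item, this rule would cut that item short, so it only fits files laid out like the sample in the question.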