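Problem description
I have a script that takes xlsx files, reformats them, and creates a txt document recording the reformatting. The script works well and does what I want. However, it is not as fast as I would like, because it does not make full use of multiprocessing. Sometimes there are only a handful of files to reformat in each "files_xlsx". If I remove processes.join(), it eventually crashes. Ideally, I would like it to work on multiple xlsx sheets across multiple "files_xlsx" lists and directories at once, but I have had no luck writing that code. Is there a simple way to adjust the current code so that it works on more xlsx files at a time?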
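Solution
To take full advantage of Python's multiprocessing library, the most direct approach is to use Pool.
See the modifications to the code below. Note that I have not modified def rename_sheets in any way.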
# From Python 3.4 onwards, you can use pathlib
import os
from multiprocessing import Pool
from pathlib import Path

import openpyxl


def convert_excel_txt(fil):
    # fil is the full path to one workbook.
    # Variable name *file* is not a good idea.
    # This function processes one and only one file
    # The multiprocessing is taken care of by Pool
    open_xl = openpyxl.load_workbook(fil)
    titles = open_xl.sheetnames
    # print(len(titles))
    # The log goes to Reference_Sheets inside the workbook's directory,
    # which is assumed to exist already
    workbook_path = Path(fil)
    log_path = workbook_path.parent / "Reference_Sheets" / (workbook_path.stem + ".txt")
    for count, title in enumerate(titles, start=1):
        # print("{}.| {}".format(count, title))
        # rename_sheets is the function from the question, unchanged
        sheet_title_value = rename_sheets(title, count, open_xl, fil)
        with open(log_path, 'a', encoding='utf-8') as outfile:
            outfile.write('\n' + str(count) + ". " + sheet_title_value)


if __name__ == "__main__":
    # The guard is required on Windows, where each worker re-imports this module
    report_type = "Annual"
    files_xlsx = []
    with open(r"C:\Python38\Projects\s_&p_500_links_test.txt", "r") as directories:
        for directory in directories:
            directory = directory.rstrip("\n")
            print(directory)
            folder = os.path.join(directory, report_type)
            files = os.listdir(folder)
            print(files)
            # Full paths, so the workers do not depend on the current directory
            files_xlsx += [os.path.join(folder, f) for f in files if f[-4:] == 'xlsx']
    # One pool for every workbook in every directory
    with Pool(24) as pool:
        pool.map(convert_excel_txt, files_xlsx)
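The Pool pattern can also be tried in isolation, without openpyxl or the real directory tree. The following is only a minimal sketch under stated assumptions: collect_xlsx, describe, and run_all are hypothetical names, describe stands in for convert_excel_txt, and the directory layout (one "Annual" folder per directory) is assumed from the code above.

```python
import tempfile
from multiprocessing import Pool
from pathlib import Path


def collect_xlsx(directories, report_type="Annual"):
    # Gather full paths from <directory>/<report_type> for every directory
    paths = []
    for directory in directories:
        folder = Path(directory) / report_type
        paths.extend(sorted(str(p) for p in folder.glob("*.xlsx")))
    return paths


def describe(path):
    # Hypothetical stand-in for convert_excel_txt: handles exactly one file
    return Path(path).name.upper()


def run_all(paths, workers=4):
    # map blocks until every item is done, so no explicit join() is needed;
    # results come back in the same order as the input
    with Pool(workers) as pool:
        return pool.map(describe, paths)


if __name__ == "__main__":
    # Build two throwaway directories so the sketch runs anywhere
    with tempfile.TemporaryDirectory() as root:
        layout = (("aaa", ["q1.xlsx", "q2.xlsx"]), ("bbb", ["q3.xlsx", "notes.txt"]))
        for company, names in layout:
            folder = Path(root) / company / "Annual"
            folder.mkdir(parents=True)
            for name in names:
                (folder / name).touch()
        dirs = [Path(root) / "aaa", Path(root) / "bbb"]
        print(run_all(collect_xlsx(dirs)))
```

Collecting every path first and calling map once lets the pool stay busy even when a single directory holds only a few workbooks.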
To time the different versions of the code, proceed as follows:
import time
import datetime
overall_start_time = time.time()
print('Started at ',time.strftime('%X %x %Z'))
# timed code goes here
print ("Time elapsed overall (hours:min:sec): %s" % str(datetime.timedelta(seconds=(time.time()- overall_start_time))))
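When comparing several versions, the same measurement can be wrapped in a small helper. This is just a sketch: timed is a hypothetical name, and it uses time.perf_counter() instead of time.time() because perf_counter is a monotonic clock intended for measuring intervals.

```python
import datetime
import time


def timed(func, *args, **kwargs):
    # Run func once and return its result with the elapsed wall-clock time
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = datetime.timedelta(seconds=time.perf_counter() - start)
    return result, elapsed


if __name__ == "__main__":
    total, elapsed = timed(sum, range(1_000_000))
    print("Result:", total)
    print("Time elapsed (hours:min:sec): %s" % elapsed)
```

Each version of the conversion can then be passed to timed in turn, with the same list of files, to see which finishes first.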
Reference: https://docs.python.org/2/library/multiprocessing.html