Best way to write to a file

Question

Which way of writing to a file is better?

# 1 way
whole_data = ""
for file_name in list_of_files:
    r_file = open(file_name,'r')
    whole_data += r_file.read()
    r_file.close()
with open("destination_file.txt",'w') as w_file:
    w_file.write(whole_data)


# 2 way
for file_name in list_of_files:
    r_file = open(file_name,'r')
    with open("destination_file.txt",'a') as w_file:
        w_file.write(r_file.read())
    r_file.close()

# separate open/close for write
w_file = open("destination_file.txt",'w')
for file_name in list_of_files:
    with open(file_name,'r') as r_file:
        w_file.write(r_file.read())
w_file.close()

Way 1 first collects all the data into one big string and then writes it to the destination file. Way 2 reads each file and immediately appends its data to the destination file. I have used both ways in my code, but I am not sure which one is better. Do you know the pros and cons of these two approaches? If you know a better practice, please share it. // Edit: added a third way

Solution

"The with statement automatically closes the file once you leave the with block, even if an error occurs. I highly recommend you use the with statement as much as possible, as it allows for cleaner code and makes handling any unexpected errors easier for you."

check this out
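As a quick illustration of that guarantee (a hypothetical sketch, not part of the original answer): the file object is closed even when the body of the with block raises.

```python
# 'with' closes the file even if the body raises an exception.
f = None
try:
    with open("demo.txt", "w") as f:
        f.write("hello")
        raise ValueError("something went wrong")
except ValueError:
    pass

print(f.closed)  # True: the file was closed despite the exception
```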

---

Intuitively, the second way "feels" faster, but you can always try both and time them.

---

The timeit module takes two strings, a statement (stmt) and a setup. It runs the setup code once, then runs the stmt code n times and reports the total time taken.

import timeit

setup = '''
def func_one(n):
    # assumes list_of_files is defined in your setup code
    for _ in range(n):
        whole_data = ""
        for file_name in list_of_files:
            with open(file_name, 'r') as r_file:
                whole_data += r_file.read()
        with open("destination_file.txt", 'w') as w_file:
            w_file.write(whole_data)
'''

stmt = 'func_one(10)'

# Shows the time taken to do this func 10 times
print(timeit.timeit(stmt, setup=setup, number=1))

The reason I write to the file 10 times is so that timeit can find an exact value instead of a rounded one.

You can do the same with the second way -

setup = '''
def func_two(n):
    for _ in range(n):
        for file_name in list_of_files:
            with open(file_name, 'r') as r_file:
                with open("destination_file.txt", 'a') as w_file:
                    w_file.write(r_file.read())
'''

stmt = 'func_two(10)'

# Shows the time taken to do this func 10 times
print(timeit.timeit(stmt, setup=setup, number=1))

Then you can compare the printed times.

I know this is overkill. But sometimes you just can't tell which one will be faster by looking at the code.

---

If you have a limited number of small files, I guess you won't notice any difference, but if you process many very large files with the first approach you will consume a lot of memory for basically no reason, so the second approach is definitely more scalable.

That said, you probably don't need to re-open (and implicitly close) the output file on every iteration, which may slow things down depending on the OS, disk/network performance, and so on. You can refactor the code like this:

with open("destination_file.txt",'a') as w_file:
    for file_name in list_of_files:
        with open(file_name,'r') as r_file:
            w_file.write(r_file.read())
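If the individual files are themselves large, you can also avoid pulling each one into memory with read(); shutil.copyfileobj copies between file objects in fixed-size chunks. A sketch (the sample file names are hypothetical):

```python
import shutil

# hypothetical sample inputs
list_of_files = ["part1.txt", "part2.txt"]
for name, text in zip(list_of_files, ["first file\n", "second file\n"]):
    with open(name, "w") as f:
        f.write(text)

with open("destination_file.txt", "wb") as w_file:
    for file_name in list_of_files:
        with open(file_name, "rb") as r_file:
            # copyfileobj copies in fixed-size chunks (64 KiB by default),
            # so memory use stays flat regardless of file size
            shutil.copyfileobj(r_file, w_file)
```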
---

Here is a best-effort test harness, but I can't stress enough how little it proves in practice. Over 3-4 runs (of 10K trials each), each method came out ahead at least once, and only by 0.1 s - 0.2 s (over 10K trials!). That said, I was running some IO-heavy ML models on my workstation at the time, so others may produce more reliable numbers. In any case, I'd say this is a stylistic choice and performance is not a major concern.

I made some syntactic changes (nesting where appropriate) and moved each approach into a function after setting up some files. You may also find different numbers if you change the number of lines per file, as @gimix says. Per his answer, the whole-data approach also needlessly uses lots of memory, so that may be the deciding factor for clean, performant, future-proof code.

import timeit

test_files = []

for n in range(100):
    file_name = f'test_file_{n}.txt'
    with open(file_name,'w') as f:
        for i in range(10):
            f.write(f'{i}\n')
        test_files.append(file_name)


def whole_data():
    data = ""
    for file in test_files:
        with open(file,'r') as fr:
            data += fr.read()
        
    with open('whole_data_file.txt','w') as fw:
        fw.write(data)


def file_by_file():
    with open('line_by_line_file.txt','w') as fw:
        for file in test_files:
            with open(file,'r') as fr:
                fw.write(fr.read())


print('Whole data method:',timeit.timeit("whole_data()",globals=globals(),number=10_000))
# Whole data method: 10.38545351603534
# Whole data method: 10.356000136991497

print('File by file method:',timeit.timeit("file_by_file()",globals=globals(),number=10_000))
# File by file method: 10.356590001960285
# File by file method: 10.507033439003862

Note that all of the above may take over a minute if you are not running on an SSD (I used a high-performance NVMe SSD).

---

OK, I did some testing. The results were not what I expected. First, I thought the += operator in the first test would be the one to cause memory problems; the memory problem actually occurred only in the second "way" of saving the file. I wanted to make the operation more demanding, so I added a character replacement on the input files.

The only result I expected was the time difference between way 1 and way 3 (separate). For just 30 files:

way 1: 0.26499 sec
way 2: 0.648 sec
way 3: 0.242 sec

With more than 30 files, way 2 raised a "MemoryError", so I excluded it from the test.

For all of them (222 large files) the results were very predictable:

way 1: 39 sec
way 3: 1.577 sec

Code:

from os import listdir
from os.path import isfile,join
import time

my_path = r"path_to_files"
list_of_files = [join(my_path,f) for f in listdir(my_path) if isfile(join(my_path,f))]
print(len(list_of_files))

repeat = 20


if True:  # 39.381000042 sec
    # clear destination_file.txt
    with open("destination_file.txt",'w') as w_file:
        w_file.write("")

    now = time.time()  # start counting
    for i in range(repeat):
        # 1 way
        whole_data = ""
        for file_name in list_of_files:
            with open(file_name,'r') as r_file:
                tmp = r_file.read().replace('d','A')
                whole_data += tmp
        with open("destination_file.txt",'w') as w_file:
            w_file.write(whole_data)

    print(time.time() - now)  # print time elapsed
    # --------------- 1 way ---------------


if True:  # MemoryError
    # clear destination_file.txt
    with open("destination_file.txt",'w') as w_file:
        w_file.write("")

    now = time.time()  # start counting
    for i in range(repeat):
        # 2 way
        for file_name in list_of_files:
            with open(file_name,'r') as r_file:
                with open("destination_file.txt",'a') as w_file:
                    tmp = r_file.read().replace('d','A')  # MemoryError
                    w_file.write(tmp)

    print(time.time() - now)  # print time elapsed
    # --------------- 2 way ---------------


if True:  # 1.53500008583 sec
    # clear destination_file.txt
    with open("destination_file.txt",'w') as w_file:
        w_file.write("")

    now = time.time()  # start counting
    for i in range(repeat):
        # 3 way: separate open/close for write
        w_file = open("destination_file.txt",'w')
        for file_name in list_of_files:
            with open(file_name,'r') as r_file:
                tmp = r_file.read().replace('d','A')
                w_file.write(tmp)
        w_file.close()

    print(time.time() - now)  # print time elapsed
    # --------------- 3 way (separate) ---------------
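The MemoryError above comes from read() pulling an entire large file into memory at once. A chunked variant (a hypothetical helper, not from the original test) keeps memory bounded; it is safe here because the replacement is a single character, so a match can never straddle a chunk boundary:

```python
def concat_with_replace(list_of_files, dest, chunk_size=1024 * 1024):
    # Read each source file in fixed-size chunks instead of one huge read(),
    # so memory use is bounded by chunk_size rather than by file size.
    with open(dest, 'w') as w_file:
        for file_name in list_of_files:
            with open(file_name, 'r') as r_file:
                while True:
                    chunk = r_file.read(chunk_size)
                    if not chunk:
                        break
                    w_file.write(chunk.replace('d', 'A'))
```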
---

For many files, the second way looks both safer and faster.

from os import listdir
from os.path import isfile,join
import timeit

my_path = r"./"
list_of_files = [f for f in listdir(my_path) if isfile(join(my_path,f))]
test_files = list_of_files
print(len(test_files),"files ~6.5kB per file")


def whole_data(amount):
    data = ""
    for file in test_files[:amount]:
        with open(file,'rb') as fr:
            data += str(fr.read())
        
    with open('whole_data_file.txt','w') as fw:
        fw.write(data)


def file_by_file(amount):
    with open('line_by_line_file.txt','w') as fw:
        for file in test_files[:amount]:
            with open(file,'rb') as fr:
                fw.write(str(fr.read()))


# all files are taken from the game Witcher 3
# 100 files ~6.5kB per file
print('Whole data method 20/number=100:',timeit.timeit("whole_data(20)",number=100))
print('Whole data method 100/number=20:',timeit.timeit("whole_data(100)",number=20 ))
# Whole data method 20/number=100: 285.6315555
# Whole data method 100/number=20: 495.45210849999995

print('File by file method 20/number=100:',timeit.timeit("file_by_file(20)",number=100))
print('File by file method 100/number=20:',timeit.timeit("file_by_file(100)",number=20 ))
# File by file method 20/number=100: 212.43927700000006
# File by file method 100/number=20: 205.07520319999992