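Reading text files efficiently in Python

Question

What is the "best" way to search a large number of text files for occurrences of a string using Python?

As I understand it, we can use the following: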

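for path in files:
    # open each path in turn (the original opened the literal "file.txt" every time)
    with open(path) as f:
        # iterating over the file object reads it lazily, one line at a time
        for line in f:
            pass  # do stuff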

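Python buffers the file in chunks behind the scenes, so the I/O penalty is less severe than it first appears. This would be my go-to if I only had to read a few files at most.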
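A minimal sketch of the line-by-line approach applied to an actual substring search. The function name `find_in_files`, the UTF-8 default, and the shape of the returned tuples are assumptions made for illustration; they are not part of the question.

```python
def find_in_files(paths, needle, encoding="utf-8"):
    """Return (path, line_number, line) tuples for every line containing needle."""
    hits = []
    for path in paths:
        # errors="replace" keeps the scan going on files that are not valid text
        with open(path, encoding=encoding, errors="replace") as f:
            # the file object is iterated lazily, so only one line is held at a time
            for line_number, line in enumerate(f, start=1):
                if needle in line:
                    hits.append((str(path), line_number, line.rstrip("\n")))
    return hits
```

Calling `find_in_files(files, "some string")` would return an empty list when nothing matches.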

But for a list of files (or os.walk), I could also do the following:

for path in files:
    # read the whole file into memory, then close it before scanning
    with open(path) as f:
        lines = list(f)
    for line in lines:
        pass  # do stuff
    # Or a variation on this

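If I have to read hundreds of files, I would like to load them all into memory before scanning them. The logic here is to keep file access time to a minimum (and let the OS do its filesystem magic) and to keep the processing logic minimal, since I/O is usually the bottleneck. This will obviously cost more memory, but will it improve performance?

Are my assumptions correct here, and/or is there a better way to do this? If there is no clear answer, what would be the best way to measure this in Python?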
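One way to answer the measurement question is to time both variants on the same set of files. The sketch below is only an illustration: the names `scan_streaming`, `scan_preloaded`, and `time_call` are hypothetical helpers, and counting matching lines stands in for the "do stuff" step.

```python
import time


def scan_streaming(paths, needle):
    """Count matching lines while iterating over each file lazily."""
    count = 0
    for path in paths:
        with open(path, encoding="utf-8", errors="replace") as f:
            for line in f:
                if needle in line:
                    count += 1
    return count


def scan_preloaded(paths, needle):
    """Count matching lines after reading each file fully into a list."""
    count = 0
    for path in paths:
        with open(path, encoding="utf-8", errors="replace") as f:
            lines = list(f)
        for line in lines:
            if needle in line:
                count += 1
    return count


def time_call(func, *args, repeat=5):
    """Return the best wall-clock time in seconds over `repeat` runs."""
    best = float("inf")
    for _ in range(repeat):
        start = time.perf_counter()
        func(*args)
        best = min(best, time.perf_counter() - start)
    return best
```

Comparing `time_call(scan_streaming, files, "needle")` with `time_call(scan_preloaded, files, "needle")` gives a rough answer for a particular machine. Keep in mind that after the first run the operating system has usually cached the files, so repeated runs mostly measure cached reads rather than cold disk access.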

Solution

Is this premature optimization?

Have you actually profiled the whole process, and does it really need to be faster? See: https://stackify.com/premature-optimization-evil/
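To check where the time actually goes before optimizing anything, the standard-library profiler can be wrapped around the search routine. This is a rough sketch; `profile_call` is a hypothetical helper name, and the function being profiled is whatever search function you already have.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, top=10):
    """Run func(*args) under cProfile and return (result, report_text)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # sort by cumulative time so the most expensive call chains come first
    stats.sort_stats("cumulative").print_stats(top)
    return result, buffer.getvalue()
```

Printing the returned report shows whether the time is dominated by file reads or by the per-line processing, which tells you if I/O is really the bottleneck.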

If you really do need to speed it up, consider a threaded approach, since the work is I/O bound.

A simple way to do that is with ThreadPoolExecutor, see: https://docs.python.org/3/library/concurrent.futures.html#threadpoolexecutor
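A minimal sketch of the threaded approach, assuming the goal is to find which files contain a given string. The names `file_contains` and `search_files_threaded` and the default of 8 workers are illustrative assumptions, not recommendations from the answer.

```python
from concurrent.futures import ThreadPoolExecutor


def file_contains(path, needle):
    """Return True if any line of the file at path contains needle."""
    with open(path, encoding="utf-8", errors="replace") as f:
        return any(needle in line for line in f)


def search_files_threaded(paths, needle, max_workers=8):
    """Return the paths whose contents contain needle, scanning files concurrently."""
    paths = list(paths)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() yields results in the same order as the input paths
        flags = pool.map(lambda p: file_contains(p, needle), paths)
        return [p for p, found in zip(paths, flags) if found]
```

Threads only help while a worker is waiting on the disk; the substring check itself still runs under the GIL. Whether this beats the sequential loop depends on the storage and on how warm the file cache is, so it is worth timing both.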

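Another approach (if you are on Linux) is simply to run some shell commands such as 'find' and 'grep'. These small C programs are highly optimized and will almost certainly be the fastest solution. You could wrap these commands with Python.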
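Wrapping grep from Python could look roughly like the sketch below. It assumes a `grep` binary is available on PATH; `grep_files` is a hypothetical helper name. The flags used are `-l` (print only the names of matching files) and `-F` (treat the pattern as a fixed string rather than a regular expression).

```python
import shutil
import subprocess


def grep_files(paths, needle):
    """Return the subset of paths containing needle, as reported by grep."""
    paths = [str(p) for p in paths]
    if not paths:
        return []
    grep = shutil.which("grep")
    if grep is None:
        raise RuntimeError("grep was not found on PATH")
    # "--" ends option parsing so a needle starting with "-" is treated as a pattern
    result = subprocess.run(
        [grep, "-l", "-F", "--", needle, *paths],
        capture_output=True,
        text=True,
    )
    # grep exits with 0 on a match, 1 on no match, and a higher code on error
    if result.returncode > 1:
        raise RuntimeError(result.stderr.strip())
    return result.stdout.splitlines()
```

Passing every path on one command line can hit the operating system's argument-length limit for very large file lists, so in that case the paths would need to be sent in batches.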

Regular expressions are not faster, so what @Abdul Rahman Ali said is not correct:

$ python -m timeit '"aaaa" in "bbbaaaaaabbb"'
10000000 loops, best of 3: 0.0767 usec per loop
$ python -m timeit -s 'import re; pattern = re.compile("aaaa")' 'pattern.search("bbbaaaaaabbb")'
1000000 loops, best of 3: 0.356 usec per loop

The best way to search for a pattern in text is to use regular expressions:

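import re

list_of_wanted_word = []
# use a context manager so the file is closed when the loop finishes
with open('folder.txt') as f:
    for line in f:
        # find the run of lowercase letters at the start of the line and extract it
        wanted_word = re.findall(r'(^[a-z]+)', line)
        # put each extracted word in the list
        for k in wanted_word:
            list_of_wanted_word.append(k)
print(list_of_wanted_word)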
