Parallel download and extraction: how to maximize performance?

Problem description

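I want to download and extract 100 tar.gz files, each 1 GB in size. So far I have sped this up with multithreading and by avoiding disk IO through in-memory byte streams, but can anyone show me how to make it even faster (just out of curiosity)?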

from bs4 import BeautifulSoup
import requests
import tarfile

import multiprocessing as mp
from concurrent.futures import ThreadPoolExecutor

# Speed up by extracting only the files we need.
def select(members):
    for member in members:
        if any(ext in member.name for ext in [".tif", ".img"]):
            yield member

# For each URL, download the tar.gz and extract the necessary files.
def download_and_extract(url):
    # Read and decompress the response as a byte stream.
    with requests.get(url, stream=True) as r:
        r.raise_for_status()
        with tarfile.open(fileobj=r.raw, mode="r|gz") as tar:
            tar.extractall(members=select(tar))


# Download and extract the 96 1GB tar.gz files in parallel.
# get_asset_links() is not shown in the question.
links = get_asset_links()
# 3 * cpu count seemed to be fastest on a 4 core cpu
with ThreadPoolExecutor(3 * mp.cpu_count()) as executor:
    # Consume the results so exceptions raised in worker threads are not silently dropped.
    list(executor.map(download_and_extract, links))

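My current approach takes 20 to 30 minutes. I am not sure what the theoretically possible speedup is, but in case it helps, the download speed for a single file is 20 MB/s.

If anyone could indulge my curiosity, it would be greatly appreciated! Some of the things I have looked into are asyncio, aiohttp, aiomultiprocess, io.BytesIO and so on, but I could not get them to work with the tarfile library.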

Solution

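Your computation is most likely IO-bound. Compression is generally a slow task, especially with the gzip algorithm (newer algorithms can be much faster). From the information provided (about 100 GB processed in 20 to 30 minutes), the average read speed is about 70 MB/s. This means the storage throughput is at least roughly 140 MB/s. That looks completely normal and expected, especially if you are using an HDD or a slow SSD.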
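The throughput estimate can be checked with a few lines of arithmetic. This is only a rough sketch: it assumes the 1 GB figure is the compressed archive size and takes 1 GB = 1000 MB. The function name `average_throughput_mb_s` is made up for illustration.

```python
def average_throughput_mb_s(n_files, file_size_gb, minutes):
    """Average throughput in MB/s, assuming 1 GB = 1000 MB."""
    total_mb = n_files * file_size_gb * 1000
    return total_mb / (minutes * 60)


# Figures from the question: 100 files of 1 GB each, 20 to 30 minutes in total.
fast = average_throughput_mb_s(100, 1, 20)
slow = average_throughput_mb_s(100, 1, 30)
print(f"{slow:.0f} to {fast:.0f} MB/s")  # prints "56 to 83 MB/s"
```

The midpoint of that range is close to the 70 MB/s quoted above, which is roughly three to four times the 20 MB/s reported for a single download.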

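Besides this, it seems that you iterate over the files twice because of the member selection. Keep in mind that a tar.gz file is one big block of files packed together and then compressed with gzip. To iterate over the file names, the tar file already has to be partially decompressed. Depending on how tarfile is implemented (it may cache), this may not be a problem. If the total size of the discarded files is small, it may be better to simply decompress the whole archive in one go and then delete the files you do not need. Alternatively, if you have plenty of memory and the discarded files are not small, you can first decompress the archive into an in-memory virtual storage device, so that the discarded files are written to memory rather than to disk. This can be done natively on Linux systems.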
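One way to make sure the archive is walked only once is to loop over the stream explicitly and write each wanted member as it arrives. The following is a minimal sketch, not the question's code: `extract_selected` and `make_demo_archive` are hypothetical helpers, the demo archive is built in memory so the example runs without a network, and flattening paths to their base name is a simplifying assumption.

```python
import io
import os
import shutil
import tarfile
import tempfile

KEEP_EXTS = (".tif", ".img")  # same substrings the question filters on


def extract_selected(fileobj, dest, keep_exts=KEEP_EXTS):
    """Stream a tar.gz from fileobj once, writing only the wanted members to dest."""
    written = []
    with tarfile.open(fileobj=fileobj, mode="r|gz") as tar:
        for member in tar:  # members arrive in archive order
            if not member.isfile():
                continue
            if not any(ext in member.name for ext in keep_exts):
                continue  # skipped members are still decompressed, just not written
            # Assumption: flatten paths to the base name to keep the sketch short.
            target = os.path.join(dest, os.path.basename(member.name))
            src = tar.extractfile(member)  # must be read before the loop advances
            with open(target, "wb") as out:
                shutil.copyfileobj(src, out)
            written.append(target)
    return written


def make_demo_archive():
    """Build a tiny tar.gz in memory (hypothetical demo data)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for name in ("scene/a.tif", "scene/b.img", "scene/notes.txt"):
            data = name.encode()
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    buf.seek(0)
    return buf


with tempfile.TemporaryDirectory() as dest:
    paths = extract_selected(make_demo_archive(), dest)
    extracted_names = sorted(os.path.basename(p) for p in paths)
    print(extracted_names)  # prints ['a.tif', 'b.img']
```

In the question's setting, `r.raw` from the streaming `requests` response would presumably take the place of the in-memory buffer. Pointing `dest` at a tmpfs mount (on many Linux distributions `/dev/shm` is one) would be one way to try the in-memory variant described above, though whether that helps depends on available RAM.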
