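Google Cloud Storage gcsfs - reading a .tar file directly into Python

Question

I have a .tar file in GCS, and I would like to read it directly into Python, without the intermediate step of first downloading the file somewhere.

I was thinking of something like this: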

import tarfile
import gcsfs
fs = gcsfs.GCSFileSystem(project='my-google-project')

with fs.open('my_bucket/my_tar_file.tar','rb') as f:
    tarfile.open(f)  # does not work: the first argument is a file name, not a file object

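But f is already an open file object, so of course calling tarfile.open on it again doesn't work. Is this possible?

Solution

I used the tarfile library as @LaurentLAPORTE did, but implemented it differently. Open the tar file with the fs object, pass the resulting file object to tarfile.open through the fileobj argument, and loop over the archive members to get each file's contents.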

import tarfile
import gcsfs

fs = gcsfs.GCSFileSystem(project="your-project-here")

# Open the object in binary mode and hand the file object to tarfile.
with fs.open('your-bucket/test.tar', 'rb') as f:
    with tarfile.open(fileobj=f, mode='r:') as tr:
        for member in tr.getmembers():
            extracted = tr.extractfile(member)
            if extracted is None:
                continue  # not a regular file (e.g. a directory)
            content = extracted.read()
            # read() returns bytes, so decode to get text
            print(content.decode('utf-8'))

test.tar (also uploaded to my bucket) contains sample_file.txt.

[Screenshots: contents of sample_file.txt and the output of the test run]
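The same pattern can be tried without a bucket. The following is a minimal sketch, assuming an in-memory io.BytesIO object is an acceptable stand-in for the file object returned by fs.open; the helper names build_sample_tar and read_text_members, the docs/ directory, and the sample text are hypothetical and not part of the original answer.

```python
import io
import tarfile


def build_sample_tar() -> io.BytesIO:
    """Build a small in-memory tar archive (hypothetical stand-in for the GCS object)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        # A directory entry, which has no readable content.
        dir_info = tarfile.TarInfo(name="docs")
        dir_info.type = tarfile.DIRTYPE
        tar.addfile(dir_info)
        # A regular file with some text content.
        data = "hello from the archive\n".encode("utf-8")
        file_info = tarfile.TarInfo(name="docs/sample_file.txt")
        file_info.size = len(data)
        tar.addfile(file_info, io.BytesIO(data))
    buf.seek(0)  # rewind so a reader starts at the beginning
    return buf


def read_text_members(fileobj) -> dict:
    """Return {member name: decoded text} for every regular file in the archive."""
    contents = {}
    with tarfile.open(fileobj=fileobj, mode="r:") as tar:
        for member in tar:
            if not member.isfile():
                continue  # skip directories, links, and other special entries
            with tar.extractfile(member) as extracted:
                contents[member.name] = extracted.read().decode("utf-8")
    return contents


if __name__ == "__main__":
    print(read_text_members(build_sample_tar()))
```

With gcsfs, the object returned by fs.open('your-bucket/test.tar', 'rb') would presumably be passed to read_text_members in place of the BytesIO buffer.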

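The tarfile.open function accepts a fileobj argument:

If fileobj is specified, it is used as an alternative to a file object opened in binary mode for name. It is supposed to be at position 0.

So, this solution should work: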

import tarfile

import gcsfs


fs = gcsfs.GCSFileSystem(project="my-google-project")

# fileobj must be an open binary file, not the filesystem object itself.
with fs.open("my_bucket/my_tar_file.tar", "rb") as f:
    with tarfile.open(fileobj=f, mode="r:") as tar:
        for entry in tar:
            ...

Don't forget to close the file you opened through fs; opening it in a with block takes care of that.
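The two points above (the file object should be at position 0, and it should be closed afterwards) can be combined in one small helper. This is only a sketch under assumptions: the file object is seekable (io.BytesIO is used here as a stand-in for the gcsfs file), and the names make_unrewound_tar and list_tar_names are hypothetical.

```python
import io
import tarfile


def make_unrewound_tar() -> io.BytesIO:
    """Build an in-memory tar and leave the position at the end of the buffer."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        data = b"example"
        info = tarfile.TarInfo(name="a.txt")
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))
    return buf  # deliberately not rewound


def list_tar_names(raw) -> list:
    """Rewind raw, read the member names, and close raw when done."""
    with raw:  # closes the underlying file object on exit
        raw.seek(0)  # make sure the archive is read from position 0
        with tarfile.open(fileobj=raw, mode="r:") as tar:
            return tar.getnames()


if __name__ == "__main__":
    print(list_tar_names(make_unrewound_tar()))
```

Closing the TarFile does not close a file object that was passed in through fileobj, which is why the outer with block is still needed.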
