Reading HDF files from Google Cloud Storage with pandas

Problem description

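Greetings, fellow coders and Google Cloud developers and professionals. I am trying to read a list of HDF files from Google Cloud Storage using the built-in pandas method pd.read_hdf(), where the file names look like "client1.h". My problem is that I always get this error: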

NotImplementedError: Support for generic buffers has not been implemented.

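After searching extensively across different forums and websites, I found that many people have run into the same problem, but no solution was offered.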

The code I am using is as follows:

import pandas as pd
from google.cloud import storage

# Authenticate with a service account key file
storage_client = storage.Client.from_service_account_json('file___.json')
bucket = storage_client.get_bucket('my_bucket_name')
blob = bucket.blob("data1.h")

# Passing the Blob object directly raises the NotImplementedError above
df = pd.read_hdf(blob, mode='r+')
print(df)

I also tried the code below, but got the same error:

import io

blob = bucket.blob("data1.h")
data = blob.download_as_string()  # also tried download_as_bytes() and download_as_text()
df = pd.read_hdf(io.BytesIO(data), mode='r+')

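When I download a file to my local environment and read it by its path, it works without any problem. Unfortunately, I have a large number of files in Cloud Storage, so I cannot download them all to work with them together.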

Please, if anyone has a solution or a suggestion, I would be grateful if you shared it.

Solution

This feature does not appear to be implemented yet.

As you mentioned, downloading the file to the local file system first lets you use read_hdf(). That is a viable workaround.
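Since the concern is the number of files, the workaround can be applied one file at a time rather than downloading everything up front. Below is a minimal sketch of that idea, not a definitive implementation. The helper name read_hdf_blob and its reader parameter are hypothetical, introduced only for illustration; blob is assumed to be a google.cloud.storage Blob (or anything with a download_to_filename method), and pandas.read_hdf additionally requires PyTables to be installed.

```python
import os
import tempfile


def read_hdf_blob(blob, reader=None, **reader_kwargs):
    """Download one blob to a temporary file and read it from that path.

    Hypothetical helper. `blob` must provide download_to_filename(path),
    as google.cloud.storage.Blob does. `reader` defaults to pandas.read_hdf.
    """
    if reader is None:
        import pandas as pd  # imported lazily; read_hdf also needs PyTables
        reader = pd.read_hdf

    # read_hdf needs a real path on disk, so create a temporary file for it.
    fd, path = tempfile.mkstemp(suffix=".h5")
    os.close(fd)  # release our handle; the download and the reader reopen the file
    try:
        blob.download_to_filename(path)
        return reader(path, **reader_kwargs)
    finally:
        os.remove(path)  # only one downloaded file is on disk at any time
```

With the bucket object from the question, this could be used as `for blob in bucket.list_blobs(): df = read_hdf_blob(blob)`, so each file is removed from local disk as soon as it has been read. If a file holds more than one dataset, a key can be forwarded through the keyword arguments, for example read_hdf_blob(blob, key="some_key"), where "some_key" is a placeholder.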

For read_hdf() to work, you need to pass it a string path for which os.path.exists(path_or_buf) returns True. You may want to help the pandas developers implement this feature. If so, see the current implementation here.

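The problem you are running into has already been raised in the issues section of the pandas GitHub repository, but the users there only mention it happening with data in S3 (see here). You may want to share your case in that issue or open a new one. To open a new issue, go here.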

google-cloud-storage hdf hdf5 pandas python