Using fsspec with the pandas.DataFrame.to_csv command

Problem description

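I want to write a CSV file from a pandas DataFrame on a remote machine that I connect to over SFTP/SSH. Does anybody know how to pass the "storage_options" parameter correctly?

The pandas documentation says the value of this parameter must be a dict, but I don't understand exactly what should go in it.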


This is the call I run every time:

hits_df.to_csv('hits20.tsv',compression='gzip',index='False',chunksize=1000000,storage_options={???})

What am I doing wrong?

Answer

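If you don't have credentials for the cloud storage, you can still access public data by requesting an anonymous connection like this: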

pd.read_csv('name',<other fields>,storage_options={"anon": True})

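Otherwise, pass storage_options as a dict. The name and key come from your cloud provider (Amazon S3, Google Cloud, Azure, etc.), and the exact key names depend on the fsspec backend: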

pd.read_csv('name',
            storage_options={'account_name': ACCOUNT_NAME, 'account_key': ACCOUNT_KEY})

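You will find the set of values to use by experimenting directly with the implementing backend, SFTPFileSystem. Whatever kwargs work there are the same ones that go into storage_options. Short story: paramiko is not the same as command-line SSH, so some trial and error is needed.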
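For the SFTP case in the question, a minimal sketch might look like the following. It assumes the target is addressed with an sftp:// URL (as far as I know, pandas only accepts storage_options when the path is an fsspec URL, and raises a ValueError for a plain local path such as 'hits20.tsv'), and that the dict holds the keyword arguments SFTPFileSystem forwards to paramiko's connect call. The helper names, host, path, and credentials are hypothetical placeholders, and paramiko has to be installed.

```python
def build_sftp_url(host, remote_path):
    """Return an fsspec-style URL such as sftp://host/abs/path."""
    return f"sftp://{host}/{remote_path.lstrip('/')}"


def write_remote_tsv(df, host, remote_path, **ssh_kwargs):
    """Write df as a gzipped TSV to a remote host over SFTP.

    ssh_kwargs are assumed to be the connection arguments that work when
    constructing SFTPFileSystem directly (e.g. username, password, port).
    """
    url = build_sftp_url(host, remote_path)
    df.to_csv(
        url,
        sep="\t",
        compression="gzip",
        index=False,  # a real boolean; the string 'False' would be truthy
        storage_options=ssh_kwargs,
    )
    return url


# Example call (placeholder host and credentials, not executed here):
# write_remote_tsv(hits_df, "example-host", "/home/user/hits20.tsv.gz",
#                  username="user", password="secret")
```

If the connection fails, try the same keyword arguments on SFTPFileSystem directly first; that usually makes the paramiko error easier to read.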

If you already have something working through the filesystem class, you can take this alternative route:

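import pandas as pd
from fsspec.implementations.sftp import SFTPFileSystem

fs = SFTPFileSystem(...)  # fill in the host and the connection kwargs that work for you
# same as fs = fsspec.filesystem("ssh", ...)
with fs.open("my/file/path", "rb") as f:
    df = pd.read_csv(f, **other_kwargs)  # other_kwargs: dict of extra read_csv options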
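Since the question is about writing, the same route should work in the other direction. This is only a sketch: the helper name is hypothetical, fs is assumed to be an already-connected filesystem object, and the file is opened in text mode so that pandas can write strings to it without any compression step.

```python
def write_csv_via_fs(df, fs, remote_path, **to_csv_kwargs):
    """Write df to remote_path through an already-connected fsspec filesystem."""
    # Text mode ("w"): the filesystem hands back a text file object,
    # so DataFrame.to_csv can write str data to it directly.
    with fs.open(remote_path, "w") as f:
        df.to_csv(f, **to_csv_kwargs)
    return remote_path


# Example call (not executed here; fs comes from SFTPFileSystem as shown earlier):
# write_csv_via_fs(hits_df, fs, "my/file/path", sep="\t", index=False)
```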

fsspec pandas python ssh