How to load a dataset in streaming mode on Google Colab?

Problem description

I am trying to save some disk space so that I can use the CommonVoice French dataset (19 GB) on Google Colab, because my notebook keeps crashing when it runs out of disk space. I saw in the HuggingFace documentation that a dataset can be loaded in streaming mode, so that we can iterate over it directly without having to download the entire dataset. I tried to use that mode in Google Colab but could not get it to work, and I have not found anything about this issue on SO.

I get an error when I run the following:

!pip install datasets
!pip install 'datasets[streaming]'
!pip install aiohttp

from datasets import load_dataset

common_voice_train = load_dataset("common_voice", "fr", split="train", streaming=True)

Is there a reason why Google Colab does not allow loading a dataset in streaming mode?

If not, what am I missing?

Solution

Writing up an answer for future reference. Following @kkgarg's comment, it seems that the streaming feature is not implemented yet.

!pip install aiohttp
!pip install datasets

from datasets import load_dataset

common_voice_train = load_dataset("common_voice", "fr", split="train", streaming=True)

triggers the following error:

/usr/local/lib/python3.7/dist-packages/datasets/utils/streaming_download_manager.py in _get_extraction_protocol(self,urlpath)
    137         elif path.endswith(".zip"):
    138             return "zip"
--> 139         raise NotImplementedError(f"Extraction protocol for file at {urlpath} is not implemented yet")
    140 
    141     def download_and_extract(self,url_or_urls):

NotImplementedError: Extraction protocol for file at https://voice-prod-bundler-ee1969a6ce8178826482b88e843c335139bd3fb4.s3.amazonaws.com/cv-corpus-6.1-2020-12-11/tr.tar.gz is not implemented yet

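This indicates that streaming is not yet implemented or supported for this dataset. It may be because common_voice is distributed as a .tar.gz archive that has to be extracted, which streaming does not support (?). The feature itself certainly exists, since it is in the documentation...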
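To make the failure mode concrete, here is a minimal sketch of the kind of suffix check the traceback points at. It is not the library's actual code: `guess_extraction_protocol`, `can_stream`, and `SUPPORTED_SUFFIXES` are hypothetical names, the example.com URLs are placeholders, and the table holds only the `.zip` branch visible in the traceback, so the real check may accept more formats.

```python
# Simplified, hypothetical stand-in for the check seen in the traceback.
# Only the ".zip" branch is taken from the traceback; everything else
# is an assumption made for illustration.
SUPPORTED_SUFFIXES = {".zip": "zip"}


def guess_extraction_protocol(urlpath: str) -> str:
    """Return the extraction protocol for a URL, or raise if none matches."""
    for suffix, protocol in SUPPORTED_SUFFIXES.items():
        if urlpath.endswith(suffix):
            return protocol
    # Same message format as the error in the traceback.
    raise NotImplementedError(
        f"Extraction protocol for file at {urlpath} is not implemented yet"
    )


def can_stream(urlpath: str) -> bool:
    """Report whether this sketch would accept the archive at urlpath."""
    try:
        guess_extraction_protocol(urlpath)
    except NotImplementedError:
        return False
    return True


# Placeholder URLs: a .zip archive passes the check, a .tar.gz one does not.
print(can_stream("https://example.com/data.zip"))
print(can_stream("https://example.com/cv-corpus/fr.tar.gz"))
```

Under these assumptions, any archive whose suffix falls outside the supported set fails before a single example is read, which would be consistent with the error appearing for the common_voice .tar.gz bundle.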

google-colaboratory huggingface-datasets huggingface-transformers python
