Is there a way to create a torch.utils.data.DataLoader object from a torchtext.TabularDataset?

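Problem description

I am new to machine learning and am struggling with a data preprocessing task.

I am using PyTorch, and what I want to do is classify text with a simple RNN model. I don't know whether the model's architecture is relevant at this point.

To give you some context, I get my data from two JSON files: a train.json file and a test.json file.

Each line of these files looks like this: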

{"text" : ["word_1","word_2",...,"word_n"],"label" : "actual_label"},

where the "text" key holds a list of words and "label" is the actual label of that text.

So far, I have something like this:

import torch
from torchtext import data

TEXT = data.Field()
LABEL = data.LabelField()

# Map each JSON key to an (attribute name, Field) pair.
fields = {'text': ('text', TEXT), 'label': ('label', LABEL)}

train_data, test_data = data.TabularDataset.splits(
    path='data_path_folder',
    train='train.json',
    test='test.json',
    format='json',
    fields=fields,
)

Then I build a vocabulary for my data:

# max_size (the vocabulary size cap) is defined elsewhere.
TEXT.build_vocab(train_data, max_size=max_size)
LABEL.build_vocab(train_data)

And the iterators:

batch_size = 64
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

train_iter, test_iter = data.BucketIterator.splits(
    (train_data, test_data),
    sort_key=lambda x: len(x.text),
    batch_size=batch_size,
    device=device,
)

Now, in the training loop, I can easily iterate over the data like this:

for batch in train_iter:
    batch.text   # a tensor of shape [max number of words in a sentence, number of sentences]
    batch.label  # a tensor of shape [number of sentences]

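My first question is where my data gets converted to tensors. Does that happen when I call data.BucketIterator.splits, or does it have something to do with building the vocabulary?

What I want to do is use torch.utils.data.DataLoader to handle the data.

The goal is to iterate over the data like this: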

for batch_idx, (text, label) in enumerate(data_loader):  # data_loader: a torch.utils.data.DataLoader instance
    text   # should be a tensor of shape [max number of words in a sentence, number of sentences]
    label  # should be a tensor of shape [number of sentences]

I tried keeping the data.TabularDataset.splits part and creating a data loader from there, but without success.

Can anyone help me with this?

Thanks in advance!

Solution
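
One possible direction, offered as a minimal sketch rather than a verified answer: as far as I understand the legacy torchtext API, build_vocab only builds the token-to-index mapping, and the conversion to tensors happens later, when the iterator assembles a batch by padding the token lists and looking up their indices. A DataLoader can do the same work in a custom collate_fn. To stay runnable without torchtext, the sketch below uses hypothetical stand-ins that are not part of the question: an Example namedtuple in place of the parsed JSON lines, and two hand-written dicts (text_stoi, label_stoi) in place of TEXT.vocab.stoi and LABEL.vocab.stoi.

```python
from collections import namedtuple

import torch
from torch.utils.data import DataLoader

# Hypothetical stand-ins for the torchtext objects in the question:
# Example mimics one parsed JSON line; the two dicts mimic the
# token-to-index mappings that build_vocab would produce.
Example = namedtuple("Example", ["text", "label"])
examples = [
    Example(["a", "short", "text"], "pos"),
    Example(["a", "much", "longer", "text", "here"], "neg"),
    Example(["unseen", "word"], "pos"),
]
text_stoi = {"<unk>": 0, "<pad>": 1, "a": 2, "text": 3,
             "short": 4, "much": 5, "longer": 6, "here": 7}
label_stoi = {"pos": 0, "neg": 1}


def collate(batch):
    """Pad and numericalize a list of examples, returning (text, label)."""
    max_len = max(len(ex.text) for ex in batch)
    rows = []
    for ex in batch:
        # Unknown tokens fall back to the <unk> index.
        ids = [text_stoi.get(tok, text_stoi["<unk>"]) for tok in ex.text]
        # Right-pad every sentence to the longest one in this batch.
        ids += [text_stoi["<pad>"]] * (max_len - len(ids))
        rows.append(ids)
    # Transpose to [max_len, batch_size], the layout described in the question.
    text = torch.tensor(rows, dtype=torch.long).t()
    label = torch.tensor([label_stoi[ex.label] for ex in batch],
                         dtype=torch.long)
    return text, label


# A plain list works as a map-style dataset: it has __getitem__ and __len__.
loader = DataLoader(examples, batch_size=2, shuffle=False, collate_fn=collate)

for batch_idx, (text, label) in enumerate(loader):
    print(batch_idx, tuple(text.shape), tuple(label.shape))
```

With the legacy torchtext objects from the question, it may be possible to pass train_data to DataLoader directly and shrink the collate function to calls such as TEXT.process([ex.text for ex in batch]) and LABEL.process([ex.label for ex in batch]). That is an assumption about the legacy Field API and should be checked against the installed torchtext version. Note also that a plain DataLoader does not group sentences of similar length the way BucketIterator does, so batches may contain more padding.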
