Torchtext 0.7 shows Field is being deprecated. What is the alternative?

Problem description

It looks like the previous paradigm of declaring Fields, Examples and using BucketIterator is deprecated and will move to legacy in 0.8. However, I can't seem to find an example of the new paradigm for custom datasets (as in, not the ones included in torch.datasets) that doesn't use Field. Can anyone point me at an up-to-date example?

Reference for the deprecation:

https://github.com/pytorch/text/releases

Solution

Browsing through torchtext's GitHub repo I stumbled over the README in the legacy directory, which is not documented in the official docs. The README links a GitHub issue that explains the rationale behind the change, as well as a migration guide.

If you just want to keep your existing code running with torchtext 0.9.0, where the deprecated classes have been moved to the legacy module, you have to adjust your imports:

# from torchtext.data import Field, TabularDataset
from torchtext.legacy.data import Field, TabularDataset

Alternatively, you can import the whole torchtext.legacy module as torchtext, as the README suggests:

import torchtext.legacy as torchtext
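
With that alias in place, code written against the old API keeps working unchanged. A minimal sketch (the Field arguments here are only illustrative):

import torch
import torchtext.legacy as torchtext

# the old declarative API remains available through the legacy namespace
TEXT = torchtext.data.Field(tokenize='spacy', lower=True)
LABEL = torchtext.data.LabelField(dtype=torch.float)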

There is a post regarding this. Instead of the deprecated Field and BucketIterator classes, it uses TextClassificationDataset together with a collator and other preprocessing. It reads a txt file, builds a dataset, and then a model. The post links to a complete working notebook and can be found at: https://mmg10.github.io/pytorch/2021/02/16/text_torch.html. Note that you need the 'dev' (nightly) build of PyTorch for it to work.

From the link above:

After tokenizing and building the vocabulary, you can build the dataset as follows:

def data_to_dataset(data, tokenizer, vocab):

    data = [(text, label) for (text, label) in data]

    text_transform = sequential_transforms(tokenizer.tokenize,
                                           vocab_func(vocab),
                                           totensor(dtype=torch.long))
    label_transform = sequential_transforms(lambda x: 1 if x == '1' else (0 if x == '0' else x),
                                            totensor(dtype=torch.long))

    transforms = (text_transform, label_transform)

    dataset = TextClassificationDataset(data, vocab, transforms)

    return dataset
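
For reference, the helpers used above (sequential_transforms, vocab_func, totensor and TextClassificationDataset) come from torchtext's experimental API; a plausible set of imports for that generation of torchtext (exact module paths may vary between releases) is:

import torch
from torchtext.experimental.functional import sequential_transforms, vocab_func, totensor
from torchtext.experimental.datasets.text_classification import TextClassificationDataset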

The collator is as follows:

import torch
from torch import nn

class Collator:  # enclosing class definition (name assumed; omitted in the original excerpt)

    def __init__(self, pad_idx):
        self.pad_idx = pad_idx

    def collate(self, batch):
        # batch is a list of (text_tensor, label) pairs produced by the dataset transforms
        text, labels = zip(*batch)
        labels = torch.LongTensor(labels)
        # pad every sequence in the batch to the length of the longest one
        text = nn.utils.rnn.pad_sequence(text, padding_value=self.pad_idx, batch_first=True)
        return text, labels

Then, you can build the dataloader with the typical torch.utils.data.DataLoader, passing the collate method via the collate_fn argument.
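
For completeness, a minimal sketch of that step (Collator matches the snippet above; train_data and the '<pad>' lookup are placeholders for the raw (text, label) pairs and the padding index of the vocabulary built earlier):

from torch.utils.data import DataLoader

pad_idx = vocab['<pad>']      # assumption: index of the padding token in the vocabulary built earlier
collator = Collator(pad_idx)

train_dataset = data_to_dataset(train_data, tokenizer, vocab)  # train_data: raw (text, label) pairs
train_loader = DataLoader(train_dataset,
                          batch_size=32,
                          shuffle=True,
                          collate_fn=collator.collate)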


The pipeline could then look something like this:

import torchtext as TT
import torch
from collections import Counter
from torchtext.vocab import Vocab
from torch.utils.data import DataLoader

# read the data

with open('text_data.txt','r') as f:
    data = f.readlines()
with open('labels.txt','r') as f:
    labels = f.readlines()


tokenizer = TT.data.utils.get_tokenizer('spacy','en') # can remove 'spacy' and use a simple built-in tokenizer
train_iter = zip(labels,data)
counter = Counter()

for (label,line) in train_iter:
    counter.update(tokenizer(line))
    
vocab = TT.vocab.Vocab(counter,min_freq=1)

text_pipeline = lambda x: [vocab[token] for token in tokenizer(x)]
# this is data-specific - adapt for your data
label_pipeline = lambda x: 1 if x == 'positive\n' else 0

class TextData(torch.utils.data.Dataset):
    '''
    very basic dataset for processing text data
    '''
    def __init__(self,labels,text):
        super(TextData,self).__init__()
        self.labels = labels
        self.text = text
        
    def __getitem__(self,index):
        return self.labels[index],self.text[index]
    
    def __len__(self):
        return len(self.labels)


def tokenize_batch(batch, max_len=200):
    '''
    collate function to use in DataLoader
    takes a batch from the text dataset and produces a tensor batch, converting text and labels
    through the tokenizer and labeler
    tokenizer is the global function text_pipeline
    labeler is the global function label_pipeline
    max_len is a fixed length; if the text is shorter than max_len it is left-padded with ones (the pad index)
    if the text is longer than max_len it is truncated, keeping only the last max_len tokens
    '''
    labels_list, text_list = [], []
    for _label, _text in batch:
        labels_list.append(label_pipeline(_label))
        text_holder = torch.ones(max_len, dtype=torch.int32)  # fixed-size tensor of length max_len
        processed_text = torch.tensor(text_pipeline(_text), dtype=torch.int32)
        pos = min(max_len, len(processed_text))  # was hard-coded to 200; use max_len instead
        text_holder[-pos:] = processed_text[-pos:]
        text_list.append(text_holder.unsqueeze(dim=0))
    return torch.FloatTensor(labels_list), torch.cat(text_list, dim=0)

train_dataset = TextData(labels, data)

train_loader = DataLoader(train_dataset, batch_size=2, shuffle=False, collate_fn=tokenize_batch)

lbl, txt = next(iter(train_loader))

It took me a while to find the solution myself. For prebuilt datasets, the new paradigm is like this:

from torchtext.experimental.datasets import AG_NEWS
train, test = AG_NEWS(ngrams=3)

or like this for custom-built datasets:

from torch.utils.data import DataLoader

def collate_fn(batch):
    texts, labels = [], []
    for label, txt in batch:
        texts.append(txt)
        labels.append(label)
    return texts, labels

dataloader = DataLoader(train, batch_size=8, collate_fn=collate_fn)
for idx, (texts, labels) in enumerate(dataloader):
    print(idx, texts, labels)
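
Each example in the experimental AG_NEWS datasets above is already a (label, token-id tensor) pair, and the vocabulary built during construction stays attached to the dataset. A quick sanity check (get_vocab is, as far as I can tell, part of the experimental dataset API) might look like:

label, token_ids = train[0]   # items are (label, tensor of token ids)
print(label, token_ids[:10])

vocab = train.get_vocab()     # vocabulary built while constructing the dataset
print(len(vocab))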

I have copied the examples from the Source.