Transformers v4.x: Convert slow tokenizer to fast tokenizer

Problem

I am following the Transformers example for the pretrained model xlm-roberta-large-xnli:

from transformers import pipeline
classifier = pipeline("zero-shot-classification",model="joeddav/xlm-roberta-large-xnli")

I get the following error:

ValueError: Couldn't instantiate the backend tokenizer from one of: (1) a `tokenizers` library serialization file,(2) a slow tokenizer instance to convert or (3) an equivalent slow tokenizer class to instantiate and convert. You need to have sentencepiece installed to convert a slow tokenizer to a fast one.

I am using Transformers version '4.1.1'.

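Solution

According to the Transformers v4.0.0 release notes, sentencepiece was removed as a required dependency. This means that

"The tokenizers depending on the SentencePiece library will not be available with a standard transformers installation"

including the XLMRobertaTokenizer. However, sentencepiece can be installed as an extra dependency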

pip install transformers[sentencepiece]

or

pip install sentencepiece

if you already have transformers installed.

If you are in Google Colab:

Factory reset the runtime.
Upgrade pip with (!pip install --upgrade pip)
Install sentencepiece with (!pip install sentencepiece)
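
After installing by either route, it can help to confirm that both packages are visible to the interpreter before building the pipeline again. The following is a minimal sketch using only the standard library (Python 3.8+); the helper name installed_version is hypothetical and not part of Transformers.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent.

    Hypothetical helper for illustration; it only reads package metadata.
    """
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Report the state of the two packages the error message refers to.
for name in ("transformers", "sentencepiece"):
    version = installed_version(name)
    if version is None:
        print(f"{name}: not installed (try: pip install {name})")
    else:
        print(f"{name}: {version}")
```

If sentencepiece shows a version here but the ValueError persists in a notebook, the session may have imported transformers before the install; restarting the runtime and re-running the cells is usually worth trying.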
