How do I enable sampling when using a unigram tokenizer in Hugging Face?

Problem description

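The Hugging Face Transformers documentation mentions that, when using a unigram tokenizer, "you can sample one of the tokenizations according to their probabilities". However, the documentation does not explain how to do this.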

In the source code of AlbertTokenizer and T5Tokenizer (XLNet), the _tokenize() method takes a "sample" argument, which in turn calls the SentencePiece encoding method with sampling enabled:

    def _tokenize(self, text, sample=False):
        """ Tokenize a string. """
        text = self.preprocess_text(text)

        if not sample:
            pieces = self.sp_model.EncodeAsPieces(text)
        else:
            pieces = self.sp_model.SampleEncodeAsPieces(text, 64, 0.1)
        ...
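
To make the idea of "sampling a tokenization according to its probability" concrete, here is a small pure-Python sketch. It is not the SentencePiece implementation: the vocabulary, its log-probabilities, and the helper names (TOY_VOCAB, segmentations, sample_segmentation) are all made up for illustration. It assumes, as I understand the two numeric arguments above, that 64 is the number of best candidates to sample from and 0.1 is a smoothing factor applied to the scores.

```python
import math
import random

# Hypothetical toy vocabulary: piece -> log-probability (values are made up).
TOY_VOCAB = {
    "h": -4.0, "u": -4.0, "g": -4.0, "s": -4.0,
    "hu": -3.0, "ug": -2.5, "gs": -3.5,
    "hug": -2.0, "hugs": -7.0,
}


def segmentations(word, vocab):
    """Return every way to split `word` into pieces that exist in `vocab`."""
    if not word:
        return [[]]
    results = []
    for end in range(1, len(word) + 1):
        piece = word[:end]
        if piece in vocab:
            for rest in segmentations(word[end:], vocab):
                results.append([piece] + rest)
    return results


def sample_segmentation(word, vocab, nbest_size=64, alpha=0.1, rng=random):
    """Sample one of the n best segmentations, weighted by exp(alpha * score)."""
    # Score of a segmentation = sum of the log-probabilities of its pieces.
    scored = [(sum(vocab[p] for p in seg), seg) for seg in segmentations(word, vocab)]
    scored.sort(key=lambda item: item[0], reverse=True)
    best = scored[:nbest_size]
    # A small alpha flattens the distribution, so unlikely splits appear more often.
    weights = [math.exp(alpha * score) for score, _ in best]
    return rng.choices([seg for _, seg in best], weights=weights, k=1)[0]
```

Calling sample_segmentation("hugs", TOY_VOCAB) repeatedly returns different splits of the same word, while nbest_size=1 always returns the single most likely split, which is what the non-sampling branch corresponds to.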

However, when I looked at where _tokenize is used, it appears to be called only without the "sample" argument:

        def split_on_tokens(tok_list, text):
            if not text.strip():
                return []
            if not tok_list:
                return self._tokenize(text)

I tried calling the tokenizer's encode() method with sample=True, but the keyword was not recognized:

    from transformers import AlbertTokenizer

    albert_tokenizer = AlbertTokenizer('m.model')
    albert_tokenizer.encode(msg, sample=True)

    >> Keyword arguments {'sample': True} not recognized.

How should sampling be enabled when using a unigram tokenizer in Hugging Face?

I have also opened a GitHub issue about this.
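
One possible workaround, sketched here rather than confirmed by the library maintainers, is to bypass encode() and call the underlying SentencePiece model directly. The helper name sample_encode is hypothetical. It assumes the tokenizer is a slow, SentencePiece-backed one that exposes sp_model (as in the _tokenize source quoted earlier) and convert_tokens_to_ids. Note that it skips preprocess_text and does not add special tokens.

```python
def sample_encode(tokenizer, text, nbest_size=64, alpha=0.1):
    """Encode `text` to ids using a sampled SentencePiece segmentation.

    Assumption: `tokenizer.sp_model` is a SentencePieceProcessor, as in the
    `_tokenize` source quoted in the question. Text preprocessing and special
    tokens are deliberately left out of this sketch.
    """
    # Same call that `_tokenize` makes when sample=True.
    pieces = tokenizer.sp_model.SampleEncodeAsPieces(text, nbest_size, alpha)
    # Map the sampled pieces to vocabulary ids.
    return tokenizer.convert_tokens_to_ids(pieces)


# Usage (requires a real SentencePiece model file, here the 'm.model' from the question):
#   from transformers import AlbertTokenizer
#   albert_tokenizer = AlbertTokenizer('m.model')
#   ids = sample_encode(albert_tokenizer, msg)
```

Newer transformers releases also appear to accept an sp_model_kwargs constructor argument (for example enable_sampling, nbest_size, alpha) on the SentencePiece-based tokenizers, so it is worth checking the documentation for the installed version before relying on the helper above.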


huggingface-tokenizers huggingface-transformers python