SentencePiece in Google Colab

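Problem description

I want to use SentencePiece (https://github.com/google/sentencepiece) in a Google Colab project where I am training an OpenNMT model. I'm a bit confused about how to set up the SentencePiece binaries in Google Colab. Do I need to build them with cmake?

When I install it with pip install sentencepiece and include sentencepiece in the "transforms" of my script, I get the error below.

After running this command (adapted from the OpenNMT translation tutorial):

!onmt_build_vocab -config en-sp.yaml -n_sample -1

I get: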

Traceback (most recent call last):
  File "/usr/local/bin/onmt_build_vocab", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.7/dist-packages/onmt/bin/build_vocab.py", line 63, in main
    build_vocab_main(opts)
  File "/usr/local/lib/python3.7/dist-packages/onmt/bin/build_vocab.py", line 32, in build_vocab_main
    transforms = make_transforms(opts, transforms_cls, fields)
  File "/usr/local/lib/python3.7/dist-packages/onmt/transforms/transform.py", line 176, in make_transforms
    transform_obj.warm_up(vocabs)
  File "/usr/local/lib/python3.7/dist-packages/onmt/transforms/tokenize.py", line 110, in warm_up
    load_src_model.Load(self.src_subword_model)
  File "/usr/local/lib/python3.7/dist-packages/sentencepiece/__init__.py", line 367, in Load
    return self.LoadFromFile(model_file)
  File "/usr/local/lib/python3.7/dist-packages/sentencepiece/__init__.py", line 171, in LoadFromFile
    return _sentencepiece.SentencePieceProcessor_LoadFromFile(self, arg)
TypeError: not a string

Below is how my config is written. I'm not sure where the non-string is coming from.

## Where the samples will be written
save_data: en-sp/run/example

## Where the vocab(s) will be written
src_vocab: en-sp/run/example.vocab.src
tgt_vocab: en-sp/run/example.vocab.tgt

## Where the model will be saved
save_model: drive/MyDrive/Europarl/model/model

# Prevent overwriting existing files in the folder
overwrite: False

# Corpus opts:
data:
    europarl:
        path_src: train_europarl-v7.es-en.es
        path_tgt: train_europarl-v7.es-en.en
        transforms: [sentencepiece,filtertoolong]
        weight: 1

    valid:
        path_src: dev_europarl-v7.es-en.es
        path_tgt: dev_europarl-v7.es-en.en
        transforms: [sentencepiece]

skip_empty_level: silent

world_size: 1
gpu_ranks: [0]
...

Edit: I kept googling and found a Google Colab project that builds SentencePiece with cmake: https://colab.research.google.com/github/mymusise/gpt2-quickly/blob/main/examples/gpt2_quickly.ipynb#scrollTo=dDAup5dxDXZW. However, even after building with cmake, I still get the same error.
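
The last frames of the traceback point at the argument passed to Load (self.src_subword_model) rather than at the installation, which would be consistent with the cmake build making no difference. "not a string" is what one would expect if that value is unset. Below is a minimal sketch of a check for that, assuming the transform takes its model paths from config keys named src_subword_model and tgt_subword_model (the first name appears in the traceback, the second is assumed by analogy; verify both against your OpenNMT-py version). The helper missing_subword_models is hypothetical and not part of OpenNMT-py, and the dict only mirrors the part of the config shown above.

```python
# Hypothetical helper: not part of OpenNMT-py or SentencePiece.
# Assumed option names; check them against your OpenNMT-py version.
SUBWORD_KEYS = ("src_subword_model", "tgt_subword_model")


def missing_subword_models(config):
    """Return the subword-model keys that are absent or not strings."""
    uses_sentencepiece = any(
        "sentencepiece" in corpus.get("transforms", [])
        for corpus in config.get("data", {}).values()
    )
    if not uses_sentencepiece:
        return []
    return [key for key in SUBWORD_KEYS if not isinstance(config.get(key), str)]


# The visible part of the config from the question, written as a plain dict.
config = {
    "data": {
        "europarl": {"transforms": ["sentencepiece", "filtertoolong"]},
        "valid": {"transforms": ["sentencepiece"]},
    },
}

missing = missing_subword_models(config)
print(missing)
```

If the list is not empty, the transform has no model file to load, so the fix would be to train a SentencePiece model first and point those keys at it.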

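Solution

To fix the issue, I had to filter and tokenize my dataset and then train a SentencePiece model on it. I used the scripts from this helpful source, https://github.com/ymoslem/MT-Preparation, to do all of that, and now my model is training!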
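
For readers who want to see roughly what the "train a SentencePiece model" step involves, here is a minimal sketch using the sentencepiece Python package. It is not the MT-Preparation code. The vocabulary size, the model prefixes "source" and "target", and the config key names src_subword_model and tgt_subword_model are assumptions for illustration, and the helper functions are hypothetical.

```python
from pathlib import Path


def trainer_kwargs(corpus_path, model_prefix, vocab_size=32000):
    """Keyword arguments for sentencepiece.SentencePieceTrainer.train."""
    return {
        "input": str(corpus_path),          # one sentence per line
        "model_prefix": str(model_prefix),  # writes <prefix>.model and <prefix>.vocab
        "vocab_size": vocab_size,           # illustrative value, tune for your data
        "model_type": "unigram",            # SentencePiece's default model type
    }


def train_subword_model(corpus_path, model_prefix, vocab_size=32000):
    """Train a SentencePiece model and return the path of the .model file."""
    import sentencepiece as spm  # installed with: pip install sentencepiece

    spm.SentencePieceTrainer.train(**trainer_kwargs(corpus_path, model_prefix, vocab_size))
    return Path(f"{model_prefix}.model")


def subword_config_lines(src_model, tgt_model):
    """Lines to add to the OpenNMT config (key names are assumptions)."""
    return [f"src_subword_model: {src_model}", f"tgt_subword_model: {tgt_model}"]


def main():
    # File names are taken from the config in the question.
    src = train_subword_model("train_europarl-v7.es-en.es", "source")
    tgt = train_subword_model("train_europarl-v7.es-en.en", "target")
    print("\n".join(subword_config_lines(src, tgt)))
```

Calling main() in a Colab cell, after pip install sentencepiece, would train one model per language and print the two lines to add to the YAML config. No cmake build is involved in this route, since the pip package ships the trainer.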

google-colaboratory machine-translation tokenize