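Problem description
I am currently using Kaldi and SRILM for research, and I ran into a strange problem when using ngram-merge to merge the 3-gram .count files produced by ngram-count. (ngram-count and ngram-merge are two SRILM tools.)
The code in my shell script looks like this: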
ngram-merge \
-write $dir_ngram/corpus_${ng}-gram.count \
$dir_ngram/glsp_poj_tlu.txt_${ng}-gram.count /
$dir_ngram/icorpus_tlu.txt_${ng}-gram.count /
$dir_ngram/khkp_tlu.txt_${ng}-gram.count /
$dir_ngram/nmtl_tlu.txt_${ng}-gram.count /
$dir_ngram/total_tlu.txt_${ng}-gram.count /
$dir_ngram/twbb_tlu.txt_${ng}-gram.count
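$dir_ngram is just the directory that holds the .count files, and ${ng} is 3 here because I use trigrams in my language model. Running the script produces the following errors: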
/kaldi/egs/simple_20190520/source/ngram/icorpus_tlu.txt_3-gram.count: line 1: unk: No such file or directory
/kaldi/egs/simple_20190520/source/ngram/icorpus_tlu.txt_3-gram.count: line 2: Syntax error near unexpected token `<'
/kaldi/egs/simple_20190520/source/ngram/icorpus_tlu.txt_3-gram.count: line 2: `<unk> <unk> 11844000'
/kaldi/egs/simple_20190520/source/ngram/khkp_tlu.txt_3-gram.count: line 1: unk: No such file or directory
/kaldi/egs/simple_20190520/source/ngram/khkp_tlu.txt_3-gram.count: line 2: Syntax error near unexpected token `<'
/kaldi/egs/simple_20190520/source/ngram/khkp_tlu.txt_3-gram.count: line 2: `<unk> <unk> 449400'
/kaldi/egs/simple_20190520/source/ngram/nmtl_tlu.txt_3-gram.count: line 1: unk: No such file or directory
/kaldi/egs/simple_20190520/source/ngram/nmtl_tlu.txt_3-gram.count: line 2: Syntax error near unexpected token `<'
/kaldi/egs/simple_20190520/source/ngram/nmtl_tlu.txt_3-gram.count: line 2: `<unk> <unk> 13706200'
/kaldi/egs/simple_20190520/source/ngram/total_tlu.txt_3-gram.count: line 1: unk: No such file or directory
/kaldi/egs/simple_20190520/source/ngram/total_tlu.txt_3-gram.count: line 2: Syntax error near unexpected token `<'
/kaldi/egs/simple_20190520/source/ngram/total_tlu.txt_3-gram.count: line 2: `<unk> <unk> 11155390'
/kaldi/egs/simple_20190520/source/ngram/twbb_tlu.txt_3-gram.count: line 1: unk: No such file or directory
/kaldi/egs/simple_20190520/source/ngram/twbb_tlu.txt_3-gram.count: line 2: Syntax error near unexpected token `<'
/kaldi/egs/simple_20190520/source/ngram/twbb_tlu.txt_3-gram.count: line 2: `<unk> <unk> 7575840'
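It looks as if ngram-merge treats the first line of each file as a file name or directory, because the unk token is on the first line of every .count file (take icorpus_tlu.txt_3-gram.count as an example):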
<unk> 21952800
<unk> <unk> 11844000
<unk> <unk> <unk> 6161460
<unk> <unk> pó-tshî 660
<unk> <unk> pe̍h-liáu-kang 60
<unk> <unk> m̄-sī 3840
<unk> <unk> lîu-hîng 540
<unk> <unk> ē-sái 12900
<unk> <unk> uî-huat 1740
<unk> <unk> kín-tiunn 780
<unk> <unk> tâi-tiong-tshī 840
<unk> <unk> kuī 120
<unk> <unk> tsú-lâng 660
<unk> <unk> tsi̍t 38520
.
.
.
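The unk token and the second line of the .count file show up in the first and third lines of the error messages. I don't understand why this happens, because I thought ngram-merge would simply open each file and start reading n-grams, rather than treating the contents as a directory to open. Another odd thing is that this "contents treated as a directory" problem only appears for the last five files. The first file does not seem to have any reading or directory problem at all.
I know I could simply merge the corpora themselves, since none of them is very large, but I am curious about this problem. Does anyone know how to solve it?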
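One observation on the errors quoted in the question: the `file: line N: ...` format is what bash prints when it runs a file as a script, not what a program typically prints while reading a data file. The sketch below reproduces messages of the same shape by handing a two-line count file directly to bash. The temporary directory and the file name `demo.count` are made up for illustration; only the two data lines come from the question.

```shell
# Hypothetical scratch file holding the first two lines of a .count file.
tmp=$(mktemp -d)
printf '%s\n' '<unk> 21952800' '<unk> <unk> 11844000' > "$tmp/demo.count"

# Ask bash to execute the data file as if it were a script.
# LC_ALL=C keeps the error messages in English; "|| true" ignores the failure status.
errors=$(cd "$tmp" && LC_ALL=C bash demo.count 2>&1) || true
printf '%s\n' "$errors"

# Clean up the scratch directory.
rm -rf "$tmp"
```

On line 1, bash reads `<unk` as "take input from a file named unk", which does not exist. On line 2, the `>` is followed directly by another `<`, which bash rejects as a syntax error. This suggests the count files are being executed by the shell rather than read by ngram-merge.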
Solution
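A likely cause, judging from the script as posted: the lines that list the input files end in `/` instead of `\`. Only a trailing backslash continues a shell command onto the next line, so ngram-merge would receive just the first count file (plus a stray `/`), and every following line would be run as a command of its own, meaning the shell tries to execute the remaining count files. Below is a sketch of the same call with backslashes and quoted paths. The default values for `dir_ngram` and `ng`, the helper `merge_counts`, and the `DRY_RUN` switch are assumptions added for illustration; the ngram-merge arguments are the ones from the question.

```shell
# Assumed defaults for illustration; the original script sets these itself.
dir_ngram=${dir_ngram:-/kaldi/egs/simple_20190520/source/ngram}
ng=${ng:-3}

# Hypothetical safety switch: print the command unless DRY_RUN=0 is set.
if [ "${DRY_RUN:-1}" = 1 ]; then run=echo; else run=; fi

# Hypothetical helper wrapping the merge call from the question.
merge_counts() {
  # Each continued line ends with a backslash as its very last character.
  $run ngram-merge \
    -write "$dir_ngram/corpus_${ng}-gram.count" \
    "$dir_ngram/glsp_poj_tlu.txt_${ng}-gram.count" \
    "$dir_ngram/icorpus_tlu.txt_${ng}-gram.count" \
    "$dir_ngram/khkp_tlu.txt_${ng}-gram.count" \
    "$dir_ngram/nmtl_tlu.txt_${ng}-gram.count" \
    "$dir_ngram/total_tlu.txt_${ng}-gram.count" \
    "$dir_ngram/twbb_tlu.txt_${ng}-gram.count"
}

merge_counts
```

Run as is, this only prints the assembled command; setting `DRY_RUN=0` executes it. If the problem persists after the change, it is worth checking for a trailing space after any backslash, since that also breaks the continuation.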