[tesseract-ocr][Original] Error when training an LSTM model with tesseract: LSTM: Training - Error msg - Encoding of string failed!

Cause of the error:

See the page "TrainingTesseract 4.00" on the tesseract-ocr/tesseract Wiki on GitHub, which explains:

Encoding of string failed! results when the text string for a training image
cannot be encoded using the given unicharset.

Possible causes are:

- There is an un-represented character in the text, say a British Pound sign that is not in your unicharset.

- A stray unprintable character (like tab or a control character) in the text.

- There is an un-represented Indic grapheme/aksara in the text.

In any case it will result in that training image being ignored by the trainer.

If the error is infrequent, it is harmless, but it may indicate that your unicharset is inadequate for representing the language that you are training.
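To find out which of these causes applies, it can help to scan the ground-truth text before training. The following is only a rough sketch, not part of the Tesseract tooling: missing_chars and control_char_lines are hypothetical helper names, and the sketch assumes the ground truth is plain UTF-8 text, that the shell runs in a UTF-8 locale, and that every line of the unicharset after the first (the entry count) starts with the character itself followed by a space.

```shell
# Hypothetical helper: list characters that occur in the ground-truth text
# but do not appear as an entry in the unicharset.
# Usage: missing_chars <unicharset> <text-file>...
# Assumes a UTF-8 locale so that `grep -o .` splits on characters, not bytes.
missing_chars() {
  mc_unicharset=$1
  shift
  mc_known=$(mktemp)
  # Line 1 of a unicharset is the entry count; on every later line the
  # first space-separated field is the character (space itself is "NULL").
  tail -n +2 "$mc_unicharset" | cut -d' ' -f1 | sort -u > "$mc_known"
  # One character per line, spaces dropped, then keep only the unknown ones.
  cat "$@" | tr -d ' ' | grep -o . | sort -u | grep -vxF -f "$mc_known"
  rm -f "$mc_known"
}

# Hypothetical helper: print "line-number:line" for every line that
# contains a tab or another control character.
# Usage: control_char_lines <text-file>
control_char_lines() {
  grep -n '[[:cntrl:]]' "$@"
}
```

Because the check works one character at a time, it may report false positives for scripts whose unicharset entries are multi-character graphemes, such as the Indic case mentioned above.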

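In short, the message means that your training data contains characters that are not in the character set (unicharset). When fine-tuning, you usually do not build your own unicharset, so the one that comes with the base model may simply not contain the characters in your custom data. Training lines with such characters are generally ignored, so training itself is not affected, but the resulting model will not be able to recognize those characters. The fix is to specify a suitable unicharset at training time with -U: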

mkdir -p ~/tesstutorial/tellayer_from_tel

combine_tessdata -e ../tessdata/tel.traineddata \
~/tesstutorial/tellayer_from_tel/tel.lstm

lstmtraining -U ~/tesstutorial/tel/tel.unicharset \
--script_dir ../langdata --debug_interval 0 \
--continue_from ~/tesstutorial/tellayer_from_tel/tel.lstm \
--append_index 5 --net_spec '[Lfx256 O1c105]' \
--model_output ~/tesstutorial/tellayer_from_tel/tellayer \
--train_listfile ~/tesstutorial/tel/tel.training_files.txt \
--target_error_rate 0.01

How do you generate the unicharset?

Use the following commands:

unicharset_extractor --output_unicharset chi_sim.unicharset --norm_mode 1 FIRC.box

set_unicharset_properties -U chi_sim.unicharset -O chi_sim.unicharset --script_dir ./
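After generating the file, a quick sanity check can confirm that a character you care about actually ended up in it. This is a minimal sketch under the assumption that the first line of the unicharset holds the entry count and that each later line begins with the character followed by a space; unicharset_has and unicharset_size are hypothetical helper names, not Tesseract commands.

```shell
# Hypothetical helper: succeed (exit status 0) if the character given as
# $1 is an entry in the unicharset file given as $2.
unicharset_has() {
  # Skip the count line, take the first field of each entry, match exactly.
  tail -n +2 "$2" | cut -d' ' -f1 | grep -qxF -- "$1"
}

# Hypothetical helper: print the entry count stored on the first line.
unicharset_size() {
  head -n 1 "$1"
}
```

For example, `unicharset_has '£' chi_sim.unicharset && echo present` would print "present" only if the pound sign is listed as an entry.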

References:

How to use the existing tools to train Tesseract 3.03–3.05 to recognize a new language, Wordsky's blog, CSDN

https://github.com/tesseract-ocr/tesseract/issues/549
