Cannot use equ.traineddata with tesseract: it throws an error, but hin, ben and eng work fine

Problem description

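I have tesseract installed at /usr/share/tesseract-ocr/, with tessdata at /usr/share/tesseract-ocr/4.0/tessdata, and it works fine. Because equ.traineddata was not provided with the original installation, I downloaded it from the official documentation and placed it at /usr/share/tesseract-ocr/4.0/tessdata/equ.traineddata. I also added hin, ben and several other language files. When I use -l eng+hin+ben it works fine, but as soon as equ is included it throws an error. I am also using PyTesseract with some configuration, such as: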

# making a copy of tessdata dir in the home
cli_config = '--oem 1 --psm 12 --tessdata-dir ~/tessdata/ -l eng+equ+ben+hin'
ocr.image_to_string(image=img_path,config=cli_config)

and

cli_config = '--oem 1 --psm 12' # tessdata is at default location too
ocr.image_to_string(image=img_path,config=cli_config,lang='eng+equ+hin+ben')

But it always throws an error, and only for equ, for example:

TesseractError                            Traceback (most recent call last)
<ipython-input-30-8529ae8e51e8> in <module>
----> 1 ocr.image_to_string(image=img_path,lang='equ')

~/anaconda3/envs/py36/lib/python3.6/site-packages/PyTesseract/PyTesseract.py in image_to_string(image,lang,config,nice,output_type,timeout)
    356         Output.DICT: lambda: {'text': run_and_get_output(*args)},
    357         Output.STRING: lambda: run_and_get_output(*args),
--> 358     }[output_type]()
    359 
    360 

~/anaconda3/envs/py36/lib/python3.6/site-packages/PyTesseract/PyTesseract.py in <lambda>()
    355         Output.BYTES: lambda: run_and_get_output(*(args + [True])),
    356         Output.DICT: lambda: {'text': run_and_get_output(*args)},
--> 357         Output.STRING: lambda: run_and_get_output(*args),
    358     }[output_type]()
    359 

~/anaconda3/envs/py36/lib/python3.6/site-packages/PyTesseract/PyTesseract.py in run_and_get_output(image,extension,timeout,return_bytes)
    264         }
    265 
--> 266         run_tesseract(**kwargs)
    267         filename = kwargs['output_filename_base'] + extsep + extension
    268         with open(filename,'rb') as output_file:

~/anaconda3/envs/py36/lib/python3.6/site-packages/PyTesseract/PyTesseract.py in run_tesseract(input_filename,output_filename_base,timeout)
    240     with timeout_manager(proc,timeout) as error_string:
    241         if proc.returncode:
--> 242             raise TesseractError(proc.returncode,get_errors(error_string))
    243 
    244 

TesseractError: (1,'Error opening data file /home/deshwal/anaconda3/envs/py36/share/tessdata/equ.traineddata Please make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory. Failed loading language \'equ\' Tesseract Couldn\'t load any languages! Could not initialize tesseract.')

What could be the reason for this? How can I use equ.traineddata?

Solution

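equ is legacy language data, so you need to use an appropriate --oem value. For example, --oem 0 selects the legacy engine only, whereas the --oem 1 used in the question selects the LSTM engine only. Run tesseract --help-extra to see the available values.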
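
As a rough illustration of the advice above, the following sketch builds a config string that requests the legacy engine and passes it to pytesseract. The helper names build_tesseract_config and ocr_equations are hypothetical, not part of pytesseract, and the choice of --oem 0 assumes that the installed equ.traineddata contains legacy data only. The sketch also expands ~ in the tessdata directory itself, on the assumption that pytesseract starts tesseract without a shell, so the shell would not expand it.

```python
import os
import shlex


def build_tesseract_config(oem, psm, tessdata_dir=None):
    """Build a config string for pytesseract (hypothetical helper).

    oem: OCR engine mode; 0 is the legacy engine, which legacy-only
         data such as equ is assumed to need.
    psm: page segmentation mode, 12 in the question.
    tessdata_dir: optional directory holding the *.traineddata files.
    """
    parts = ["--oem", str(oem), "--psm", str(psm)]
    if tessdata_dir is not None:
        # Expand "~" here, since no shell is assumed to do it for us.
        expanded = os.path.abspath(os.path.expanduser(tessdata_dir))
        parts += ["--tessdata-dir", shlex.quote(expanded)]
    return " ".join(parts)


def ocr_equations(image_path, tessdata_dir=None):
    """Run OCR with the equ data and the legacy engine (hypothetical helper)."""
    # Imported lazily so the config helper works without pytesseract installed.
    import pytesseract

    config = build_tesseract_config(oem=0, psm=12, tessdata_dir=tessdata_dir)
    return pytesseract.image_to_string(image_path, lang="equ", config=config)
```

A call might look like ocr_equations("formula.png", tessdata_dir="/usr/share/tesseract-ocr/4.0/tessdata"), where formula.png is a placeholder file name. If the error about opening equ.traineddata persists, the path in the message shows which tessdata directory tesseract actually searched.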