Tesseract fails to accurately extract text from scanned PDF documents

Problem description

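I have been developing an application that extracts text, numbers, and other data from scanned PDF documents and converts it into an editable .docx file. After converting the PDF pages to images, I can detect and crop the text regions successfully with OpenCV; the problem is Tesseract's output. For example, where the image reads "OFF REC 0617 PAGE 0677" and "Inst N0: 980118843", Tesseract returns "Rec 17 Space Q 677" and "Isst N: 98118043" respectively, and the same kind of error affects all of the text and numeric data. I have tried several image preprocessing methods (erode, dilate, resampling, morphological operations, blur, binarization, and so on), but the accuracy of the result is still low. I also tried cropping the text character by character with OpenCV, using contour bounding boxes, but the extracted text still does not match the scanned content. In addition, data located near signatures is not extracted correctly.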
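To compare preprocessing variants with a number rather than by eye, the mismatch between expected and recognized strings can be scored. The following is a minimal sketch using only the Python standard library; the helper name `char_similarity` and the `samples` list are illustrative and not part of the original code, and the first expected string assumes the page marker reads "PAGE".

```python
from difflib import SequenceMatcher


def char_similarity(expected: str, actual: str) -> float:
    """Return a 0..1 similarity ratio between two strings (1.0 means identical)."""
    return SequenceMatcher(None, expected, actual).ratio()


# Expected/recognized pairs taken from the question text
# (the "PAGE" wording in the first expected string is an assumption).
samples = [
    ("OFF REC 0617 PAGE 0677", "Rec 17 Space Q 677"),
    ("Inst N0: 980118843", "Isst N: 98118043"),
]

for expected, actual in samples:
    score = char_similarity(expected, actual)
    print(f"{expected!r} vs {actual!r}: {score:.2f}")
```

Running the same scoring after each preprocessing change shows whether a given step helps or hurts on known samples.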

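I suspect that image preprocessing alone will not give a better result, so something needs to change on the Tesseract OCR side.

Please suggest how I can improve the quality of Tesseract's output.

Below are the approaches I have tried so far.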

Sample image

Approach 1:

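import cv2
import numpy as np
import pytesseract

image = cv2.imread('image.jpg')

# 3x3 structuring element shared by the morphological steps (size is an assumption)
kernel = np.ones((3, 3), np.uint8)

# Morphological opening to remove small specks
opening = cv2.morphologyEx(image, cv2.MORPH_OPEN, kernel, iterations=2)
cv2.imwrite('opened.jpg', opening)

# Erosion and dilation of the original image, saved for comparison
eroded = cv2.erode(image, kernel, iterations=2)
cv2.imwrite('eroded.jpg', eroded)
dilated = cv2.dilate(image, kernel, iterations=2)
cv2.imwrite('dilated.jpg', dilated)

# Light blur of the dilated image
blurred = cv2.blur(dilated, (2, 2))
cv2.imwrite('blr.jpg', blurred)

# 2x upscale, written to disk; the OCR call below still reads the blurred image
resized = cv2.resize(blurred, None, fx=2, fy=2, interpolation=cv2.INTER_LINEAR)
cv2.imwrite('resized.jpg', resized)

# --psm 4: assume a single column of text of variable sizes
text = pytesseract.image_to_string(blurred, config='--psm 4')
print(text)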

Approach 2:

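import cv2
import numpy as np
import pytesseract

img = cv2.imread('image.jpg', 0)  # load as grayscale
ret, thresh_value = cv2.threshold(img, 180, 255, cv2.THRESH_BINARY_INV)

# Wide, flat kernel so that dilation joins neighbouring characters into blobs
kernel = np.ones((1, 5), np.uint8)
dilated_value = cv2.dilate(thresh_value, kernel, iterations=1)
# OpenCV 4.x returns (contours, hierarchy)
contours, hierarchy = cv2.findContours(dilated_value, cv2.RETR_TREE, cv2.CHAIN_APPROX_SIMPLE)

annotated = img.copy()  # draw on a copy so the boxes do not leak into later crops
text = []
for i, cnt in enumerate(contours):
    # Keep only top-level contours (those without a parent)
    if hierarchy[0, i, 3] == -1:
        rect = cv2.minAreaRect(cnt)
        box = cv2.boxPoints(rect).astype(int)  # rotated box corners (not used below)
        x, y, w, h = cv2.boundingRect(cnt)

        crop = img[y:y + h, x:x + w]

        # --psm 6: assume a single uniform block of text
        tes = pytesseract.image_to_string(crop, lang='eng', config='--psm 6')
        text.append(tes.lower())

        cv2.rectangle(annotated, (x, y), (x + w, y + h), (0, 0, 0), 1)

# Save the annotated image once, after all boxes are drawn
cv2.imwrite('res_bb.png', annotated)
print(text)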

Solution

No effective solution to this problem has been found yet.
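One Tesseract-side direction that may be worth testing, offered as an unverified sketch rather than a confirmed fix, is to vary the page segmentation mode (`--psm`), the engine mode (`--oem`), and a character whitelist for fields known to contain only digits. The helper `build_tesseract_config` below is hypothetical and only assembles the config string; `--psm`, `--oem`, and `-c tessedit_char_whitelist=` are existing Tesseract options, but whether the whitelist is honored may depend on the Tesseract version and engine in use.

```python
def build_tesseract_config(psm: int = 6, oem: int = 3, whitelist: str = "") -> str:
    """Assemble a Tesseract config string (hypothetical helper, not a library API)."""
    parts = [f"--oem {oem}", f"--psm {psm}"]
    if whitelist:
        # tessedit_char_whitelist restricts the characters Tesseract may output.
        parts.append(f"-c tessedit_char_whitelist={whitelist}")
    return " ".join(parts)


# Digits-only config for a crop containing a single line, e.g. an instrument number.
# --psm 7 tells Tesseract to treat the image as a single text line.
digits_config = build_tesseract_config(psm=7, whitelist="0123456789")
print(digits_config)

# Assumed usage with pytesseract (requires the Tesseract binary to be installed);
# `cropped_line` stands for an image crop produced by the contour step:
# text = pytesseract.image_to_string(cropped_line, config=digits_config)
```

Keeping the config in one place makes it easier to try several `--psm` values on the same crops and compare the results.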

ocr opencv pdf-extraction python tesseract