Python OCR function reduces the image size: how do I fix this?

Problem description

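I am walking through a folder looking for PDFs, then converting those PDFs to text. Before passing an image through the OCR function, I do some image processing: I convert the image to grayscale and crop it so that certain aesthetic elements are removed. The first page of each PDF differs slightly from the second through last pages, so each PDF page is routed through an if-else statement.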

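Passing the first JPEG through the OCR function works perfectly across different documents, but every time I pass a later JPEG through the OCR function, it only passes the first document image again. It creates the second, third, ... JPEGs, but only the first JPEG is passed to the function. I have been trying to debug this all morning, so please excuse all the extra information. Any help would be greatly appreciated.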

Below is the output from passing the images through the OCR functions.

executing first page number loop
(3000,2064)
(2064,2064)
executing this chunky piece of code
<class 'PIL.PpmImagePlugin.PpmImageFile'>
jpegs/file_1.jpeg
(1714,2064)
executing this chunky piece of code
<class 'PIL.PpmImagePlugin.PpmImageFile'>
jpegs/file_2.jpeg
(1714,2064)
executing this chunky piece of code
<class 'PIL.PpmImagePlugin.PpmImageFile'>
jpegs/file_3.jpeg
(1714,2064)
executing this chunky piece of code
<class 'PIL.PpmImagePlugin.PpmImageFile'>
jpegs/file_4.jpeg
(1714,2064)
executing this chunky piece of code
<class 'PIL.PpmImagePlugin.PpmImageFile'>
jpegs/file_5.jpeg
(1714,2064)
executing this chunky piece of code
<class 'PIL.PpmImagePlugin.PpmImageFile'>
jpegs/file_6.jpeg
(1714,2064)
executing this chunky piece of code
<class 'PIL.PpmImagePlugin.PpmImageFile'>
jpegs/file_7.jpeg
(1714,2064)
executing this chunky piece of code
<class 'PIL.PpmImagePlugin.PpmImageFile'>
jpegs/file_8.jpeg
(1714,2064)
executing this chunky piece of code
<class 'PIL.PpmImagePlugin.PpmImageFile'>
jpegs/file_9.jpeg
(1714,2064)
executing this chunky piece of code
<class 'PIL.PpmImagePlugin.PpmImageFile'>
jpegs/file_10.jpeg
(1714,2064)
11
0
executing first page number loop
(3000,2064)
executing this chunky piece of code
<class 'PIL.PpmImagePlugin.PpmImageFile'>
jpegs/file_12.jpeg
(1714,2064)
executing this chunky piece of code
<class 'PIL.PpmImagePlugin.PpmImageFile'>
jpegs/file_13.jpeg
(1714,2064)
executing this chunky piece of code
<class 'PIL.PpmImagePlugin.PpmImageFile'>
jpegs/file_14.jpeg
(1714,2064)
executing this chunky piece of code
<class 'PIL.PpmImagePlugin.PpmImageFile'>
jpegs/file_15.jpeg
(1714,2064)
executing this chunky piece of code
<class 'PIL.PpmImagePlugin.PpmImageFile'>
jpegs/file_16.jpeg
(1714,2064)
6
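0

import os

# convert_from_path is assumed to come from pdf2image (the log above shows
# PIL PpmImageFile pages, which is what that function returns).
from pdf2image import convert_from_path

# image_2_gray, crop_page_1, crop_page_2_through_end, image_ocr and image_ocr2
# are my own helpers; their definitions are not shown here.

article_number = 0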
saved_image_num = 0
text_file = 'txt_files/' + 'article'

print(saved_image_num)


# Walk the 'articles' folder and process every PDF found.
for root, dirs, files in os.walk('articles'):
    for file_ in files:
        if file_.endswith('.pdf'):
            article_path = os.path.join(root, file_)
            # One PIL image per page, rendered at 300 dpi.
            pages = convert_from_path(article_path, dpi=300)
            length_of_article = len(pages)
            page_number = 0
            for page in pages:
                if page_number == 0:
                    # First page: save, convert to grayscale, apply the first-page crop, OCR.
                    print('executing first page number loop')
                    name = 'jpegs/file_' + str(saved_image_num) + '.jpeg'
                    page.save(name, 'JPEG')
                    saved_image_num += 1
                    page_number += 1
                    image = image_2_gray(name)
                    print(image.shape)
                    img = crop_page_1(image)
                    print(img.shape)
                    image_ocr(img, text_file + str(article_number) + '.txt')
                    # Single-page PDF: this was also the last page.
                    if page_number == length_of_article:
                        article_number += 1
                        print(page_number)
                        page_number = page_number - length_of_article
                        print(page_number)

                elif page_number >= 1:
                    # Pages 2 through the end: save, apply the later-page crop, OCR.
                    print('executing this chunky piece of code')
                    name_ = 'jpegs/file_' + str(saved_image_num) + '.jpeg'
                    page.save(name_, 'JPEG')
                    print(type(page))
                    saved_image_num += 1
                    page_number += 1
                    print(name_)
                    img1 = crop_page_2_through_end(name_)
                    print(img1.shape)
                    image_ocr2(img1, text_file + str(article_number) + '.txt')
                    # Last page of this PDF: move on to the next article's text file.
                    if page_number == length_of_article:
                        article_number += 1
                        print(page_number)
                        page_number = page_number - length_of_article
                        print(page_number)

Solution

My conditions were all messed up. Change the order of the final if condition.
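
The corrected code is not shown above. As one possible reading, the sketch below drops the manual page_number bookkeeping and lets enumerate supply the page index, so the first-page branch and the end-of-article step no longer depend on the order of the if checks. The function name plan_pages and the tuple layout are hypothetical, and the plain lists stand in for what convert_from_path returns; the real save, crop, and OCR calls would go where the comment indicates.

```python
def plan_pages(articles):
    """Yield (article_number, page_index, image_name, is_first_page) per page.

    `articles` is any iterable of page sequences, e.g. one list of PIL
    images per PDF. Hypothetical helper: it only decides the routing and
    does not save, crop, or OCR anything itself.
    """
    saved_image_num = 0
    for article_number, pages in enumerate(articles):
        for page_index, _page in enumerate(pages):
            image_name = 'jpegs/file_' + str(saved_image_num) + '.jpeg'
            saved_image_num += 1
            # Page 0 would go through image_2_gray + crop_page_1 + image_ocr;
            # every later page through crop_page_2_through_end + image_ocr2.
            yield article_number, page_index, image_name, page_index == 0


# Two stand-in articles with 3 and 2 pages.
for row in plan_pages([['a', 'b', 'c'], ['d', 'e']]):
    print(row)
```

Each tuple carries its own file name, so the path handed to the crop and OCR helpers is always the one just generated for that page, and the article number changes only when the outer loop moves to the next PDF.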