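Problem description
I am walking a folder looking for PDFs and converting each PDF to text. Before passing an image to the OCR function, I do some image processing: I convert the image to grayscale and crop it so that certain aesthetic elements are excluded. The first page of each PDF differs slightly from the second through last pages, so each page is routed through an if-else statement.
Passing the first JPEG through the OCR function works perfectly across different documents, but every time I pass a later JPEG through the OCR function, it just passes the first document image again. It creates the second, third, ... JPEGs, but only the first JPEG is passed to the function. I have been trying to debug this all morning, so please excuse all the extra information. Any help would be greatly appreciated.
Below is the output from running the pages through the OCR functions.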
executing first page number loop
(3000,2064)
(2064,2064)
executing this chunky piece of code
<class 'PIL.PpmImagePlugin.PpmImageFile'>
jpegs/file_1.jpeg
(1714,2064)
executing this chunky piece of code
<class 'PIL.PpmImagePlugin.PpmImageFile'>
jpegs/file_2.jpeg
(1714,2064)
executing this chunky piece of code
<class 'PIL.PpmImagePlugin.PpmImageFile'>
jpegs/file_3.jpeg
(1714,2064)
executing this chunky piece of code
<class 'PIL.PpmImagePlugin.PpmImageFile'>
jpegs/file_4.jpeg
(1714,2064)
executing this chunky piece of code
<class 'PIL.PpmImagePlugin.PpmImageFile'>
jpegs/file_5.jpeg
(1714,2064)
executing this chunky piece of code
<class 'PIL.PpmImagePlugin.PpmImageFile'>
jpegs/file_6.jpeg
(1714,2064)
executing this chunky piece of code
<class 'PIL.PpmImagePlugin.PpmImageFile'>
jpegs/file_7.jpeg
(1714,2064)
executing this chunky piece of code
<class 'PIL.PpmImagePlugin.PpmImageFile'>
jpegs/file_8.jpeg
(1714,2064)
executing this chunky piece of code
<class 'PIL.PpmImagePlugin.PpmImageFile'>
jpegs/file_9.jpeg
(1714,2064)
executing this chunky piece of code
<class 'PIL.PpmImagePlugin.PpmImageFile'>
jpegs/file_10.jpeg
(1714,2064)
11
0
executing first page number loop
(3000,2064)
executing this chunky piece of code
<class 'PIL.PpmImagePlugin.PpmImageFile'>
jpegs/file_12.jpeg
(1714,2064)
executing this chunky piece of code
<class 'PIL.PpmImagePlugin.PpmImageFile'>
jpegs/file_13.jpeg
(1714,2064)
executing this chunky piece of code
<class 'PIL.PpmImagePlugin.PpmImageFile'>
jpegs/file_14.jpeg
(1714,2064)
executing this chunky piece of code
<class 'PIL.PpmImagePlugin.PpmImageFile'>
jpegs/file_15.jpeg
(1714,2064)
executing this chunky piece of code
<class 'PIL.PpmImagePlugin.PpmImageFile'>
jpegs/file_16.jpeg
(1714,2064)
6
0
```
import os

# convert_from_path is assumed to come from pdf2image (the pages in the
# log are PIL PpmImageFile objects, which is what pdf2image returns).
from pdf2image import convert_from_path

# image_2_gray, crop_page_1, crop_page_2_through_end, image_ocr and
# image_ocr2 are helper functions defined elsewhere (not shown here).

article_number = 0
saved_image_num = 0
text_file = 'txt_files/' + 'article'
print(saved_image_num)
for root, dirs, files in os.walk('articles'):
    for file_ in files:
        if file_.endswith('.pdf'):
            article_path = str(root) + '/' + str(file_)
            pages = convert_from_path(article_path, dpi=300)
            length_of_article = len(pages)
            page_number = 0
            for page in pages:
                if page_number == 0:
                    # First page: grayscale, first-page crop, then OCR.
                    print('executing first page number loop')
                    name = 'jpegs/file_' + str(saved_image_num) + '.jpeg'
                    page.save(name, 'JPEG')
                    saved_image_num += 1
                    page_number += 1
                    image = image_2_gray(name)
                    print(image.shape)
                    img = crop_page_1(image)
                    print(img.shape)
                    image_ocr(img, text_file + str(article_number) + '.txt')
                    # Single-page article: move on and reset the counter.
                    if page_number == length_of_article:
                        article_number += 1
                        print(page_number)
                        page_number = page_number - length_of_article
                        print(page_number)
                elif page_number >= 1:
                    # Pages 2 through end: different crop, then OCR.
                    print('executing this chunky piece of code')
                    name_ = 'jpegs/file_' + str(saved_image_num) + '.jpeg'
                    page.save(name_, 'JPEG')
                    print(type(page))
                    saved_image_num += 1
                    page_number += 1
                    print(name_)
                    img1 = crop_page_2_through_end(name_)
                    print(img1.shape)
                    image_ocr2(img1, text_file + str(article_number) + '.txt')
                    # Last page of the article: move on and reset the counter.
                    if page_number == length_of_article:
                        article_number += 1
                        print(page_number)
                        page_number = page_number - length_of_article
                        print(page_number)
```
Solution
My conditions were all messed up. The fix was to change the order of the final if conditions.
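
For anyone hitting something similar, here is a minimal sketch of one way to avoid the manual `page_number` bookkeeping altogether. It is not the exact change described above, just an alternative that assumes the only goal is to treat page 0 of each PDF differently from the rest. The helper `plan_pages` and its parameters are hypothetical (they are not in the original code); it only works out which JPEG name, text file, and branch each page would get, so the real `page.save(...)`, cropping, and OCR calls would go where the comment indicates.

```python
def plan_pages(page_counts, jpeg_dir="jpegs", text_prefix="txt_files/article"):
    """Yield one job per page: (jpeg_name, text_name, is_first_page).

    page_counts is a list holding the number of pages of each PDF, in the
    order the PDFs are visited (a hypothetical stand-in for len(pages)).
    """
    saved_image_num = 0
    for article_number, n_pages in enumerate(page_counts):
        text_name = f"{text_prefix}{article_number}.txt"
        for page_number in range(n_pages):
            jpeg_name = f"{jpeg_dir}/file_{saved_image_num}.jpeg"
            saved_image_num += 1
            # page_number == 0 selects the first-page crop; every other
            # page gets the "page 2 through end" crop. No counter reset is
            # needed because range() restarts for every article.
            yield jpeg_name, text_name, page_number == 0


if __name__ == "__main__":
    # Page counts taken from the log in the question: 11 pages, then 6.
    for job in plan_pages([11, 6]):
        print(job)
```

With page counts of 11 and 6, as in the log above, the first-page branch is selected only for `jpegs/file_0.jpeg` and `jpegs/file_11.jpeg`, and the article number changes exactly when a new PDF starts.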