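Problem description
I have code that uses Tesseract OCR to extract text from scanned or regular PDF files. However, I want it to convert a whole folder of PDFs rather than a single file, and to store the extracted text files in a folder of my choice.
Please see the code below: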
from pdf2image import convert_from_path
from PIL import Image
import pytesseract

filePath = '/Users/CodingStark/scanned/scanned-file.pdf'
# Render every page of the PDF as an image at 500 DPI
pages = convert_from_path(filePath, 500)
image_counter = 1
# Iterate through all the pages stored above and save each one as a JPEG
for page in pages:
    filename = "page_" + str(image_counter) + ".jpg"
    page.save(filename, 'JPEG')
    image_counter = image_counter + 1
filelimit = image_counter - 1
# Create a text file to write the output
outfile = "scanned-file.txt"
f = open(outfile, "a")
# Iterate from 1 to the total number of pages
for i in range(1, filelimit + 1):
    filename = "page_" + str(i) + ".jpg"
    # Recognize the text in the image as a string using pytesseract
    text = str(pytesseract.image_to_string(Image.open(filename)))
    # Rejoin words that were hyphenated across line breaks
    text = text.replace('-\n', '')
    f.write(text)
# Close the file after writing all the text
f.close()
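I want to automate my code so that it converts all the PDF files in my scanned folder and puts the extracted text files in a folder that I choose. Also, is there a way to delete all the JPG files once the code has run? They take up a lot of space. Thank you so much!!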
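Solution
Here is a loop that reads the PDF files from a directory:
import glob
import os

pdf_dir = "dir"
# On Linux and macOS the pattern is case-sensitive, so "*.pdf" will not match "FILE.PDF"
for pdf_file in glob.glob(os.path.join(pdf_dir, "*.pdf")):
    # put here what you want to do for each pdf file
    print(pdf_file)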
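To show how the loop and the OCR code from the question might fit together, here is a minimal sketch, not a definitive implementation. It assumes pdf2image and pytesseract are installed as in the question. The function names find_pdfs, pdf_to_text and convert_folder are hypothetical helpers introduced for illustration. The sketch passes each page image straight to pytesseract instead of saving it first, so no JPG files are created and nothing needs deleting afterwards.

```python
import glob
import os


def find_pdfs(pdf_dir):
    """Return the PDF files in pdf_dir, matching the extension in any letter case."""
    paths = glob.glob(os.path.join(pdf_dir, "*"))
    return sorted(p for p in paths if os.path.isfile(p) and p.lower().endswith(".pdf"))


def pdf_to_text(pdf_path, dpi=500):
    """OCR one PDF and return its text. Pages stay in memory; no JPEGs are written."""
    # Imported here so the folder-handling helpers do not require the OCR stack.
    from pdf2image import convert_from_path
    import pytesseract

    parts = []
    for page in convert_from_path(pdf_path, dpi):
        # image_to_string accepts a PIL image directly, so no temporary file is needed.
        text = pytesseract.image_to_string(page)
        # Same clean-up as in the question: rejoin words hyphenated across lines.
        parts.append(text.replace("-\n", ""))
    return "".join(parts)


def convert_folder(pdf_dir, out_dir, extract=pdf_to_text):
    """Write one .txt file per PDF in pdf_dir into out_dir; return the paths written."""
    os.makedirs(out_dir, exist_ok=True)
    written = []
    for pdf_path in find_pdfs(pdf_dir):
        stem = os.path.splitext(os.path.basename(pdf_path))[0]
        txt_path = os.path.join(out_dir, stem + ".txt")
        # "w" overwrites an old result instead of appending to it on a re-run.
        with open(txt_path, "w", encoding="utf-8") as f:
            f.write(extract(pdf_path))
        written.append(txt_path)
    return written
```

A call could look like convert_folder('/Users/CodingStark/scanned', '/Users/CodingStark/text-output'), where the output folder name is only an example. Rendering every page at 500 DPI keeps large images in memory, so a lower dpi value may be worth trying for long documents. If you prefer to keep saving the pages as JPG files as in the question, calling os.remove(filename) after each page has been recognized deletes them again.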