Updating a .txt file name using the year from the subfolder

Problem description

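I'm trying to learn how to update the .txt file name when os.walk moves from the files in one directory to the files in another. I'm not sure how to do this. I tried iterating over the directories and then over the files, but that didn't work because the .pdf files never showed up. Here is the complete code I'm working with.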

The directory layout looks like this:

[Research]
    [2014] -> Article1.pdf, article2.pdf, article3.pdf
    [2015] -> Article4.pdf, article5.pdf, article6.pdf
    [2016] -> Article7.pdf, article8.pdf, article9.pdf

from PIL import Image
import pytesseract
from pdf2image import convert_from_path
import os


# Point pytesseract at the local Tesseract binary.
pytesseract.pytesseract.tesseract_cmd = r'/usr/local/Cellar/tesseract/4.1.1/bin/tesseract'

def image_ocr(image_path, output_txt_file_name, All_text):
    # OCR one page image (English + Czech) and append the text to both files.
    image_text = pytesseract.image_to_string(
        image_path, lang='eng+ces', config='--psm 1')
    with open(output_txt_file_name, 'a', encoding='utf-8') as f:
        f.write(image_text)
    # Open in append mode; the default read mode would make f.write() fail.
    with open(All_text, 'a', encoding='utf-8') as f:
        f.write(image_text)


num = 0
# Hard-coded for now; this is the value that should follow the year folder.
year = 1973

year_being_recorded = 'txt_files/' + str(year) + '_article.txt'
cumulative_text = 'txt_files/cumulative.txt'

for root, dirs, files in os.walk('articles'):
    for file_ in files:
        if file_.endswith('.pdf'):
            article_path = os.path.join(root, file_)
            # Render each PDF page to an image at 500 DPI.
            pages = convert_from_path(article_path, 500)
            for page in pages:
                name = 'jpegs/a_file_' + str(num) + '.jpeg'
                page.save(name, 'JPEG')
                image_ocr(name, year_being_recorded, cumulative_text)
                num = num + 1

Solution
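
One possible approach, sketched under the assumption that every PDF sits directly inside a folder named after its year (as in the layout above). On each iteration os.walk yields the folder it is currently visiting as root, while dirs only lists the names of that folder's subfolders. That is likely why looping over dirs and then files never reached the PDFs: the files of a year folder only appear once os.walk arrives at that folder as root. The year can therefore be read from os.path.basename(root) inside the loop. The helper name pdfs_with_year_file and its out_dir parameter are hypothetical, introduced only for illustration.

```python
import os


def pdfs_with_year_file(base_dir, out_dir='txt_files'):
    """Yield (pdf_path, year_txt_path) for every PDF under base_dir.

    Assumption: each PDF lives in a folder whose name is the year,
    e.g. articles/2014/Article1.pdf.
    """
    for root, dirs, files in os.walk(base_dir):
        dirs.sort()  # visit the year folders in a predictable order
        year = os.path.basename(root)  # name of the folder being walked
        if not year.isdigit():
            continue  # skip base_dir itself and any non-year folder
        # Recomputed for every folder, so the name changes with the year.
        year_txt = os.path.join(out_dir, year + '_article.txt')
        for file_ in sorted(files):
            if file_.lower().endswith('.pdf'):
                yield os.path.join(root, file_), year_txt
```

In the original script, the outer loops could then become `for article_path, year_being_recorded in pdfs_with_year_file('articles'):`, with the convert_from_path and image_ocr calls left unchanged inside it. Keeping the directory walk separate from the OCR step also makes it possible to check the year-to-file mapping without running Tesseract.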


python python-tesseract