TypeError: __init__() takes 1 positional argument but 2 were given when using Python multiprocessing with pytesseract

Problem description

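When I try to use Python's multiprocessing library together with pytesseract and pdf2image, I get the error message below, and I am not sure what it means or how to fix it. Other posts I have found with a similar message involve passing self as an argument to a method of a class, but I am not creating a class here.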

C:\Users\erik7>python "C:\Users\erik7\Documents\Python Projects\multiprocess_test2.py"
0
Exception in thread Thread-11:
Traceback (most recent call last):
  File "C:\Users\erik7\AppData\Local\Programs\Python\python38-32\lib\threading.py", line 932, in _bootstrap_inner
    self.run()
  File "C:\Users\erik7\AppData\Local\Programs\Python\python38-32\lib\threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "C:\Users\erik7\AppData\Local\Programs\Python\python38-32\lib\multiprocessing\pool.py", line 576, in _handle_results
    task = get()
  File "C:\Users\erik7\AppData\Local\Programs\Python\python38-32\lib\multiprocessing\connection.py", line 251, in recv
    return _ForkingPickler.loads(buf.getbuffer())
TypeError: __init__() takes 1 positional argument but 2 were given
1
2
3
4
5
6
7
8
9

My code

import pytesseract
import pdf2image
import multiprocessing


def extract(img,page_num):
    
    print(page_num)
    
    return pytesseract.image_to_osd(img,output_type = pytesseract.Output.DICT)['orientation']


if __name__ == "__main__":

    pdf_path = r"C:/Users/erik7/Documents/Late Scans for Testing/scans_template2.pdf"
    output_fmt = 'jpeg'
    img_dpi = 300
    pop_path = r"C:\Users\erik7\Downloads\poppler-0.90.1\bin"
    output_path = r"C:\Users\erik7\Downloads"
    
    pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"
    
    converted_path = r"C:\Users\erik7\Downloads\converted_images"
    converted = pdf2image.convert_from_path(pdf_path = pdf_path,fmt = output_fmt,dpi = img_dpi,poppler_path = pop_path,output_folder = converted_path,grayscale = True,thread_count = 2)

    results = [] 
    
    iterable = [[img,page_num] for page_num,img in enumerate(converted)]
    p = multiprocessing.Pool()
    r = p.starmap(extract,iterable)
    results.append(r)
    p.close()
    
    print("\n**PROCESS COMPLETED SUCCESSFULLY")

Solution

Got it working. I needed to move pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe" into my extract function, after which the program ran successfully with multiprocessing:

import pytesseract
import pdf2image
import multiprocessing


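def extract(img,page_num):
    
    # Set the Tesseract path inside the worker function so every worker process sees it
    pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"
    
    print(page_num)
    
    return pytesseract.image_to_osd(img,output_type = pytesseract.Output.DICT)['orientation']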


if __name__ == "__main__":

    pdf_path = r"C:/Users/erik7/Documents/Late Scans for Testing/scans_template2.pdf"
    output_fmt = 'jpeg'
    img_dpi = 300
    pop_path = r"C:\Users\erik7\Downloads\poppler-0.90.1\bin"
    output_path = r"C:\Users\erik7\Downloads"
    
    converted_path = r"C:\Users\erik7\Downloads\converted_images"
    converted = pdf2image.convert_from_path(pdf_path = pdf_path,fmt = output_fmt,dpi = img_dpi,poppler_path = pop_path,output_folder = converted_path,grayscale = True,thread_count = 2)

    results = [] 
    
    iterable = [[img,page_num] for page_num,img in enumerate(converted)]
    p = multiprocessing.Pool()
    r = p.starmap(extract,iterable)
    results.append(r)
    p.close()
    
    print("\n**PROCESS COMPLETED SUCCESSFULLY")
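
As for why the original version fails, here is a likely explanation, as far as I can tell. On Windows, multiprocessing starts each worker as a fresh interpreter that re-imports the script, so anything under if __name__ == "__main__" (including the tesseract_cmd assignment) never runs in the workers. A worker that cannot find Tesseract presumably raises an exception, which the pool pickles and sends back to the parent. If that exception class defines an __init__ that accepts no arguments, the parent cannot rebuild it, which would match the TypeError being raised in recv while loading the result. The sketch below reproduces only that mechanism, without pytesseract. ToolNotFoundError and rebuild_like_pickle are hypothetical names, and the assumption that the real exception has this shape is mine, not something the traceback confirms.

```python
class ToolNotFoundError(Exception):
    """Hypothetical stand-in: an exception whose __init__ takes no arguments."""

    def __init__(self):
        # The message is fixed here, so callers never pass one in.
        super().__init__("tool is not installed or it's not in your PATH")


def rebuild_like_pickle(exc):
    """Rebuild an exception the way unpickling does by default.

    Exceptions reduce to (class, args); loading then calls class(*args).
    """
    cls, args = exc.__reduce__()[:2]
    return cls(*args)


try:
    rebuild_like_pickle(ToolNotFoundError())
    outcome = "rebuilt without error"
except TypeError as err:
    # args holds the message, but __init__ accepts only self.
    outcome = f"TypeError: {err}"

print(outcome)
```

If this is indeed the cause, the TypeError is hiding the real failure in the worker. Wrapping the body of extract in a try/except that prints or returns the original exception text should reveal it.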