How to preserve whitespace when reading PDF documents with Tesseract OCR in ASP.NET Core

Problem description

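I want to read a PDF document as text with all of its whitespace preserved. Below is my API code. I have already tried the approach from the linked question "How to preserve document structure in tesseract".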

        [Authorize]
        [HttpPost, DisableRequestSizeLimit]
        [Route("ocr/extract-pdf")]
        [ResponseCache(Location = ResponseCacheLocation.None, NoStore = true)]
        public async Task<JsonResult> ExtractPDF(IFormFile file)
        {

            try
            {

                if (file == null)
                {
                    return new JsonResult(new
                    {
                        code = HttpStatusCode.NotFound,messages = new string[] { "File not found" }
                    });
                }
                if (Path.GetExtension(file.FileName) != ".pdf")
                {
                    return new JsonResult(new
                    {
                        code = HttpStatusCode.NotFound,messages = new string[] { "Invalid file extension, please upload a .pdf file" }
                    });
                }

                string contentRootPath = _hostingEnvironment.ContentRootPath;
                
                string filePath = await DocumentUtil.SaveFiletodisk(contentRootPath + "\\assets\\OCR-PDFS",file);
                string text = TesseractOCRMapper.ExtractPDFUsingOCR(filePath);
                

                return new JsonResult(new
                {
                    code = HttpStatusCode.OK,data = text,messages = new string[] { "Data extracted successfully" }
                });

            }
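            catch (Exception ex)
            {
                _logger.LogError(this.GetType().Name + "." + Logger.GetCurrentMethod(),"Error saving data mapper: " + ex.Message,ex);

                // Return an error payload so that every code path returns a value.
                return new JsonResult(new
                {
                    code = HttpStatusCode.InternalServerError,messages = new string[] { "Failed to extract text from the PDF" }
                });
            }
        }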

Here is my OCR class file:

public static class TesseractOCRMapper
    {
        public static string ExtractPDFUsingOCR(string filePath)
        {
            var documentText = new StringBuilder();

            using (var pdf = new PdfDocument(filePath))
            {
                using (var engine = new TesseractEngine(@"tessdata","eng",EngineMode.Default))
                {
                    for (int i = 0; i < pdf.PageCount; ++i)
                    {
                        if (documentText.Length > 0)
                            documentText.Append("\r\n\r\n");

                        PdfPage page = pdf.Pages[i];
                        string searchableText = page.GetText();

                        // Simple check if the page contains searchable text.
                        // We do not need to perform OCR in that case.
                        if (!string.IsNullOrEmpty(searchableText.Trim()))
                        {
                            documentText.Append(searchableText);
                            continue;
                        }

                        // This page is not searchable.
                        // Save the page as a high-resolution image
                        PdfDrawOptions options = PdfDrawOptions.Create();
                        options.BackgroundColor = new PdfRgbColor(255,255,255);
                        options.HorizontalResolution = 300;
                        options.VerticalResolution = 300;

                        string pageImage = $"page_{i}.png";
                        page.Save(pageImage,options);

                        // Perform OCR
                        using (Pix img = Pix.LoadFromFile(pageImage))
                        {
                            using (Page recognizedPage = engine.Process(img))
                            {
                                Console.WriteLine($"Mean confidence for page #{i}: {recognizedPage.GetMeanConfidence()}");

                                string recognizedText = recognizedPage.GetText();
                                documentText.Append(recognizedText);
                            }
                        }

                        File.Delete(pageImage);
                    }
                }
            }

            using (var writer = new StreamWriter("result.txt"))
                writer.Write(documentText.ToString());

            DocumentUtil.RemoveFile(filePath);

            return documentText.ToString();
        }
    }

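Following those links, I searched further and created a file named ocrSettins, placed it in the tessdata/config folder, and added a line like preserve_interword_spaces 1 to it, but the text extracted from the PDF still does not keep the spaces.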
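One thing worth checking is that the code above never tells the engine about the config file: the TesseractEngine constructor is called with only the data path, language, and engine mode. The sketch below shows two ways the setting might be applied with the Tesseract .NET wrapper. It is not a confirmed fix. It assumes the wrapper's constructor overload that takes a config file name and its SetVariable method. It also assumes that Tesseract resolves config names under the "configs" subfolder of tessdata, so a file placed in tessdata/config may not be found. The class name SpacePreservingOcr and the method Recognize are hypothetical.

```csharp
using System;
using Tesseract;

// Hypothetical helper, not part of the original code.
public static class SpacePreservingOcr
{
    // Runs OCR on a single page image and tries to keep inter-word spacing.
    public static string Recognize(string imagePath)
    {
        // Option 1: pass the config file name to the constructor.
        // Assumption: the file is resolved relative to the tessdata folder,
        // usually from its "configs" subfolder.
        using (var engine = new TesseractEngine(@"tessdata", "eng", EngineMode.Default, "ocrSettins"))
        {
            // Option 2: set the variable directly, which does not depend on
            // the config file being found. SetVariable returns false when the
            // engine does not accept the name or value.
            bool accepted = engine.SetVariable("preserve_interword_spaces", "1");
            if (!accepted)
            {
                Console.WriteLine("preserve_interword_spaces was not accepted by the engine.");
            }

            using (Pix img = Pix.LoadFromFile(imagePath))
            using (Page page = engine.Process(img))
            {
                return page.GetText();
            }
        }
    }
}
```

Either option alone may be enough. Note that this only affects pages that go through OCR: pages with searchable text are returned by page.GetText() from the PDF library, so their spacing is not influenced by any Tesseract setting.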

