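Problem description
I want to read a PDF document as text with all of its whitespace preserved. My API code is below. I have already tried the approach from the linked question "How to preserve document structure in tesseract".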
[Authorize]
[HttpPost, DisableRequestSizeLimit]
[Route("ocr/extract-pdf")]
[ResponseCache(Location = ResponseCacheLocation.None, NoStore = true)]
public async Task<JsonResult> ExtractPDF(IFormFile file)
{
    try
    {
        if (file == null)
        {
            return new JsonResult(new
            {
                code = HttpStatusCode.NotFound,
                messages = new string[] { "File not found" }
            });
        }
        if (Path.GetExtension(file.FileName) != ".pdf")
        {
            return new JsonResult(new
            {
                code = HttpStatusCode.NotFound,
                messages = new string[] { "Invalid file extension, please upload a .pdf file" }
            });
        }
        // Save the upload to disk, then extract its text (OCR is used for non-searchable pages).
        string contentRootPath = _hostingEnvironment.ContentRootPath;
        string filePath = await DocumentUtil.SaveFiletodisk(contentRootPath + "\\assets\\OCR-PDFS", file);
        string text = TesseractOCRMapper.ExtractPDFUsingOCR(filePath);
        return new JsonResult(new
        {
            code = HttpStatusCode.OK,
            data = text,
            messages = new string[] { "Data extracted successfully" }
        });
    }
    catch (Exception ex)
    {
        _logger.LogError(this.GetType().Name + "." + Logger.GetCurrentMethod(), "Error saving data mapper: " + ex.Message, ex);
        // Return an error result so that every code path returns a value.
        return new JsonResult(new
        {
            code = HttpStatusCode.InternalServerError,
            messages = new string[] { "Failed to extract text from the PDF" }
        });
    }
}
Here is my OCR class file:
using System;
using System.IO;
using System.Text;
using Tesseract;
// Also required: the namespace of the PDF library that provides PdfDocument (not named in the question).

public static class TesseractOCRMapper
{
    public static string ExtractPDFUsingOCR(string filePath)
    {
        var documentText = new StringBuilder();
        using (var pdf = new PdfDocument(filePath))
        {
            using (var engine = new TesseractEngine(@"tessdata", "eng", EngineMode.Default))
            {
                for (int i = 0; i < pdf.PageCount; ++i)
                {
                    // Separate pages with a blank line.
                    if (documentText.Length > 0)
                        documentText.Append("\r\n\r\n");
                    PdfPage page = pdf.Pages[i];
                    string searchableText = page.GetText();
                    // Simple check if the page contains searchable text.
                    // We do not need to perform OCR in that case.
                    if (!string.IsNullOrWhiteSpace(searchableText))
                    {
                        documentText.Append(searchableText);
                        continue;
                    }
                    // This page is not searchable.
                    // Save the page as a high-resolution image.
                    PdfDrawOptions options = PdfDrawOptions.Create();
                    options.BackgroundColor = new PdfRgbColor(255, 255, 255);
                    options.HorizontalResolution = 300;
                    options.VerticalResolution = 300;
                    string pageImage = $"page_{i}.png";
                    page.Save(pageImage, options);
                    // Perform OCR on the rendered page image.
                    using (Pix img = Pix.LoadFromFile(pageImage))
                    {
                        using (Page recognizedPage = engine.Process(img))
                        {
                            Console.WriteLine($"Mean confidence for page #{i}: {recognizedPage.GetMeanConfidence()}");
                            string recognizedText = recognizedPage.GetText();
                            documentText.Append(recognizedText);
                        }
                    }
                    // Remove the temporary page image.
                    File.Delete(pageImage);
                }
            }
        }
        // Write a copy of the extracted text to result.txt.
        using (var writer = new StreamWriter("result.txt"))
            writer.Write(documentText.ToString());
        DocumentUtil.RemoveFile(filePath);
        return documentText.ToString();
    }
}
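Based on that link and a few others I found while searching, I created a file named ocrSettins, placed it in the tessdata/config folder, and added the following line to it:
preserve_interword_spaces 1
But I still cannot read the PDF with its spaces preserved.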
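One thing worth checking, sketched here under assumptions: the engine in the class above is constructed without naming any config file, so a file that merely sits under tessdata is not necessarily loaded. The sketch below assumes the `Tesseract` NuGet wrapper (the one that provides `TesseractEngine`, `Pix` and `Page`) and sets the variable on the engine directly. `OcrSpacingSketch` and `RecognizeKeepingSpaces` are hypothetical names introduced for illustration, and whether this resolves the spacing problem for a given document has not been verified.

```csharp
using System;
using Tesseract;

// Hypothetical helper, not part of the original code.
public static class OcrSpacingSketch
{
    public static string RecognizeKeepingSpaces(string imagePath)
    {
        // Alternative (assumption about file placement): the wrapper also has a
        // constructor overload that takes a config file name as a fourth argument.
        using (var engine = new TesseractEngine(@"tessdata", "eng", EngineMode.Default))
        {
            // Set the variable in code instead of relying on a config file being picked up.
            bool applied = engine.SetVariable("preserve_interword_spaces", "1");
            if (!applied)
                Console.WriteLine("preserve_interword_spaces was not accepted by the engine.");

            using (Pix img = Pix.LoadFromFile(imagePath))
            using (Page page = engine.Process(img))
            {
                return page.GetText();
            }
        }
    }
}
```

Note that this setting can only influence pages that actually go through OCR. In the class above, pages that already contain a text layer are returned by `page.GetText()` from the PDF library and never reach Tesseract, so their spacing depends on that library rather than on any Tesseract variable.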
Solution
No working solution to this problem has been found yet.