Export all table data from a PDF to Excel using Amazon Textract

Problem description

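I want to use Amazon Textract to extract data from PDFs into Excel/CSV. How do I feed in a PDF file from a local folder?

The PDFs contain multiple tables. We need to extract every table from its page and export the data to a CSV/Excel file for further analysis.

I received the following code from AWS, but I can't figure out how to pass the input PDF file to the script.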

import sys
from pprint import pprint

import boto3


def get_rows_columns_map(table_result,blocks_map):
    rows = {}
    for relationship in table_result['Relationships']:
        if relationship['Type'] == 'CHILD':
            for child_id in relationship['Ids']:
                cell = blocks_map[child_id]
                if cell['BlockType'] == 'CELL':
                    row_index = cell['RowIndex']
                    col_index = cell['ColumnIndex']
                    if row_index not in rows:
                        # create new row
                        rows[row_index] = {}
                        
                    # get the text value
                    rows[row_index][col_index] = get_text(cell,blocks_map)
    return rows


def get_text(result,blocks_map):
    text = ''
    if 'Relationships' in result:
        for relationship in result['Relationships']:
            if relationship['Type'] == 'CHILD':
                for child_id in relationship['Ids']:
                    word = blocks_map[child_id]
                    if word['BlockType'] == 'WORD':
                        text += word['Text'] + ' '
                    if word['BlockType'] == 'SELECTION_ELEMENT':
                        if word['SelectionStatus'] =='SELECTED':
                            text +=  'X '    
    return text


def get_table_csv_results(file_name):

    with open(file_name,'rb') as file:
        img_test = file.read()
        bytes_test = bytearray(img_test)
        print('Image loaded',file_name)

    # process using image bytes
    # get the results
    client = boto3.client('textract')

    response = client.analyze_document(Document={'Bytes': bytes_test},FeatureTypes=['TABLES'])

    # Get the text blocks
    blocks=response['Blocks']
    pprint(blocks)

    blocks_map = {}
    table_blocks = []
    for block in blocks:
        blocks_map[block['Id']] = block
        if block['BlockType'] == "TABLE":
            table_blocks.append(block)

    if len(table_blocks) <= 0:
        return "<b> NO Table FOUND </b>"

    csv = ''
    for index,table in enumerate(table_blocks):
        csv += generate_table_csv(table,blocks_map,index +1)
        csv += '\n\n'

    return csv

def generate_table_csv(table_result, blocks_map, table_index):
    rows = get_rows_columns_map(table_result,blocks_map)

    table_id = 'Table_' + str(table_index)
    
    # get cells.
    csv = 'Table: {0}\n\n'.format(table_id)

    for row_index,cols in rows.items():
        
        for col_index,text in cols.items():
            csv += '{}'.format(text) + ","
        csv += '\n'
        
    csv += '\n\n\n'
    return csv

def main(file_name):
    table_csv = get_table_csv_results(file_name)

    output_file = 'output.csv'

    # replace content
    with open(output_file,"wt") as fout:
        fout.write(table_csv)

    # show the results
    print('CSV OUTPUT FILE: ',output_file)


if __name__ == "__main__":
    file_name = sys.argv[1]
    main(file_name)

Sample PDF file: Click Here

Solution

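First, set up the required AWS environment: install awscli and configure it with your AWS credentials. After that, you only need to install the required libraries and change the last lines of the code: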

if __name__ == "__main__":
    file_name = "name_image.png"
    main(file_name)
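
If you would rather not hard-code a single file name, here is a minimal sketch of running the extraction over every file in a local folder. `find_input_files` and `process_folder` are hypothetical helpers of my own, not part of the AWS sample, and `extract` stands for any function that takes a file path and returns CSV text, such as `get_table_csv_results` from the question. The list of extensions is an assumption; adjust it to what you actually send to Textract.

```python
from pathlib import Path

# Assumption: the file types you want to send to Textract; adjust as needed.
SUPPORTED_EXTENSIONS = {".pdf", ".png", ".jpg", ".jpeg"}


def find_input_files(folder):
    """Return the supported files directly inside `folder`, sorted by name."""
    return sorted(
        path for path in Path(folder).iterdir()
        if path.is_file() and path.suffix.lower() in SUPPORTED_EXTENSIONS
    )


def process_folder(folder, extract, output_dir):
    """Run `extract` on each input file and write one CSV per input file.

    `extract` is any callable that takes a file path (str) and returns
    CSV text, for example get_table_csv_results from the question.
    """
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for path in find_input_files(folder):
        csv_text = extract(str(path))
        # Keep the original extension in the name so a.pdf and a.png
        # do not overwrite each other's output.
        out_path = output_dir / (path.name + ".csv")
        out_path.write_text(csv_text, encoding="utf-8")
        written.append(out_path)
    return written
```

With the script's functions in scope, a call might look like `process_folder("input_pdfs", get_table_csv_results, "output_csv")`, where both folder names are placeholders.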

I recommend reading this article on setting up your AWS environment:

https://medium.com/@victorjatoba10/extract-tables-and-forms-from-pdf-using-amazon-aws-textract-827c6e866453
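
One caveat for multi-page PDFs: to my knowledge, the synchronous `analyze_document` call used in the script handles only single-page documents, and multi-page PDFs go through the asynchronous `start_document_analysis` / `get_document_analysis` pair with the file stored in S3. Below is a rough sketch of collecting all blocks from such a job. `collect_blocks`, the bucket and key names, and the polling interval are my own placeholders, so check them against the Textract documentation before relying on them.

```python
import time


def collect_blocks(client, job_id, poll_seconds=5, sleep=time.sleep):
    """Wait for an asynchronous Textract job and return all of its blocks.

    `client` is expected to be boto3.client('textract'); it is passed in so
    the function can be exercised without AWS access.
    """
    # Poll until the job leaves the IN_PROGRESS state.
    response = client.get_document_analysis(JobId=job_id)
    while response["JobStatus"] == "IN_PROGRESS":
        sleep(poll_seconds)
        response = client.get_document_analysis(JobId=job_id)

    # Anything other than SUCCEEDED is treated as an error in this sketch.
    if response["JobStatus"] != "SUCCEEDED":
        raise RuntimeError("Textract job ended with status " + response["JobStatus"])

    # Results are paginated: follow NextToken until it is absent.
    blocks = list(response.get("Blocks", []))
    while "NextToken" in response:
        response = client.get_document_analysis(
            JobId=job_id, NextToken=response["NextToken"]
        )
        blocks.extend(response.get("Blocks", []))
    return blocks


# Usage sketch (bucket and key are placeholders; the PDF must already be in S3):
#
#   import boto3
#   client = boto3.client('textract')
#   job = client.start_document_analysis(
#       DocumentLocation={'S3Object': {'Bucket': 'my-bucket', 'Name': 'my.pdf'}},
#       FeatureTypes=['TABLES'],
#   )
#   blocks = collect_blocks(client, job['JobId'])
```

The returned list can stand in for `response['Blocks']` in `get_table_csv_results`. Each block should also carry a `Page` field, which helps if you want to group the tables by page.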