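How to extract specific tables from multiple PDFs in Python

Problem description

I have a database of PDF files that I downloaded via web scraping. I can extract the tables from these PDF files and display them in a Jupyter notebook, as shown below: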

import os
import camelot.io as camelot
n = 1

pdf_dir = r'D:\Test'  # folder that holds the downloaded PDFs
arr = os.listdir(pdf_dir)  # arr is the list of PDF file names
for item in arr:
    # os.listdir returns bare file names, so join them with the folder
    tables = camelot.read_pdf(os.path.join(pdf_dir, item), pages='all', split_text=True)
    print(f'''DATENBLATT {n}: {item}

    ''')
    n += 1
    for tabs in tables:
        print(tabs.df,"\n==============================================================================\n")

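This gives me the following results for the two PDF files in the database.

(PDF1, PDF2)

Now I would like to know how to get specific data from the tables that contain information such as "voltage" and "current". More specifically, I want to extract user-defined or target information and plot those values, rather than printing the tables in full.

Thanks in advance.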

DATENBLATT 1: HY-Energy-Plus-Peak-Pack-HYP-00-2972-R2.pdf

    
                         0                                                  1
0              Part Number                                        HYP-00-2972
1          Voltage Nominal                                              51.8V
2    Voltage Range Min/Max                                        43.4V/58.1V
3           Charge Current  160A maximum \nDe-rated by BMS message over CA...
4        Discharge Current  300A maximum \nDe-rated by BMS message over CA...
5         Maximum Capacity                                    5.76kWh/111.4Ah
6   Maximum Energy Density                                           164Wh/kg
7         Useable capacity         Limited to 90% by BMS to improve cell life
8               Dimensions                       W: 243 x L: 352 x H: 300.5mm
9                   Weight                                               37kg
10       Mounting Fixtures     4x M8 mounting points for easy secure mounting
11                                                                            
==============================================================================

                                 0  \
0           Communication Protocol   
1             Reported information   
2        Pack Protection Mechanism   
3                 Balancing Method   
4             Multi-Pack Behaviour   
5  Compatible Chargers as standard   
6                  Charger Control   
7             Auxiliary Connectors   
8                 Power connectors   
9                                    

                                                   1  
0  CAN bus at user selectable baud rate (propriet...  
1  Cell Temperatures and Voltages,Pack Current,...  
2  Interlock to control external protection devic...  
3          Actively controlled dissipative balancing  
4  BMS implements a single master and multi-slave...  
5  Zivan,Victron,Delta-Q,TC-Charger,SPE. For ...  
6  Direct current control based on cell voltage/t...  
7              Binder 720-Series 8-way male & female  
8  4x Amphenol SurLok Plus 8mm \nWhen using batte...  
9                                                      
==============================================================================

                              0  \
0     Max no of packs in series   
1  Max Number of Parallel Packs   
2  External System Requirements   
3                                 

                                                   1  
0                                                 10  
1                                                127  
2  External Protection Device (e.g. Contactor) co...  
3                                                      
==============================================================================

DATENBLATT 2: HY-Energy-Standard-Pack-HYP-00-2889-R2.pdf

    
                         0                                                  1
0              Part Number                                        HYP-00-2889
1          Voltage Nominal                                              44.4V
2    Voltage Range Min/Max                                        37.2V/49.8V
3           Charge Current  132A maximum \nDe-rated by BMS message over CA...
4        Discharge Current  132A maximum \nDe-rated by BMS message over CA...
5         Maximum Capacity                                      4.94kWh/111Ah
6   Maximum Energy Density                                           152Wh/kg
7         Useable capacity         Limited to 90% by BMS to improve cell life
8               Dimensions                         W: 243 x L: 352 x H: 265mm
9                   Weight                                               32kg
10       Mounting Fixtures     4x M8 mounting points for easy secure mounting
11                                                                            
==============================================================================

                                 0  \
0           Communication Protocol   
1             Reported information   
2        Pack Protection Mechanism   
3                 Balancing Method   
4             Multi-Pack Behaviour   
5  Compatible Chargers as standard   
6                  Charger Control   
7             Auxiliary Connectors   
8                 Power connectors   
9                                    

                                                   1  
0  CAN bus at user selectable baud rate (propriet...  
1  Cell Temperatures and Voltages,SPE,Bass...  
6  Direct current control based on cell voltage/t...  
7              Binder 720-Series 8-way male & female  
8  4x Amphenol SurLok Plus 8mm \nWhen using batte...  
9                                                      
==============================================================================

                              0  \
0     Max no of packs in series   
1  Max Number of Parallel Packs   
2  External System Requirements   
3                                 

                                                   1  
0                                                 12  
1                                                127  
2  External Protection Device (e.g. Contactor) co...  
3                                                      
==============================================================================

Solution

You can define a list of strings of interest, then select only the tables that contain at least one of those strings.

import os
import camelot.io as camelot
n = 1

# define your strings of interest (lowercase, because the table text is lowercased below)
interesting_strings=["voltage","current"]

pdf_dir = r'D:\Test'  # folder that holds the downloaded PDFs
arr = os.listdir(pdf_dir)  # arr is the list of PDF file names
for item in arr:
    # os.listdir returns bare file names, so join them with the folder
    tables = camelot.read_pdf(os.path.join(pdf_dir, item), pages='all', split_text=True)
    print(f'''DATENBLATT {n}: {item}

    ''')
    n += 1
    for tabs in tables:
        # select only tables which contain at least one of the interesting strings
        if any(s in tabs.df.to_string().lower() for s in interesting_strings):
            print(tabs.df,"\n==============================================================================\n")
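The same test can be wrapped in a small helper so that matching tables are collected instead of printed. The following is a minimal sketch, not part of the original answer: `table_mentions` and `select_tables` are hypothetical names, and the two hand-built DataFrames stand in for `tabs.df` so that the snippet runs without any PDF files.

```python
import pandas as pd


def table_mentions(df, keywords):
    """Return True if any keyword occurs anywhere in the table (case-insensitive)."""
    text = df.to_string().lower()
    return any(k.lower() in text for k in keywords)


def select_tables(frames, keywords):
    """Keep only the DataFrames that mention at least one keyword."""
    return [df for df in frames if table_mentions(df, keywords)]


# Stand-ins for the `tabs.df` objects that camelot would return.
electrical = pd.DataFrame([["Voltage Nominal", "51.8V"], ["Weight", "37kg"]])
mechanical = pd.DataFrame([["Max no of packs in series", "10"]])

selected = select_tables([electrical, mechanical], ["voltage", "current"])
print(len(selected))
```

In the loop above, `select_tables([t.df for t in tables], interesting_strings)` would presumably give the list of tables worth keeping for each PDF.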

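If you only want to search for the interesting strings in a specific place (for example, in the first column), you can use Pandas DataFrame accessors such as iloc. Note that iloc[:, 0] selects the first column, whereas iloc[0] would select the first row:

any(s in tabs.df.iloc[:, 0].to_string().lower() for s in interesting_strings)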
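To get at the values themselves rather than whole tables, one option is to filter the rows by their label and pull the leading number out of the value cell. This is only a sketch under assumptions: `rows_matching` and `first_number` are hypothetical helpers, the DataFrame is a hand-built stand-in for one `tabs.df` using values from the first data sheet above, and the labels are assumed to sit in the first column with the values in the second.

```python
import re

import pandas as pd


def rows_matching(df, keywords):
    """Return the rows whose first column contains any keyword (case-insensitive)."""
    pattern = "|".join(re.escape(k) for k in keywords)
    labels = df.iloc[:, 0].astype(str)
    return df[labels.str.contains(pattern, case=False, regex=True)]


def first_number(text):
    """Return the first number found in a cell as a float, or None if there is none."""
    match = re.search(r"\d+(?:\.\d+)?", str(text))
    return float(match.group()) if match else None


# Stand-in for one `tabs.df`, using values from the first data sheet above.
df = pd.DataFrame([
    ["Part Number", "HYP-00-2972"],
    ["Voltage Nominal", "51.8V"],
    ["Charge Current", "160A maximum"],
    ["Weight", "37kg"],
])

hits = rows_matching(df, ["voltage", "current"])
values = {label: first_number(cell)
          for label, cell in zip(hits.iloc[:, 0], hits.iloc[:, 1])}
print(values)
```

Only the first number in each cell is kept, so a cell such as "43.4V/58.1V" would need its own parsing. Collecting one such dictionary per PDF gives plain numbers that can then be passed to a plotting library.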

