问题描述
我有一个通过网络抓取下载的 PDF 文件数据库。我可以从这些 PDF 文件中提取表格并在 jupyter notebook 中将它们可视化,如下所示:
import os
import camelot.io as camelot
n = 1
arr = os.listdir('D:\Test') # arr ist die Liste der PDF-Titel
for item in arr:
tables = camelot.read_pdf(item,pages='all',split_text=True)
print(f'''DATENblatT {n}: {item}
''')
n += 1
for tabs in tables:
print(tabs.df,"\n==============================================================================\n")
现在我想问一下如何才能从包含例如“电压”和“电流”信息的表中获取特定数据。更具体地说,我想提取用户定义或目标信息并使用这些值制作图表,而不是将它们整体打印。
提前致谢。
DATENblatT 1: HY-Energy-Plus-Peak-Pack-HYP-00-2972-R2.pdf
0 1
0 Part Number HYP-00-2972
1 Voltage Nominal 51.8V
2 Voltage Range Min/Max 43.4V/58.1V
3 Charge Current 160A maximum \nDe-rated by BMS message over CA...
4 discharge Current 300A maximum \nDe-rated by BMS message over CA...
5 Maximum Capacity 5.76kWh/111.4Ah
6 Maximum Energy Density 164Wh/kg
7 Useable capacity Limited to 90% by BMS to improve cell life
8 Dimensions W: 243 x L: 352 x H: 300.5mm
9 Weight 37kg
10 Mounting Fixtures 4x M8 mounting points for easy secure mounting
11
==============================================================================
0 \
0 Communication Protocol
1 Reported information
2 Pack Protection Mechanism
3 Balancing Method
4 Multi-Pack BehavIoUr
5 Compatible Chargers as standard
6 Charger Control
7 Auxiliary Connectors
8 Power connectors
9
1
0 CAN bus at user selectable baud rate (propriet...
1 Cell Temperatures and Voltages,Pack Current,...
2 Interlock to control external protection devic...
3 Actively controlled dissipative balancing
4 BMS implements a single master and multi-slave...
5 Zivan,Victron,Delta-Q,TC-Charger,SPE. For ...
6 Direct current control based on cell voltage/t...
7 Binder 720-Series 8-way male & female
8 4x Amphenol SurLok Plus 8mm \nWhen using batte...
9
==============================================================================
0 \
0 Max no of packs in series
1 Max Number of Parallel Packs
2 External System Requirements
3
1
0 10
1 127
2 External Protection Device (e.g. Contactor) co...
3
==============================================================================
DATENblatT 2: HY-Energy-Standard-Pack-HYP-00-2889-R2.pdf
0 1
0 Part Number HYP-00-2889
1 Voltage Nominal 44.4V
2 Voltage Range Min/Max 37.2V/49.8V
3 Charge Current 132A maximum \nDe-rated by BMS message over CA...
4 discharge Current 132A maximum \nDe-rated by BMS message over CA...
5 Maximum Capacity 4.94kWh/111Ah
6 Maximum Energy Density 152Wh/kg
7 Useable capacity Limited to 90% by BMS to improve cell life
8 Dimensions W: 243 x L: 352 x H: 265mm
9 Weight 32kg
10 Mounting Fixtures 4x M8 mounting points for easy secure mounting
11
==============================================================================
0 \
0 Communication Protocol
1 Reported information
2 Pack Protection Mechanism
3 Balancing Method
4 Multi-Pack BehavIoUr
5 Compatible Chargers as standard
6 Charger Control
7 Auxiliary Connectors
8 Power connectors
9
1
0 CAN bus at user selectable baud rate (propriet...
1 Cell Temperatures and Voltages,SPE,Bass...
6 Direct current control based on cell voltage/t...
7 Binder 720-Series 8-way male & female
8 4x Amphenol SurLok Plus 8mm \nWhen using batte...
9
==============================================================================
0 \
0 Max no of packs in series
1 Max Number of Parallel Packs
2 External System Requirements
3
1
0 12
1 127
2 External Protection Device (e.g. Contactor) co...
3
==============================================================================
解决方法
您可以定义感兴趣的字符串列表;
然后只选择至少包含这些字符串之一的表。
import os
import camelot.io as camelot
n = 1
# define your strings of interest
interesting_strings=["voltage","current"]
arr = os.listdir('D:\Test') # arr ist die Liste der PDF-Titel
for item in arr:
tables = camelot.read_pdf(item,pages='all',split_text=True)
print(f'''DATENBLATT {n}: {item}
''')
n += 1
for tabs in tables:
# select only tables which contain at least one of the interesting strings
if any(s in tabs.df.to_string().lower() for s in interesting_strings) :
print(tabs.df,"\n==============================================================================\n")
如果您只想在特定位置(例如,在第一列中)搜索有趣的字符串,则可以使用 Pandas 数据框属性,例如 iloc
:
any(s in tabs.df.iloc[0].to_string().lower() for s in interesting_strings)