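Problem description
I have a database of PDF files that I downloaded via web scraping. I can extract the tables from these PDF files and display them in a Jupyter notebook as follows: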
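import os
import camelot.io as camelot

pdf_dir = r'D:\Test'  # raw string so the backslash is not treated as an escape
n = 1
arr = os.listdir(pdf_dir)  # arr is the list of PDF file names
for item in arr:
    # os.listdir returns bare file names, so join them with the folder path
    tables = camelot.read_pdf(os.path.join(pdf_dir, item), pages='all', split_text=True)
    print(f"DATENBLATT {n}: {item}\n")
    n += 1
    for tabs in tables:
        print(tabs.df, "\n==============================================================================\n")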
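Now I would like to know how to get specific data out of the tables that contain, for example, "voltage" and "current" information. More specifically, I want to extract user-defined or target information and plot those values, rather than printing the tables in full.
Thanks in advance.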
DATENBLATT 1: HY-Energy-Plus-Peak-Pack-HYP-00-2972-R2.pdf
    0                       1
0   Part Number             HYP-00-2972
1   Voltage Nominal         51.8V
2   Voltage Range Min/Max   43.4V/58.1V
3   Charge Current          160A maximum \nDe-rated by BMS message over CA...
4   Discharge Current       300A maximum \nDe-rated by BMS message over CA...
5   Maximum Capacity        5.76kWh/111.4Ah
6   Maximum Energy Density  164Wh/kg
7   Useable capacity        Limited to 90% by BMS to improve cell life
8   Dimensions              W: 243 x L: 352 x H: 300.5mm
9   Weight                  37kg
10  Mounting Fixtures       4x M8 mounting points for easy secure mounting
11
==============================================================================
    0 \
0   Communication Protocol
1   Reported Information
2   Pack Protection Mechanism
3   Balancing Method
4   Multi-Pack Behaviour
5   Compatible Chargers as standard
6   Charger Control
7   Auxiliary Connectors
8   Power connectors
9
    1
0   CAN bus at user selectable baud rate (propriet...
1   Cell Temperatures and Voltages,Pack Current,...
2   Interlock to control external protection devic...
3   Actively controlled dissipative balancing
4   BMS implements a single master and multi-slave...
5   Zivan,Victron,Delta-Q,TC-Charger,SPE. For ...
6   Direct current control based on cell voltage/t...
7   Binder 720-Series 8-way male & female
8   4x Amphenol SurLok Plus 8mm \nWhen using batte...
9
==============================================================================
    0 \
0   Max no of packs in series
1   Max Number of Parallel Packs
2   External System Requirements
3
    1
0   10
1   127
2   External Protection Device (e.g. Contactor) co...
3
==============================================================================
DATENBLATT 2: HY-Energy-Standard-Pack-HYP-00-2889-R2.pdf
    0                       1
0   Part Number             HYP-00-2889
1   Voltage Nominal         44.4V
2   Voltage Range Min/Max   37.2V/49.8V
3   Charge Current          132A maximum \nDe-rated by BMS message over CA...
4   Discharge Current       132A maximum \nDe-rated by BMS message over CA...
5   Maximum Capacity        4.94kWh/111Ah
6   Maximum Energy Density  152Wh/kg
7   Useable capacity        Limited to 90% by BMS to improve cell life
8   Dimensions              W: 243 x L: 352 x H: 265mm
9   Weight                  32kg
10  Mounting Fixtures       4x M8 mounting points for easy secure mounting
11
==============================================================================
    0 \
0   Communication Protocol
1   Reported Information
2   Pack Protection Mechanism
3   Balancing Method
4   Multi-Pack Behaviour
5   Compatible Chargers as standard
6   Charger Control
7   Auxiliary Connectors
8   Power connectors
9
    1
0   CAN bus at user selectable baud rate (propriet...
1   Cell Temperatures and Voltages,SPE,Bass...
6   Direct current control based on cell voltage/t...
7   Binder 720-Series 8-way male & female
8   4x Amphenol SurLok Plus 8mm \nWhen using batte...
9
==============================================================================
    0 \
0   Max no of packs in series
1   Max Number of Parallel Packs
2   External System Requirements
3
    1
0   12
1   127
2   External Protection Device (e.g. Contactor) co...
3
==============================================================================
Solution
You can define a list of strings of interest and then select only the tables that contain at least one of those strings.
import os
import camelot.io as camelot

pdf_dir = r'D:\Test'  # raw string so the backslash is not treated as an escape
n = 1
# define your strings of interest (lowercase, because the table text is lowercased below)
interesting_strings = ["voltage", "current"]
arr = os.listdir(pdf_dir)  # arr is the list of PDF file names
for item in arr:
    # os.listdir returns bare file names, so join them with the folder path
    tables = camelot.read_pdf(os.path.join(pdf_dir, item), pages='all', split_text=True)
    print(f"DATENBLATT {n}: {item}\n")
    n += 1
    for tabs in tables:
        # select only tables which contain at least one of the interesting strings
        if any(s in tabs.df.to_string().lower() for s in interesting_strings):
            print(tabs.df, "\n==============================================================================\n")
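If you only want to search for the strings of interest in a specific location (for example, in the first column), you can use pandas DataFrame indexers such as iloc. Note that iloc[0] selects the first row, whereas iloc[:, 0] selects the first column:
any(s in tabs.df.iloc[:, 0].to_string().lower() for s in interesting_strings)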
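To get from matching tables to values that can be plotted, one option is to pick out the rows whose label contains a keyword and parse the first number in the neighbouring cell. The following is only a minimal sketch, not part of the original answer. The helper name `extract_values` is hypothetical, and it assumes each table has the two-column label/value layout shown in the output above. It works on plain lists of rows, which `tabs.df.values.tolist()` should give you for a Camelot table.

```python
import re

# Matches the first integer or decimal number in a string, e.g. "51.8" in "51.8V".
NUMBER = re.compile(r"[-+]?\d+(?:\.\d+)?")


def extract_values(rows, keywords):
    """Return {label: first number in the value cell} for rows whose
    label (first cell) contains at least one of the lowercase keywords.

    Hypothetical helper: assumes each row is [label, value, ...].
    """
    found = {}
    for row in rows:
        if len(row) < 2:
            continue  # skip rows without a value cell
        label, value = str(row[0]), str(row[1])
        if any(k in label.lower() for k in keywords):
            match = NUMBER.search(value)
            if match:
                found[label] = float(match.group())
    return found


# Sample rows copied from the first table of the output above.
rows = [
    ["Part Number", "HYP-00-2972"],
    ["Voltage Nominal", "51.8V"],
    ["Voltage Range Min/Max", "43.4V/58.1V"],
    ["Charge Current", "160A maximum \nDe-rated by BMS message over CA..."],
    ["Discharge Current", "300A maximum \nDe-rated by BMS message over CA..."],
    ["", ""],
]

values = extract_values(rows, ["voltage", "current"])
print(values)
```

Only the first number in each cell is kept, so a range such as "43.4V/58.1V" yields its minimum; adjust the regular expression if you need both ends. The resulting dictionary can be collected per data sheet and passed to a plotting library of your choice.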