Problem description
I scraped this PDF with Tabula and used the following code, with multiple_tables=True, to create a list of 1410 tables:
from tabula import read_pdf
df = read_pdf("~/Google Drive/DATA/978-1-912036-41-7-Who Owns Whom UK-Ireland-Volume-1.pdf",stream=True,pages='19-1428',guess=False,pandas_options={'header': None},encoding = 'ISO-8859-1',multiple_tables=True,columns=[210,400],area=[60,30,835,1000])
Example, the first table:
df[0000]
0 ... 2
0 NaN ... . Contrarian Group Limited England
1 NaN ... . Cornerstone Study Abroad Limited England
2 ? ... . Crudolife Limited England
3 NaN ... . Crystal Palace Physio Group Limited England
4 NaN ... . Daniels London Limited England
.. ... ... ...
140 . . Kharis Catering C.I.C. England ... . Pivotal Technologies Limited England
141 . . Kvm Limited England ... . Plus Black Limited England
142 . . London College Limited England ... . Plus Tyres Limited England
143 . . London College of Accounting & Finance Lim... ... . Portaplay Limited England
144 . .Millfield & Partners Ltd England ... . Portaplaypen Limited England
[145 rows x 3 columns]
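Question
How can I first concatenate the three columns of each table (one on top of the other) to get a single column, and then concatenate the 1410 tables into one table?
I managed to loop over the list of tables and print each one as a single column, but I cannot get the results into a DataFrame: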
import numpy as np
import pandas as pd
for x, res in enumerate(df):
    print(np.ravel(res)[None].T)
I tried:
for x, res in enumerate(df):
    v = np.ravel(res)[None].T
    result = pd.DataFrame(x,":",v,columns=['t'])
Solution
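One possible approach, offered as a minimal sketch and not a verified answer for this particular PDF: stack the columns of each table with pd.concat, then concatenate the per-table results with a second pd.concat. The helper names stack_columns and combine_tables and the drop_empty flag are hypothetical, introduced here only for illustration. The two small tables stand in for the list that read_pdf returns.

```python
import pandas as pd


def stack_columns(table: pd.DataFrame) -> pd.Series:
    """Put the columns of one table on top of each other (first column, then second, ...)."""
    return pd.concat([table[col] for col in table.columns], ignore_index=True)


def combine_tables(tables, drop_empty=False) -> pd.DataFrame:
    """Stack every table into one column, then join all tables into a single-column DataFrame."""
    stacked = [stack_columns(t) for t in tables]
    combined = pd.concat(stacked, ignore_index=True)
    if drop_empty:
        # Assumption: empty cells (NaN/None) are not wanted in the final column.
        combined = combined.dropna().reset_index(drop=True)
    return combined.to_frame(name="t")


# Stand-in data; with the real data, pass the list returned by read_pdf instead.
tables = [
    pd.DataFrame([["a", "c", "e"], ["b", "d", "f"]]),
    pd.DataFrame([["g", None, "i"], ["h", None, "j"]]),
]
result = combine_tables(tables, drop_empty=True)
print(result)
```

Note that np.ravel(res) in the loop above flattens each table row by row, which interleaves the three columns. If a NumPy route is preferred, res.to_numpy().ravel(order="F") walks the table column by column instead. The pd.DataFrame(x, ":", v, columns=['t']) call fails because the constructor takes the data as its first argument, so x and ":" are not valid there.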