python - How to make the pandas HDFStore 'put' operation faster

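I am trying to build an ETL toolkit with pandas and HDF5.

My plan is:

>Extract a table from MySQL into a DataFrame;
>Put this DataFrame into an HDFStore;

But while doing step 2, I found that putting the DataFrame into the *.h5 file takes far too long.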

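>Size of the table on the source MySQL server: 498 MB

> 52 columns
> 924,624 records

>Size of the *.h5 file after putting the DataFrame into it: 513 MB

>The 'put' operation took 849.345677137 seconds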
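For a rough sense of scale, the figures above can be turned into a throughput number. This is only arithmetic on the reported values (513 MB file, 924,624 rows, 849.345677137 s); the variable names are illustrative.

```python
# Figures reported in the post.
file_size_mb = 513.0
row_count = 924624
put_seconds = 849.345677137

# Effective write throughput of the 'put' call.
mb_per_second = file_size_mb / put_seconds
rows_per_second = row_count / put_seconds

print(f"{mb_per_second:.2f} MB/s")      # about 0.60 MB/s
print(f"{rows_per_second:.0f} rows/s")  # about 1089 rows/s
```

Well under 1 MB/s is far below what a local disk usually sustains, which hints that the time is probably going somewhere other than raw I/O.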

My questions are:
Is this time cost normal?
Is there any way to make it faster?

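Update 1

Thanks, Jeff.

>My code is quite simple:

    from pandas import HDFStore

    # df_staff is the DataFrame extracted from MySQL in step 1
    extract_store = HDFStore('extract_store.h5')
    extract_store['df_staff'] = df_staff

>When I tried 'ptdump -av file.h5' I got an error, but I can still load the DataFrame object from this h5 file: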

    tables.exceptions.HDF5ExtError: HDF5 error back trace

      File "../../../src/H5F.c", line 1512, in H5Fopen
        unable to open file
      File "../../../src/H5F.c", line 1307, in H5F_open
        unable to read superblock
      File "../../../src/H5Fsuper.c", line 305, in H5F_super_read
        unable to find file signature
      File "../../../src/H5Fsuper.c", line 153, in H5F_locate_signature
        unable to find a valid file signature

    End of HDF5 error back trace

    Unable to open/create file 'extract_store.h5'

>Some other information:

>pandas version: '0.10.0'
> OS: Ubuntu Server 10.04 x86_64
> CPU: 8 * Intel(R) Xeon(R) CPU X5670 @ 2.93GHz
> MemTotal: 51634016 kB

I will update pandas to 0.10.1-dev and try again.

Update 2

>I have updated pandas to '0.10.1.dev-6e2b6ea'
>But the time cost did not go down; this time it took 884.15 seconds
>The output of 'ptdump -av file.h5' is:

    / (RootGroup) ''  
      /._v_attrs (AttributeSet), 4 attributes:  
       [CLASS := 'GROUP',  
        PYTABLES_FORMAT_VERSION := '2.0',  
        TITLE := '',  
        VERSION := '1.0']  
    /df_bugs (Group) ''  
      /df_bugs._v_attrs (AttributeSet), 12 attributes:  
       [CLASS := 'GROUP',  
        TITLE := '',  
        VERSION := '1.0',  
        axis0_variety := 'regular',  
        axis1_variety := 'regular',  
        block0_items_variety := 'regular',  
        block1_items_variety := 'regular',  
        block2_items_variety := 'regular',  
        nblocks := 3,  
        ndim := 2,  
        pandas_type := 'frame',  
        pandas_version := '0.10.1']  
    /df_bugs/axis0 (Array(52,)) ''  
      atom := StringAtom(itemsize=19, shape=(), dflt='')  
      maindim := 0  
      flavor := 'numpy'  
      byteorder := 'irrelevant'  
      chunkshape := None  
      /df_bugs/axis0._v_attrs (AttributeSet), 7 attributes:  
       [CLASS := 'ARRAY',  
        FLAVOR := 'numpy',  
        TITLE := '',  
        VERSION := '2.3',  
        kind := 'string',  
        name := None,  
        transposed := True]  
    /df_bugs/axis1 (Array(924624,)) ''  
      atom := Int64Atom(shape=(), dflt=0)  
      maindim := 0  
      flavor := 'numpy'  
      byteorder := 'little'  
      chunkshape := None  
      /df_bugs/axis1._v_attrs (AttributeSet), 7 attributes:  
       [CLASS := 'ARRAY',  
        FLAVOR := 'numpy',  
        TITLE := '',  
        VERSION := '2.3',  
        kind := 'integer',  
        name := None,  
        transposed := True]  
    /df_bugs/block0_items (Array(5,)) ''  
      atom := StringAtom(itemsize=12, shape=(), dflt='')  
      maindim := 0   
      flavor := 'numpy'  
      byteorder := 'irrelevant'  
      chunkshape := None  
      /df_bugs/block0_items._v_attrs (AttributeSet), 7 attributes:  
       [CLASS := 'ARRAY',  
        FLAVOR := 'numpy',  
        TITLE := '',  
        VERSION := '2.3',  
        kind := 'string',  
        name := None,  
        transposed := True]  
    /df_bugs/block0_values (Array(924624, 5)) ''  
      atom := Float64Atom(shape=(), dflt=0.0)  
      maindim := 0  
      flavor := 'numpy'  
      byteorder := 'little'  
      chunkshape := None  
      /df_bugs/block0_values._v_attrs (AttributeSet), 5 attributes:  
       [CLASS := 'ARRAY',  
        FLAVOR := 'numpy',  
        TITLE := '',  
        VERSION := '2.3',  
        transposed := True]  
    /df_bugs/block1_items (Array(19,)) ''  
      atom := StringAtom(itemsize=19, shape=(), dflt='')  
      maindim := 0  
      flavor := 'numpy'  
      byteorder := 'irrelevant'  
      chunkshape := None  
      /df_bugs/block1_items._v_attrs (AttributeSet), 7 attributes:  
       [CLASS := 'ARRAY',  
        FLAVOR := 'numpy',  
        TITLE := '',  
        VERSION := '2.3',  
        kind := 'string',  
        name := None,  
        transposed := True]  
    /df_bugs/block1_values (Array(924624, 19)) ''  
      atom := Int64Atom(shape=(), dflt=0)  
      maindim := 0  
      flavor := 'numpy'  
      byteorder := 'little'  
      chunkshape := None  
      /df_bugs/block1_values._v_attrs (AttributeSet), 5 attributes:  
       [CLASS := 'ARRAY',  
        FLAVOR := 'numpy',  
        TITLE := '',   
        VERSION := '2.3',  
        transposed := True]  
    /df_bugs/block2_items (Array(28,)) ''  
      atom := StringAtom(itemsize=18, shape=(), dflt='')  
      maindim := 0  
      flavor := 'numpy'  
      byteorder := 'irrelevant'  
      chunkshape := None  
      /df_bugs/block2_items._v_attrs (AttributeSet), 7 attributes:  
       [CLASS := 'ARRAY',  
        FLAVOR := 'numpy',  
        TITLE := '',  
        VERSION := '2.3',
        kind := 'string',  
        name := None,  
        transposed := True]  
    /df_bugs/block2_values (VLArray(1,)) ''  
      atom = ObjectAtom()  
      byteorder = 'irrelevant'  
      nrows = 1  
      flavor = 'numpy'  
      /df_bugs/block2_values._v_attrs (AttributeSet), 5 attributes:  
       [CLASS := 'VLARRAY',  
        PSEUDOATOM := 'object',  
        TITLE := '',   
        VERSION := '1.3',  
        transposed := True]  

>I tried your code below (putting the DataFrame into the HDFStore with the 'table' parameter set to True), but got an error; it seems Python's datetime type is not supported:

    Exception: cannot find the correct atom type -> [dtype->object] object of type 'datetime.datetime' has no len()
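The error above suggests that the table format could not map an object column holding `datetime.datetime` values to a PyTables atom. One thing that may be worth trying is converting such columns to pandas' native datetime dtype before the put. A minimal sketch, assuming a tiny stand-in frame: the second row and the choice of `creation_ts` are illustrative, and whether this alone is enough on the 0.10.1 dev build is untested.

```python
import datetime

import pandas as pd

# Hypothetical two-row stand-in for the real frame. dtype=object mimics
# a column of plain datetime.datetime values as returned by a DB driver.
df = pd.DataFrame({
    "bug_id": [1, 2],
    "creation_ts": pd.Series(
        [datetime.datetime(1998, 5, 6, 21, 27, 0),
         datetime.datetime(2012, 5, 9, 14, 41, 41)],
        dtype=object,
    ),
})
dtype_before = df["creation_ts"].dtype

# Convert the object column to a native datetime64 column.
df["creation_ts"] = pd.to_datetime(df["creation_ts"])
dtype_after = df["creation_ts"].dtype

print(dtype_before, "->", dtype_after)
```

Columns that are entirely None (such as `deadline` in this table) would need separate handling, since there is nothing to infer a type from.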

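Update 3

Thanks, Jeff.
Sorry for the delay.

> tables.version: '2.4.0'.
>Yes, 884 seconds is the cost of the put operation alone, not including the pull from MySQL
>One row of the DataFrame (df.ix[0]):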

bug_id                                   1
assigned_to                            185
bug_file_loc                          None
bug_severity                      critical
bug_status                          closed
creation_ts            1998-05-06 21:27:00
delta_ts               2012-05-09 14:41:41
short_desc                    Two cursors.
host_op_sys                        Unknown
guest_op_sys                       Unknown
priority                                P3
rep_platform                          IA32
reporter                                56
product_id                               7
category_id                            983
component_id                         12925
resolution                           fixed
target_milestone                       ws1
qa_contact                             412
status_whiteboard                         
Votes                                    0
keywords                                SR
lastdiffed             2012-05-09 14:41:41
everconfirmed                            1
reporter_accessible                      1
cclist_accessible                        1
estimated_time                        0.00
remaining_time                        0.00
deadline                              None
alias                                 None
found_in_product_id                      0
found_in_version_id                      0
found_in_phase_id                        0
cf_type                             Defect
cf_reported_by                 Development
cf_attempted                           NaN
cf_Failed                              NaN
cf_public_summary                         
cf_doc_impact                            0
cf_security                              0
cf_build                               NaN
cf_branch                                 
cf_change                              NaN
cf_test_id                             NaN
cf_regression                      Unknown
cf_reviewer                              0
cf_on_hold                               0
cf_public_severity                     ---
cf_i18n_impact                           0
cf_eta                                None
cf_bug_source                          ---
cf_viss                               None
Name: 0, Length: 52

>What the DataFrame looks like (just typing 'df' in an IPython notebook):

<class 'pandas.core.frame.DataFrame'>
Int64Index: 924624 entries, 0 to 924623
Data columns:
bug_id                 924624  non-null values
assigned_to            924624  non-null values
bug_file_loc           427318  non-null values
bug_severity           924624  non-null values
bug_status             924624  non-null values
creation_ts            924624  non-null values
delta_ts               924624  non-null values
short_desc             924624  non-null values
host_op_sys            924624  non-null values
guest_op_sys           924624  non-null values
priority               924624  non-null values
rep_platform           924624  non-null values
reporter               924624  non-null values
product_id             924624  non-null values
category_id            924624  non-null values
component_id           924624  non-null values
resolution             924624  non-null values
target_milestone       924624  non-null values
qa_contact             924624  non-null values
status_whiteboard      924624  non-null values
Votes                  924624  non-null values
keywords               924624  non-null values
lastdiffed             924509  non-null values
everconfirmed          924624  non-null values
reporter_accessible    924624  non-null values
cclist_accessible      924624  non-null values
estimated_time         924624  non-null values
remaining_time         924624  non-null values
deadline               0  non-null values
alias                  0  non-null values
found_in_product_id    924624  non-null values
found_in_version_id    924624  non-null values
found_in_phase_id      924624  non-null values
cf_type                924624  non-null values
cf_reported_by         924624  non-null values
cf_attempted           89622  non-null values
cf_Failed              89587  non-null values
cf_public_summary      510799  non-null values
cf_doc_impact          924624  non-null values
cf_security            924624  non-null values
cf_build               327460  non-null values
cf_branch              614929  non-null values
cf_change              300612  non-null values
cf_test_id             12610  non-null values
cf_regression          924624  non-null values
cf_reviewer            924624  non-null values
cf_on_hold             924624  non-null values
cf_public_severity     924624  non-null values
cf_i18n_impact         924624  non-null values
cf_eta                 3910  non-null values
cf_bug_source          924624  non-null values
cf_viss                725  non-null values
dtypes: float64(5), int64(19), object(28)

>After 'convert_objects()':

dtypes: datetime64[ns](2), float64(5), int64(19), object(26)

>Putting the converted DataFrame into the HDFStore took 749.50 s
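With 26 columns still of object dtype after the conversion, it may help to check which Python types those columns actually hold, since the ptdump output above shows the object block stored as a VLArray with an ObjectAtom rather than as a typed array. A sketch only: `object_column_types` is a hypothetical helper, and the sample frame is a small stand-in that borrows a few column names from the table.

```python
import datetime

import pandas as pd


def object_column_types(frame):
    """Map each object-dtype column to the sorted Python type names it holds."""
    report = {}
    for name in frame.columns:
        if frame[name].dtype == object:
            report[name] = sorted({type(value).__name__ for value in frame[name]})
    return report


# Hypothetical stand-in frame; dtype=object is forced to mimic the post.
sample = pd.DataFrame({
    "bug_id": [1, 2],
    "bug_severity": pd.Series(["critical", "normal"], dtype=object),
    "deadline": pd.Series([None, None], dtype=object),
    "creation_ts": pd.Series(
        [datetime.datetime(1998, 5, 6, 21, 27, 0),
         datetime.datetime(2012, 5, 9, 14, 41, 41)],
        dtype=object,
    ),
})

types_by_column = object_column_types(sample)
for column, type_names in types_by_column.items():
    print(column, type_names)
```

Columns that report a single type such as `str` or `datetime` are candidates for conversion to a fixed-width or native dtype; columns that report `NoneType` only, or a mix of types, are the ones most likely to stay as generic objects.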
