有没有一种更有效的方法可以根据列的内容从PyArrow表中选择行？

问题描述

我问a related question，这是一种基于列内容从PyArrow表中选择行的更惯用的方法。 @joris的答案看起来不错。但是，看起来仅在带有行掩码的PyArrow中选择行存在稀疏选择的性能问题。

是否有一种更有效的方法来执行此操作，但仅将其留在PyArrow中而不在PyArrow和numpy之间来回移动？

测试用例：

import pyarrow as pa
import pyarrow.parquet as pq
import pandas as pd
import numpy as np

# Example table for data schema:
# Alternating rows with index 0 and 1
irow = np.arange(2**20)
dt = 17
df0 = pd.DataFrame({'timestamp': np.array((irow//2)*dt,dtype=np.int64),'index':     np.array(irow%2,dtype=np.int16),'value':     np.array(irow*0,dtype=np.int32)},columns=['timestamp','index','value'])
ii = df0['index'] == 0
df0.loc[ii,'value'] = irow[ii]//2
ii = df0['index'] == 1
df0.loc[ii,'value'] = (np.sin(df0.loc[ii,'timestamp']*0.01)*10000).astype(np.int32)

# Insert rows with index 2 every 16 timestamps
irow = np.arange(10000)
subsample = 16
df1 = pd.DataFrame({'timestamp': np.array(irow*dt*subsample,'index':     np.full_like(irow,2,'value':     np.array(irow*irow,'value'],index=irow*subsample*2+1.5)
df2=pd.concat([df0,df1]).sort_index()
df2.index = pd.RangeIndex(len(df2))
print(df2)
table2 = pa.Table.from_pandas(df2)

# which prints:

         timestamp  index   value
0                0      0       0
1                0      1       0
2                0      2       0
3               17      0       1
4               17      1    1691
...            ...    ...     ...
1058571    8912845      1    9945
1058572    8912862      0  524286
1058573    8912862      1    9978
1058574    8912879      0  524287
1058575    8912879      1    9723

[1058576 rows x 3 columns]

验证索引= 2的内容是否稀疏：

print(df2[df2['index']==2])

# which prints

        timestamp  index     value
2               0      2         0
35            272      2         1
68            544      2         4
101           816      2         9
134          1088      2        16
...           ...    ...       ...
329837    2718640      2  99900025
329870    2718912      2  99920016
329903    2719184      2  99940009
329936    2719456      2  99960004
329969    2719728      2  99980001

[10000 rows x 3 columns]

和基准测试：

import time

# My method,which sloshes back and forth between PyArrow and numpy
def select_by_index_np(table,ival):
    value_index = table.column('index').to_numpy()
    row_indices = np.nonzero(value_index==ival)[0]
    return table.take(pa.array(row_indices))

# Stay in PyArrow: see https://stackoverflow.com/a/64579502/44330
def select_by_index(table,ival):
    value_index = table.column('index')
    index_type = value_index.type.to_pandas_dtype()
    mask = pc.equal(value_index,index_type(ival))
    return table.filter(mask)

def run_timing_test(table,ival,select_algorithm,nrep=100):
    t1 = time.time_ns()
    for _ in range(nrep):
        tsel = select_algorithm(table,ival)
    t2 = time.time_ns()
    print('%.0fus %20s(%s) -> %s' % 
          ((t2-t1)/1000/nrep,select_algorithm.__name__,tsel.column('value').to_numpy()))

run_timing_test(table2,select_by_index)
run_timing_test(table2,select_by_index_np)
run_timing_test(table2,1,select_by_index_np)

# which prints

7639us      select_by_index(0) -> [     0      1      2 ... 524285 524286 524287]
7780us   select_by_index_np(0) -> [     0      1      2 ... 524285 524286 524287]
7789us      select_by_index(1) -> [   0 1691 3334 ... 9945 9978 9723]
8204us   select_by_index_np(1) -> [   0 1691 3334 ... 9945 9978 9723]
3840us      select_by_index(2) -> [       0        1        4 ... 99940009 99960004 99980001]
1611us   select_by_index_np(2) -> [       0        1        4 ... 99940009 99960004 99980001]

这两种方法是可比当选择的行是表中的一个相当大的部分，但是当它们是非常小的，select_by_index_np，它乘警numpy的，确定索引，其中所述掩模的行是True，并且警回到PyArrow，速度更快！

除了留在PyArrow之外，是否有一种有效的方法？（我看不到任何与numpy.nonzero等效的pyarrow.compute）

解决方法

暂无找到可以解决该程序问题的有效方法，小编努力寻找整理中！

如果你已经找到好的解决方法，欢迎将解决方案带上本链接一起发送给小编。

小编邮箱:dio#foxmail.com (将#修改为@）

pyarrow python