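Problem description
I asked a related question about a more idiomatic way to select rows from a PyArrow table based on the contents of a column. @joris's answer looks good. However, selecting rows purely in PyArrow with a row mask appears to have a performance problem when the selection is sparse.
Is there a more efficient way to do this that stays in PyArrow, without moving back and forth between PyArrow and numpy?
Test case: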
import pyarrow as pa
import pyarrow.compute as pc
import pyarrow.parquet as pq
import pandas as pd
import numpy as np
# Example table for data schema:
# Alternating rows with index 0 and 1
irow = np.arange(2**20)
dt = 17
df0 = pd.DataFrame({'timestamp': np.array((irow//2)*dt,dtype=np.int64),'index': np.array(irow%2,dtype=np.int16),'value': np.array(irow*0,dtype=np.int32)},columns=['timestamp','index','value'])
ii = df0['index'] == 0
df0.loc[ii,'value'] = irow[ii]//2
ii = df0['index'] == 1
df0.loc[ii,'value'] = (np.sin(df0.loc[ii,'timestamp']*0.01)*10000).astype(np.int32)
# Insert rows with index 2 every 16 timestamps
irow = np.arange(10000)
subsample = 16
df1 = pd.DataFrame({'timestamp': np.array(irow*dt*subsample,dtype=np.int64),
                    'index': np.full_like(irow,2,dtype=np.int16),
                    'value': np.array(irow*irow,dtype=np.int32)},
                   columns=['timestamp','index','value'],
                   index=irow*subsample*2+1.5)
df2=pd.concat([df0,df1]).sort_index()
df2.index = pd.RangeIndex(len(df2))
print(df2)
table2 = pa.Table.from_pandas(df2)
# which prints:
         timestamp  index   value
0                0      0       0
1                0      1       0
2                0      2       0
3               17      0       1
4               17      1    1691
...            ...    ...     ...
1058571    8912845      1    9945
1058572    8912862      0  524286
1058573    8912862      1    9978
1058574    8912879      0  524287
1058575    8912879      1    9723

[1058576 rows x 3 columns]
Verify that the rows with index == 2 are sparse:
print(df2[df2['index']==2])
# which prints
        timestamp  index     value
2               0      2         0
35            272      2         1
68            544      2         4
101           816      2         9
134          1088      2        16
...           ...    ...       ...
329837    2718640      2  99900025
329870    2718912      2  99920016
329903    2719184      2  99940009
329936    2719456      2  99960004
329969    2719728      2  99980001

[10000 rows x 3 columns]
And the benchmark:
import time

# My method, which sloshes back and forth between PyArrow and numpy
def select_by_index_np(table,ival):
    value_index = table.column('index').to_numpy()
    row_indices = np.nonzero(value_index==ival)[0]
    return table.take(pa.array(row_indices))

# Stay in PyArrow: see https://stackoverflow.com/a/64579502/44330
def select_by_index(table,ival):
    value_index = table.column('index')
    index_type = value_index.type.to_pandas_dtype()
    mask = pc.equal(value_index,index_type(ival))
    return table.filter(mask)

def run_timing_test(table,ival,select_algorithm,nrep=100):
    # Average wall-clock time per call over nrep repetitions
    t1 = time.time_ns()
    for _ in range(nrep):
        tsel = select_algorithm(table,ival)
    t2 = time.time_ns()
    print('%.0fus %20s(%s) -> %s' %
          ((t2-t1)/1000/nrep,select_algorithm.__name__,ival,tsel.column('value').to_numpy()))

# One timing run per index value and per method
run_timing_test(table2,0,select_by_index)
run_timing_test(table2,0,select_by_index_np)
run_timing_test(table2,1,select_by_index)
run_timing_test(table2,1,select_by_index_np)
run_timing_test(table2,2,select_by_index)
run_timing_test(table2,2,select_by_index_np)
# which prints
7639us select_by_index(0) -> [ 0 1 2 ... 524285 524286 524287]
7780us select_by_index_np(0) -> [ 0 1 2 ... 524285 524286 524287]
7789us select_by_index(1) -> [ 0 1691 3334 ... 9945 9978 9723]
8204us select_by_index_np(1) -> [ 0 1691 3334 ... 9945 9978 9723]
3840us select_by_index(2) -> [ 0 1 4 ... 99940009 99960004 99980001]
1611us select_by_index_np(2) -> [ 0 1 4 ... 99940009 99960004 99980001]
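The two methods are comparable when the selected rows are a substantial fraction of the table, but when they are a very small fraction, select_by_index_np, which converts to numpy, finds the indices where the row mask is True, and converts back to PyArrow, is faster!
Is there an efficient way to do this while staying in PyArrow? (I don't see any pyarrow.compute equivalent of numpy.nonzero.)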