Problem description
I have Parquet files containing millions/billions of rows, and I'm trying to find a faster way to apply a function to, and query values in, these large tables. ipynb code sample for a 1M-row table:
from pyarrow import parquet as pq
from ttictoc import TicToc  # preferred time tracker
import random
import string

t = TicToc()

# Function to be applied to every row
def val_calc(x, n_cols):
    # Mapping process (note: the row x itself is never read below)
    abc_vals = string.ascii_uppercase[:n_cols]  # N alphabets in the df
    x_map = {i: random.randint(j, n_cols + 10) for i, j in zip(abc_vals, range(n_cols))}  # map dict
    # Calculations... Formula: Σ(Xi*Yi)/ΣYi
    y_vals = {i: j**2 for i, j in x_map.items()}  # Y's set of values to use
    weights = [x_val * y_val for x_val, y_val in zip(x_map.values(), y_vals.values())]  # (Xi*Yi)
    result = sum(weights) / sum(y_vals.values())  # Σ(Xi*Yi)/ΣYi
    return result
#Getting the parquet file
file_path = 'C:/XYZ/project/'
file_name = 'gx6c'
#pyarrow parquet -> pandas
large_pq = pq.read_table(file_path+file_name+'.pq').to_pandas()
#Number of columns - column per alphabet in a row:
n_columns = int(file_name.split('x')[-1].replace('c',''))
#-------------------------------------------------------Results and time taken
t.tic() #start time
#Function applied
large_pq['values'] = large_pq.apply(lambda x: val_calc(x,n_columns),axis=1)
t.toc() #end time
print(f'Time passed for applying function: {round(t.elapsed,5)} seconds')
display(large_pq)
#Querying part
t.tic()
queried = large_pq[large_pq['values'].between(12,13)]
t.toc()
print(f'Time passed for query: {round(t.elapsed,5)} seconds')
display(queried)
Output:
Time passed for applying function: 17.60126 seconds
abc values
0 AAAAAA 13.258228
1 AAAAAB 10.227642
2 AAAABA 11.264317
3 AAABAA 12.422303
4 AABAAA 13.537634
... ... ...
999995 JJIJJJ 12.620214
999996 JJJIJJ 11.323636
999997 JJJJIJ 10.756757
999998 JJJJJI 10.358811
999999 JJJJJJ 10.896328
1000000 rows × 2 columns
Time passed for query: 0.04801 seconds
abc values
3 AAABAA 12.422303
5 ABAAAA 12.062818
13 AAAAAD 12.762040
16 AADAAA 12.925373
25 AAAAAF 12.661267
... ... ...
999967 IJJJII 12.936667
999972 JIJIJI 12.331742
999986 JIJJJI 12.133333
999993 IJJJJJ 12.179487
999995 JJIJJJ 12.620214
284129 rows × 2 columns
Repeating the same operations on a 13M-row table with 17 letters (or 17 "columns") per row takes about 10 minutes, with the query step at 0.22524 seconds. For larger files I hit memory errors, so I can't reach the billion-row mark. Are there any workarounds to run these steps in a much shorter time frame, e.g. 10 seconds instead of minutes for a 13M-row table?
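One observation worth exploiting: val_calc never reads the row it receives, so the whole 'values' column can be produced in a single vectorized NumPy pass instead of a per-row apply. The sketch below is a minimal illustration, not the asker's code: val_calc_vectorized is a hypothetical helper name, and it assumes only that each letter slot draws from the same range as random.randint(j, n_cols + 10). Since y = x**2, the weighted mean Σ(Xi*Yi)/ΣYi reduces algebraically to Σx³/Σx².

```python
import numpy as np

def val_calc_vectorized(n_rows, n_cols, seed=None):
    """Vectorized stand-in for the per-row val_calc (hypothetical helper).

    random.randint(j, n_cols + 10) is inclusive on both ends, so the
    NumPy equivalent uses high = n_cols + 11 (exclusive upper bound).
    """
    rng = np.random.default_rng(seed)
    # Column k draws from [k, n_cols + 10], one value per (row, letter);
    # the low bound broadcasts across the row axis.
    x = rng.integers(low=np.arange(n_cols), high=n_cols + 11,
                     size=(n_rows, n_cols))
    # With y = x**2, Σ(x·y)/Σy simplifies to Σx³ / Σx² per row.
    return (x ** 3).sum(axis=1) / (x ** 2).sum(axis=1)

vals = val_calc_vectorized(1_000_000, 6, seed=0)
```

If this matches the intended semantics, the column can be assigned directly with large_pq['values'] = val_calc_vectorized(len(large_pq), n_columns); the query step (between) is already a vectorized boolean mask and needs no change.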