How can I efficiently use Dask to iterate a function over millions of arguments?

Problem description

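I am using the dask module to call a given function, processing, over a collection of arguments. A snippet of the script I am using is shown below.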

import dask
import pandas as pd
from dask import delayed, compute
from dask.distributed import Client, progress

# Start a local cluster, choosing the number of workers and threads per worker.
client = Client(threads_per_worker=2, n_workers=2)

# Load the coordinate columns as NumPy arrays.
csv_file = pd.read_csv('coordinates.csv')
longitude = csv_file['Longitude'].values
latitude = csv_file['Latitude'].values

def processing(x, y):
    '''
    '''
    # The body is omitted in the question; result is computed from x and y.
    return result

# Now calling the function in a 'dask' way: one lazy task per (x, y) pair.
lazy_results = []

for (x, y) in zip(longitude, latitude):
    lazy_result = dask.delayed(processing)(x, y)
    lazy_results.append(lazy_result)

# Computing the results
results = dask.compute(*lazy_results)

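For a small set of arguments (x, y), this runs correctly and speeds up the iteration as expected. However, I would like to know what the most efficient way is to do the same processing in dask for millions of arguments (that is, millions of x, y pairs in the code above). From a quick look at the documentation, it appears that the approach above (i.e. dask.delayed) is only efficient for inputs of up to about 100,000 arguments.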

Solution

No verified solution to this problem has been found yet.
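
One commonly suggested direction, given here only as a sketch and not as a verified answer to this question, is to cut down the number of tasks. Every dask.delayed call becomes a task with a small fixed scheduling overhead, so millions of tiny tasks can spend more time being scheduled than computing. Grouping the (x, y) pairs into batches and creating one task per batch keeps the task count manageable. The helper names process_batch, make_batches and compute_in_batches, the default batch size, and the x + y body of processing are all assumptions made for illustration; the real processing function is not shown in the question.

```python
from itertools import islice


def processing(x, y):
    # Hypothetical stand-in: the real per-point computation is not
    # shown in the question.
    return x + y


def process_batch(pairs):
    # Run processing over a whole batch of (x, y) pairs inside one task.
    return [processing(x, y) for x, y in pairs]


def make_batches(iterable, size):
    # Yield successive lists holding at most `size` items each.
    iterator = iter(iterable)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def compute_in_batches(xs, ys, batch_size=10_000):
    # Imported here so the helpers above stay usable without Dask installed.
    import dask

    # One delayed task per batch instead of one per (x, y) pair.
    tasks = [
        dask.delayed(process_batch)(chunk)
        for chunk in make_batches(zip(xs, ys), batch_size)
    ]
    nested = dask.compute(*tasks)
    # Flatten the per-batch lists back into one list, in input order.
    return [value for batch in nested for value in batch]
```

With the arrays from the question, this would be called as compute_in_batches(longitude, latitude). A suitable batch_size depends on how long processing takes per point: batches should be large enough that each task does substantially more work than its scheduling overhead, yet small enough that there are still many more tasks than worker threads.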


dask dask-delayed dask-distributed python