Faster iteration over a 2D NumPy/CuPy array based on unique values

Question

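I am currently looping over a numpy array to slice it and run some ndarray operations on each slice. Because the array has 2001*2001 elements, this currently takes a very long time. I am hoping someone can give me a hint on how to speed up the code: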

import cupy as cp
from time import time

height,width = 187,746
org_sized = cp.random.rand(2001,2001) * 60

height_mat = cp.random.rand(height,width) * 100 # originally, values grow with the squared distance from (0,width//2) outwards

indices = cp.indices((height,width))
y_offsets = indices[0]
x_offsets = indices[1] - (width + 1)/2
angle_mat = cp.round_(2*(90 - cp.rad2deg(cp.arctan2(y_offsets,x_offsets))) + 180).astype(int)

weights = cp.random.rand(361)/ 10  # weights are originally larger in the middle

# pad the org_sized matrix with zeros to fit a size of (2001+height,2001+width)
west = cp.zeros((org_sized.shape[0],width // 2))
east = cp.zeros((org_sized.shape[0],width // 2))

enlarged_size = cp.hstack((west,org_sized))
enlarged_size = cp.hstack((enlarged_size,east))

south = cp.zeros((height,enlarged_size.shape[1]))

enlarged_size = cp.vstack((enlarged_size,south))

shadow_time_hrs = cp.zeros_like(org_sized)


for y in range(org_sized.shape[0]):
    start_time = time()
    for x in range(org_sized.shape[1]):
        # take the window of the padded array that matches height_mat and angle_mat in size and alignment
        short_elevations = enlarged_size[y:y+height,x:x+width]

        overshadowed = (short_elevations - org_sized[y,x]) > height_mat
        shadowed_angles = angle_mat * overshadowed
        shadowed_segments = cp.unique(shadowed_angles)
        angle_segments = shadowed_segments

        sum_hours = cp.sum(weights[angle_segments])
        shadow_time_hrs[y,x] = sum_hours
    if (y % 100) == 0:
        print(f"computation for line {y} took: {time() - start_time}.")

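At first I used numba's @njit on the function calc_shadow_point, but it turned out to be 2 times slower than without it. So I switched from numpy arrays to cupy arrays. That gave a speedup of about 50%, probably limited because the arrays are so small.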

Is there an approach other than iterating for this kind of problem, or is there a way to run the iteration with multithreading?

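Edit: I changed the code to a minimal example with the same runtime (1.1 s per line of org_sized). I have to increase the computation speed somehow. Anything below 10% of the current computation time would make the code usable. Following a comment, I changed np.unique to cp.unique, but, as the comment predicted, it did not lead to a large speedup, only about 6%. I am currently using a GTX 1060, but I may be able to get access to a 1660 Ti.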

Answer

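unique is slow (on both CPU and GPU) because it generally uses a hash map or a sort internally. Moreover, as you said, the arrays are too small to be processed efficiently on the GPU, which results in huge kernel overheads. Fortunately, you do not need unique: you can use bincount (with minlength=361 and a flattened array), because you know the values are small non-negative integers in the bounded range 0:361. In fact, you do not even need to count the values as bincount does; you only want to know which values of the range 0:361 occur in shadowed_angles. A faster, specialized replacement for bincount can therefore be written with Numba. In addition, the array operations can be fused into a single pass over the window, which reduces the number of allocations and the memory pressure. Finally, parallelism can be used to speed up the computation (using Numba's prange and parallel=True).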
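As an intermediate step, the substitution of bincount for unique can be sketched in plain NumPy. This is only an illustration for a single window: the random `angle_mat`, `overshadowed` and `weights` arrays below are stand-ins assumed for the example, not the real data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for one (y, x) window of the question's data
angle_mat = rng.integers(0, 361, size=(187, 746))   # integer angle bins in 0..360
overshadowed = rng.random((187, 746)) > 0.5          # boolean shadow mask
weights = rng.random(361) / 10

shadowed_angles = angle_mat * overshadowed           # masked-out cells become bin 0

# Reference: unique, as in the question
ref = weights[np.unique(shadowed_angles)].sum()

# Alternative: count occurrences per bin, then keep the bins seen at least once
counts = np.bincount(shadowed_angles.ravel(), minlength=361)
present = counts > 0
alt = weights[present].sum()

print(np.isclose(ref, alt))
```

This variant avoids the sort or hash map, but it still allocates temporary arrays for every window.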

Here is the resulting CPU-based implementation:

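import numpy as np
import numba as nb
from time import time

# All arrays below must be NumPy arrays (build them with np instead of cp),
# and angle_mat must have an integer dtype.

@nb.njit
def computeSumHours(org_sized,enlarged_size,angle_mat,height_mat,weights,y,x):
    height,width = height_mat.shape
    short_elevations = enlarged_size[y:y+height,x:x+width]
    # One slot per angle bin; a slot holds its weight once the bin has been seen
    shadowed_segments = np.zeros(361)

    for y2 in range(height):
        for x2 in range(width):
            overshadowed = (short_elevations[y2,x2] - org_sized[y,x]) > height_mat[y2,x2]
            # Cells that are not overshadowed map to bin 0, as in the reference code
            shadowed_angle = angle_mat[y2,x2] * overshadowed
            shadowed_segments[shadowed_angle] = weights[shadowed_angle]

    return shadowed_segments.sum()

@nb.njit(parallel=True)
def computeLine(org_sized,enlarged_size,angle_mat,height_mat,weights,shadow_time_hrs,y):
    # Each x is independent, so the points of one line are processed in parallel
    for x in nb.prange(org_sized.shape[1]):
        shadow_time_hrs[y,x] = computeSumHours(org_sized,enlarged_size,angle_mat,height_mat,weights,y,x)

def computeAllLines(org_sized,enlarged_size,angle_mat,height_mat,weights,shadow_time_hrs):
    for y in range(org_sized.shape[0]):
        start_time = time()
        computeLine(org_sized,enlarged_size,angle_mat,height_mat,weights,shadow_time_hrs,y)
        if (y % 100) == 0:
            print("Computation for line %d took: %f." % (y,time() - start_time))

computeAllLines(org_sized,enlarged_size,angle_mat,height_mat,weights,shadow_time_hrs)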
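The key line in computeSumHours is the assignment into shadowed_segments: writing the same weight into the same slot several times changes nothing, so each angle bin contributes at most once to the final sum. A minimal sketch of that idea follows, written in plain Python without Numba so that it runs anywhere (and therefore far slower). The helper name `sum_unique_weights` and the tiny arrays are made up for illustration.

```python
import numpy as np

def sum_unique_weights(angles, mask, weights):
    """Sum weights[a] once for each distinct value a of angles * mask."""
    seen = np.zeros(weights.shape[0])
    for a, m in zip(angles.ravel(), mask.ravel()):
        idx = int(a) * int(m)      # masked-out cells collapse to bin 0
        seen[idx] = weights[idx]   # overwriting is idempotent, so repeats are not double counted
    return seen.sum()

# Tiny hypothetical inputs: bins 3 and 7 occur twice, bin 2 is masked out
angles = np.array([[3, 3, 7], [7, 2, 5]])
mask = np.array([[True, True, False], [True, False, True]])
weights = np.arange(10) / 10.0

result = sum_unique_weights(angles, mask, weights)
expected = weights[np.unique(angles * mask)].sum()
print(np.isclose(result, expected))
```

The comparison against the unique-based expression is the same kind of check that can be run on a small grid before trusting the optimized code on the full one.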

Here are the timing results per iteration on my machine (with an i7-9600K and a GTX-1660-Super):

Reference implementation (CPU): 2.015 s
Reference implementation (GPU): 0.882 s
Optimized implementation (CPU): 0.082 s

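This is about 10 times faster than the GPU-based reference implementation and about 25 times faster than the CPU-based reference implementation.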
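The ratios, and a rough estimate of the total run time, can be reproduced from the timings above. The estimate assumes that one "iteration" is one line of org_sized and that every one of the 2001 lines takes the same time, which is a simplification.

```python
# Per-line timings reported above, in seconds
timings = {"reference_cpu": 2.015, "reference_gpu": 0.882, "optimized_cpu": 0.082}
n_lines = 2001  # rows of org_sized (assumed to take equal time each)

speedup_vs_gpu = timings["reference_gpu"] / timings["optimized_cpu"]
speedup_vs_cpu = timings["reference_cpu"] / timings["optimized_cpu"]

# Estimated wall time for the whole grid, in minutes
total_minutes = {name: t * n_lines / 60 for name, t in timings.items()}

print(f"speed-up vs GPU reference: {speedup_vs_gpu:.1f}x")
print(f"speed-up vs CPU reference: {speedup_vs_cpu:.1f}x")
for name, minutes in total_minutes.items():
    print(f"{name}: about {minutes:.1f} min for the full grid")
```

Under those assumptions the full grid would take roughly 67 minutes with the CPU reference, 29 minutes with the GPU reference, and under 3 minutes with the optimized version, which is about 9% of the GPU reference time on this machine.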

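Note that the same technique can be used on the GPU, but not with CuPy's array operations alone: you would need to write a GPU kernel for it (for example with CUDA). Doing that efficiently, however, is quite complex.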
