How to generalize fast matrix multiplication on GPU with numba

Problem description

Lately I've been trying to get into GPU programming in Python using the Numba library. I have been reading up on it on their website using the tutorial there, and currently I'm stuck on their examples, which can be found at: https://numba.pydata.org/numba-doc/latest/cuda/examples.html. I'm attempting to generalize the fast matrix multiplication example a bit (which is of the form A * B = C). When testing, I noticed that matrices whose dimensions are not perfectly divisible by the number of threads per block (TPB) do not yield a correct answer.

I copied the code below from the example at https://numba.pydata.org/numba-doc/latest/cuda/examples.html and created a very small test case with 4x4 matrices. If I choose TPB = 2 everything works fine, but when I set TPB = 3 it goes wrong. I understand that the code writes outside the bounds of the matrices, but I am unable to prevent this from happening (I tried some if statements on ty + i * TPB and tx + i * TPB, but these did not work).

from numba import cuda,float32
import numpy as np
import math

@cuda.jit
def fast_matmul(A,B,C):
    # Define an array in the shared memory
    # The size and type of the arrays must be known at compile time
    sA = cuda.shared.array(shape=(TPB,TPB),dtype=float32)
    sB = cuda.shared.array(shape=(TPB,TPB),dtype=float32)

    x,y = cuda.grid(2)

    tx = cuda.threadIdx.x
    ty = cuda.threadIdx.y
    bpg = cuda.gridDim.x    # blocks per grid

    if x >= C.shape[0] and y >= C.shape[1]:
        # Quit if (x,y) is outside of valid C boundary
        return

    # Each thread computes one element in the result matrix.
    # The dot product is chunked into dot products of TPB-long vectors.
    tmp = 0.
    for i in range(bpg):
        # Preload data into shared memory
        sA[tx,ty] = A[x,ty + i * TPB]
        sB[tx,ty] = B[tx + i * TPB,y]

        # Wait until all threads finish preloading
        cuda.syncthreads()

        # Computes partial product on the shared memory
        for j in range(TPB):
            tmp += sA[tx,j] * sB[j,ty]

        # Wait until all threads finish computing
        cuda.syncthreads()

    C[x,y] = tmp



#%%

x_h = np.arange(16).reshape([4,4])
y_h = np.ones([4,4])
z_h = np.zeros([4,4])

x_d = cuda.to_device(x_h)
y_d = cuda.to_device(y_h)
z_d = cuda.to_device(z_h)

TPB = 3
threadsperblock = (TPB,TPB)
blockspergrid_x = math.ceil(z_h.shape[0] / threadsperblock[0])
blockspergrid_y = math.ceil(z_h.shape[1] / threadsperblock[1])
blockspergrid = (blockspergrid_x,blockspergrid_y)

fast_matmul[blockspergrid,threadsperblock](x_d,y_d,z_d)
z_h = z_d.copy_to_host()
print(z_h)

I would like to write some code that is not dependent on the matrices A, B and C having dimensions that are perfectly divisible by the TPB, as these are sometimes out of my control. I understand that GPUs are only faster than CPUs at matrix multiplication for very large matrices, but I wanted to use small examples so I can check whether the answer is correct before applying the method to actual data.
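For a small case like the 4x4 one above, the kind of check I have in mind is simply comparing against NumPy on the host, for example:

z_h = z_d.copy_to_host()
print(np.allclose(z_h, x_h @ y_h))  # should print True if the kernel is correct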

Solution

Arguably, there are at least two errors in that posted code:

  1. This cannot be a correct range check:

    if x >= C.shape[0] and y >= C.shape[1]:
    

    In order for us to decide that a particular thread in the grid should not perform any loading activity, we require either that x is out of range or that y is out of range. The and should have been an or.

  2. It is illegal to use cuda.syncthreads() in conditional code if all the threads in the block cannot participate in that statement. The previous return statement in item 1 above (even if corrected from and to or) pretty much guarantees this illegal behavior for problem sizes that are not whole-number divisible by the threadblock size.

Therefore, to fix these issues we cannot use just a simple return statement for an out-of-bounds thread. Instead, at the point of load, we must only allow threads to load if their computed global load indices (for A or B) are in-bounds (the shared indices are in-bounds by definition). Furthermore, when writing a result, we must only write computed results that are in-bounds for C.

The following code has those items fixed. It seems to work correctly for your given test case:

$ cat t49.py
from numba import cuda,float32
import numpy as np
import math

@cuda.jit
def fast_matmul(A,B,C):
    # Define an array in the shared memory
    # The size and type of the arrays must be known at compile time
    sA = cuda.shared.array(shape=(TPB,TPB),dtype=float32)
    sB = cuda.shared.array(shape=(TPB,TPB),dtype=float32)

    x,y = cuda.grid(2)

    tx = cuda.threadIdx.x
    ty = cuda.threadIdx.y
    bpg = cuda.gridDim.x    # blocks per grid

    # Each thread computes one element in the result matrix.
    # The dot product is chunked into dot products of TPB-long vectors.
    tmp = 0.
    for i in range(bpg):
        # Preload data into shared memory
        sA[tx,ty] = 0
        sB[tx,ty] = 0
        if x < A.shape[0] and (ty+i*TPB) < A.shape[1]:
          sA[tx,ty] = A[x,ty + i * TPB]
        if y < B.shape[1] and (tx+i*TPB) < B.shape[0]:
          sB[tx,ty] = B[tx + i * TPB,y]

        # Wait until all threads finish preloading
        cuda.syncthreads()

        # Computes partial product on the shared memory
        for j in range(TPB):
            tmp += sA[tx,j] * sB[j,ty]

        # Wait until all threads finish computing
        cuda.syncthreads()
    if x < C.shape[0] and y < C.shape[1]:
        C[x,y] = tmp



#%%

x_h = np.arange(16).reshape([4,4])
y_h = np.ones([4,4])
z_h = np.zeros([4,4])

x_d = cuda.to_device(x_h)
y_d = cuda.to_device(y_h)
z_d = cuda.to_device(z_h)

TPB = 3
threadsperblock = (TPB,TPB)
blockspergrid_x = math.ceil(z_h.shape[0] / threadsperblock[0])
blockspergrid_y = math.ceil(z_h.shape[1] / threadsperblock[1])
blockspergrid = (blockspergrid_x,blockspergrid_y)

fast_matmul[blockspergrid,threadsperblock](x_d,y_d,z_d)
z_h = z_d.copy_to_host()
print(z_h)
print(x_h@y_h)
$ cuda-memcheck python t49.py
========= CUDA-MEMCHECK
[[ 6.  6.  6.  6.]
 [22. 22. 22. 22.]
 [38. 38. 38. 38.]
 [54. 54. 54. 54.]]
[[ 6.  6.  6.  6.]
 [22. 22. 22. 22.]
 [38. 38. 38. 38.]
 [54. 54. 54. 54.]]
========= ERROR SUMMARY: 0 errors
$

(Note that the use of and here in the boundary tests is correct. Testing whether a set of indices is in-bounds is different, in a boolean sense, from testing whether a set of indices is out-of-bounds: in the in-bounds test, we require both indices to be in-bounds; in the out-of-bounds test, either index being out-of-bounds is disqualifying.)
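To make that boolean relationship concrete, here is a small host-side illustration (the specific numbers are hypothetical and not part of the code above); by De Morgan's law, the in-bounds guard is exactly the negation of the out-of-bounds test:

x, y = 5, 2          # hypothetical global thread indices
rows, cols = 4, 4    # hypothetical C.shape
in_bounds  = (x < rows) and (y < cols)    # guard used before writing C
out_bounds = (x >= rows) or (y >= cols)   # correct early-exit test (note the or)
print(in_bounds == (not out_bounds))      # True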

I'm not suggesting the above code is defect-free or suitable for any particular purpose. It is intended to demonstrate possible fixes for the issues I identified. Getting a shared-memory tiled matrix multiply to work in every imaginable configuration is non-trivial, as you have discovered, and I've not tested it beyond what is shown here. (For example, if you decided to make TPB larger than 32, you would run into other problems: a TPB x TPB block would then exceed the CUDA limit of 1024 threads per block. Also, the originally posted code is advertised only for square matrix multiplication, and it will not work in the general non-square case.)

As indicated above, neither the posted code nor the above code with the "fixes" will handle the general non-square case correctly. I believe some straightforward modifications will allow us to handle the non-square case. In a nutshell, we must size the grid large enough to cover the dimensions of both input matrices, while still only writing results for the in-bounds values of the output matrix. Here is a lightly tested example:

$ cat t49.py
from numba import cuda,float32
import numpy as np
import math

@cuda.jit
def fast_matmul(A,B,C):
    # Define an array in the shared memory
    # The size and type of the arrays must be known at compile time
    sA = cuda.shared.array(shape=(TPB,TPB),dtype=float32)
    sB = cuda.shared.array(shape=(TPB,TPB),dtype=float32)

    x,y = cuda.grid(2)

    tx = cuda.threadIdx.x
    ty = cuda.threadIdx.y
    bpg = cuda.gridDim.x    # blocks per grid

    # Each thread computes one element in the result matrix.
    # The dot product is chunked into dot products of TPB-long vectors.
    tmp = 0.
    for i in range(bpg):
        # Preload data into shared memory
        sA[ty,tx] = 0
        sB[ty,tx] = 0
        if y < A.shape[0] and (tx+i*TPB) < A.shape[1]:
          sA[ty,tx] = A[y,tx + i * TPB]
        if x < B.shape[1] and (ty+i*TPB) < B.shape[0]:
          sB[ty,tx] = B[ty + i * TPB,x]

        # Wait until all threads finish preloading
        cuda.syncthreads()

        # Computes partial product on the shared memory
        for j in range(TPB):
            tmp += sA[ty,j] * sB[j,tx]

        # Wait until all threads finish computing
        cuda.syncthreads()
    if y < C.shape[0] and x < C.shape[1]:
        C[y,x] = tmp



#%%

x_h = np.arange(115).reshape([5,23])
y_h = np.ones([23,7])
z_h = np.zeros([5,7])

x_d = cuda.to_device(x_h)
y_d = cuda.to_device(y_h)
z_d = cuda.to_device(z_h)

#TPB must be an integer between 1 and 32
TPB = 32
threadsperblock = (TPB,TPB)
grid_y_max = max(x_h.shape[0],y_h.shape[0])
grid_x_max = max(x_h.shape[1],y_h.shape[1])
blockspergrid_x = math.ceil(grid_x_max / threadsperblock[0])
blockspergrid_y = math.ceil(grid_y_max / threadsperblock[1])
blockspergrid = (blockspergrid_x,blockspergrid_y)

fast_matmul[blockspergrid,threadsperblock](x_d,y_d,z_d)
z_h = z_d.copy_to_host()
print(z_h)
print(x_h@y_h)
$ cuda-memcheck python t49.py
========= CUDA-MEMCHECK
[[ 253.  253.  253.  253.  253.  253.  253.]
 [ 782.  782.  782.  782.  782.  782.  782.]
 [1311. 1311. 1311. 1311. 1311. 1311. 1311.]
 [1840. 1840. 1840. 1840. 1840. 1840. 1840.]
 [2369. 2369. 2369. 2369. 2369. 2369. 2369.]]
[[ 253.  253.  253.  253.  253.  253.  253.]
 [ 782.  782.  782.  782.  782.  782.  782.]
 [1311. 1311. 1311. 1311. 1311. 1311. 1311.]
 [1840. 1840. 1840. 1840. 1840. 1840. 1840.]
 [2369. 2369. 2369. 2369. 2369. 2369. 2369.]]
========= ERROR SUMMARY: 0 errors
$

I have also re-ordered the sense of x and y (and the usage of tx and ty) to fix a performance issue in the above code.
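As an illustration of what that re-ordering changes (a host-side sketch, under the assumption that the performance issue in question is coalescing of global memory loads), consider which elements of A a group of threads with consecutive tx values loads in the two kernel versions:

import numpy as np

TPB, i = 4, 0
A = np.arange(16).reshape(4,4)
row_stride = A.shape[1]                   # elements between consecutive rows

# earlier kernel: sA[tx,ty] = A[x, ty + i*TPB]  -> consecutive tx selects consecutive rows
addrs_v1 = [tx * row_stride + 0 for tx in range(TPB)]
# revised kernel: sA[ty,tx] = A[y, tx + i*TPB]  -> consecutive tx selects consecutive columns
addrs_v2 = [0 * row_stride + tx for tx in range(TPB)]

print(addrs_v1)   # [0, 4, 8, 12]  strided accesses (uncoalesced)
print(addrs_v2)   # [0, 1, 2, 3]   contiguous accesses (coalesced)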

Again, no claims of being defect-free. Furthermore, I'm sure "more optimal" code could be arrived at. However, optimizing matrix multiplication is an exercise that should fairly quickly lead to using a library implementation. Using cupy for the GPU approach is a fairly straightforward way to tap into a high-quality matrix multiply routine on the GPU.
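For completeness, here is a minimal sketch of that library approach (assuming cupy is installed; it reuses the non-square test case from above):

import numpy as np
import cupy as cp

x_h = np.arange(115).reshape([5,23]).astype(np.float64)
y_h = np.ones([23,7])

x_d = cp.asarray(x_h)      # copy the host arrays to the GPU
y_d = cp.asarray(y_h)
z_d = x_d @ y_d            # matrix multiply on the GPU via cupy
z_h = cp.asnumpy(z_d)      # copy the result back to the host

print(np.allclose(z_h, x_h @ y_h))   # True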