CUDA vectorize of a function with a complex input and a complex output fails in Numba

Problem description

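I used a program to plot the Mandelbrot set, and used njit to make it run on CPU threads. Now I want to generate a 32k image, but even using all the threads is too slow. So I tried to make the code run on the GPU. Here is the code: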

from numba import njit,cuda,vectorize
from PIL import Image,ImageDraw


@vectorize(['complex128(complex128)'],target='cuda')
def mandelbrot(c):

    z = 0
    n = 0
    while abs(z) <= 2 and n < 80:
        z = z*z + c
        n += 1
    return n


def vari(WIDTH,HEIGHT,RE_START,RE_END,IM_START,IM_END,draw):

    for x in range(0,WIDTH):

        for y in range(0,HEIGHT):

            print(x)
            # Convert pixel coordinate to complex number
            c = complex(RE_START + (x / WIDTH) * (RE_END - RE_START),IM_START + (y / HEIGHT) * (IM_END - IM_START))
            # Compute the number of iterations
            m = mandelbrot(c)
            # The color depends on the number of iterations
            color = 255 - int(m * 255 / 80)
            # Plot the point
            draw.point([x,y],(color,color,color))


def vai():
    # Image size (pixels)
    WIDTH = 15360
    HEIGHT = 8640

    # Plot window
    RE_START = -2
    RE_END = 1
    IM_START = -1
    IM_END = 1

    palette = []

    im = Image.new('RGB',(WIDTH,HEIGHT),(0,0,0))
    draw = ImageDraw.Draw(im)
    vari(WIDTH,HEIGHT,RE_START,RE_END,IM_START,IM_END,draw)

    im.save('output.png','PNG')

vai()

Here is the error:

D:\anaconda\python.exe C:/Users/techguy/PycharmProjects/mandelbrot/main.py
0
Traceback (most recent call last):
  File "C:/Users/techguy/PycharmProjects/mandelbrot/main.py",line 56,in <module>
    vai()
  File "C:/Users/techguy/PycharmProjects/mandelbrot/main.py",line 52,in vai
    vari(WIDTH,HEIGHT,RE_START,RE_END,IM_START,IM_END,draw )
  File "C:/Users/techguy/PycharmProjects/mandelbrot/main.py",line 30,in vari
    m = mandelbrot(c)
  File "D:\anaconda\lib\site-packages\numba\cuda\dispatcher.py",line 41,in __call__
    return CUDAUFuncmechanism.call(self.functions,args,kws)
  File "D:\anaconda\lib\site-packages\numba\np\ufunc\deviceufunc.py",line 301,in call
    cr.launch(func,shape[0],stream,devarys)
  File "D:\anaconda\lib\site-packages\numba\cuda\dispatcher.py",line 152,in launch
    func.forall(count,stream=stream)(*args)
  File "D:\anaconda\lib\site-packages\numba\cuda\compiler.py",line 372,in __call__
    kernel = self.kernel.specialize(*args)
  File "D:\anaconda\lib\site-packages\numba\cuda\compiler.py",line 881,in specialize
    specialization = dispatcher(self.py_func,[types.void(*argtypes)],
  File "D:\anaconda\lib\site-packages\numba\cuda\compiler.py",line 808,in __init__
    self.compile(sigs[0])
  File "D:\anaconda\lib\site-packages\numba\cuda\compiler.py",line 935,in compile
    kernel.bind()
  File "D:\anaconda\lib\site-packages\numba\cuda\compiler.py",line 576,in bind
    self._func.get()
  File "D:\anaconda\lib\site-packages\numba\cuda\compiler.py",line 446,in get
    ptx = self.ptx.get()
  File "D:\anaconda\lib\site-packages\numba\cuda\compiler.py",line 414,in get
    arch = nvvm.get_arch_option(*cc)
  File "D:\anaconda\lib\site-packages\numba\cuda\cudadrv\nvvm.py",line 345,in get_arch_option
    return 'compute_%d%d' % arch
TypeError: not enough arguments for format string

Process finished with exit code 1

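If I use @njit(nogil=True) instead of @vectorize, it works fine, but it runs on the CPU. I absolutely need it to run on the GPU. I think the problem has something to do with the complex type.
What is wrong?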

The code is not mine: I found it at How to plot the Mandelbrot set.

I only modified some parts.

Here is a minimal reproducible example:

from numba import  cuda,vectorize

@vectorize(['int32(complex128)'],target='cuda')
def mandelbrot(c):

    z = 0
    n = 0
    while abs(z) <= 2 and n < 80:
        z = z*z + c
        n += 1
    return n

comple = complex(10,12)
print(mandelbrot(comple))

Solution

You are showing a lack of very basic understanding of what vectorize does, let alone CUDA. Before you look at this answer, you should read this: https://numba.pydata.org/numba-doc/dev/user/vectorize.html

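You seem to be missing basic information. For example, what does vectorization generally mean outside the context of numba? Vector means that we are running a SIMD operation on some array (a.k.a. vector) input. Look at your code: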

@vectorize(['complex128(complex128)'],target='cuda')
def mandelbrot(c):

    z = 0
    n = 0
    while abs(z) <= 2 and n < 80:
        z = z*z + c
        n += 1
    return n

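When you add that decorator, you convert this function into a vectorized version. Without the decorator, it takes a scalar value, i.e. a single complex value. Once you convert it, mandelbrot expects a vector value, so that every value can be processed in parallel. So can you spot how the function you just created is being massively misused here?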

def vari(WIDTH,HEIGHT,RE_START,RE_END,IM_START,IM_END,draw):

    for x in range(0,WIDTH):

        for y in range(0,HEIGHT):

            print(x)
            # Convert pixel coordinate to complex number
            c = complex(RE_START + (x / WIDTH) * (RE_END - RE_START),IM_START + (y / HEIGHT) * (IM_END - IM_START))
            # Compute the number of iterations
            m = mandelbrot(c)
            # The color depends on the number of iterations
            color = 255 - int(m * 255 / 80)
            # Plot the point
            draw.point([x,y],(color,color,color))

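Your mandelbrot function is operating on scalar values inside a loop. In other words, you are using the vectorized function in the worst possible way, and incorrectly. Look at this converted code: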

import numpy as np

def vari(WIDTH,HEIGHT,RE_START,RE_END,IM_START,IM_END,draw):

    # Build the full grid of complex coordinates before calling mandelbrot
    complex_mat = np.empty((HEIGHT,WIDTH),dtype=np.complex128)
    for x in range(0,WIDTH):
        for y in range(0,HEIGHT):
            print(x)
            # Convert pixel coordinate to complex number
            c = complex(RE_START + (x / WIDTH) * (RE_END - RE_START),IM_START + (y / HEIGHT) * (IM_END - IM_START))
            complex_mat[y,x] = c


    # Compute the number of iterations for every pixel in a single call
    m = mandelbrot(complex_mat)
    for x in range(0,WIDTH):
        for y in range(0,HEIGHT):
            # The color depends on the number of iterations
            color = 255 - int(m[y,x] * 255 / 80)
            # Plot the point
            draw.point([x,y],(color,color,color))

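We first create the "vector" to feed into the "vectorized function". In this case any numpy array should do; the function is simply applied element-wise and the output has the same shape as the input.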
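As a rough illustration of handing a whole array to an element-wise function, the sketch below builds the grid of complex coordinates with NumPy broadcasting instead of nested Python loops, then computes an escape-time count for every element at once. It is a CPU-only NumPy stand-in, not the numba CUDA ufunc from the question; the names make_grid and escape_counts and the tiny 6x4 image size are assumptions made for the example.

```python
import numpy as np

def make_grid(width, height, re_start, re_end, im_start, im_end):
    """Return a (height, width) complex128 array of pixel coordinates."""
    # Same mapping as the loop version: start + (index / size) * span
    re = re_start + (np.arange(width) / width) * (re_end - re_start)
    im = im_start + (np.arange(height) / height) * (im_end - im_start)
    # Broadcast a row of real parts against a column of imaginary parts
    return re[np.newaxis, :] + 1j * im[:, np.newaxis]

def escape_counts(c, max_iter=80):
    """Element-wise escape-time count with the same shape as c."""
    z = np.zeros_like(c)
    n = np.zeros(c.shape, dtype=np.int32)
    for _ in range(max_iter):
        active = np.abs(z) <= 2          # points that have not escaped yet
        if not active.any():
            break
        z[active] = z[active] * z[active] + c[active]
        n[active] += 1
    return n

grid = make_grid(6, 4, -2, 1, -1, 1)
counts = escape_counts(grid)
print(grid.shape, counts.shape)
```

The output array has the same shape as the input grid, which is the property the vectorized mandelbrot call relies on.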

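Now you will still find that this code is slow. Again, there is another very basic lack of understanding here, which shows a lack of prior research. I suggest you benchmark this code, and do so before you ask SO for advice on how to improve its speed. You will probably find that it is not even the "mandelbrot" code that is directly responsible for the slowdown. Everything else you do is still serialized. You need to move the complex-number generation, the Mandelbrot computation and the point generation onto the GPU. I am not sure how to do that with numba, and it is far beyond the scope of your question, but this may be useful: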

https://github.com/numba/numba/issues/4309
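The benchmarking suggestion above could be approached with something like the following sketch, which times coordinate generation, iteration and color mapping separately using time.perf_counter. It runs in plain Python on a small 64x36 image, so the relative timings it prints say nothing about the GPU version; the function name timed_stages and the stage labels are assumptions made for the example.

```python
import time

def mandelbrot(c, max_iter=80):
    """Scalar escape-time count, same logic as the function in the question."""
    z = 0j
    n = 0
    while abs(z) <= 2 and n < max_iter:
        z = z * z + c
        n += 1
    return n

def timed_stages(width, height):
    """Time coordinate generation, iteration and color mapping separately."""
    timings = {}

    t0 = time.perf_counter()
    coords = [complex(-2 + (x / width) * 3, -1 + (y / height) * 2)
              for y in range(height) for x in range(width)]
    timings["coordinates"] = time.perf_counter() - t0

    t0 = time.perf_counter()
    counts = [mandelbrot(c) for c in coords]
    timings["iteration"] = time.perf_counter() - t0

    t0 = time.perf_counter()
    colors = [255 - int(m * 255 / 80) for m in counts]
    timings["coloring"] = time.perf_counter() - t0

    return timings, colors

timings, colors = timed_stages(64, 36)
for stage, seconds in timings.items():
    print(f"{stage:12s} {seconds:.6f} s")
```

Timing each stage on its own shows where the wall-clock time actually goes before any of the work is moved to the GPU.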

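It looks like you will want to use the built-in CUDA parallelization tools rather than vectorize, to make sure you do not have to pass useless data to the GPU (i.e. you just iterate over the pixels you need to generate values for, instead of passing the pixel indices to CUDA).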
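The idea of deriving each value from its pixel index, rather than shipping an array of coordinates, can be sketched in plain Python as below. This is not a CUDA kernel: the render loop is a CPU stand-in for the launch grid, and in a numba.cuda.jit kernel the indices would typically come from cuda.grid(2) instead. The names pixel_count and render, and the default plot window, are assumptions made for the example.

```python
def pixel_count(ix, iy, width, height,
                re_start=-2.0, re_end=1.0, im_start=-1.0, im_end=1.0,
                max_iter=80):
    """Escape-time count for one pixel, derived from its indices alone."""
    # The coordinate is computed on the fly, so no input array is needed
    c = complex(re_start + (ix / width) * (re_end - re_start),
                im_start + (iy / height) * (im_end - im_start))
    z = 0j
    n = 0
    while abs(z) <= 2 and n < max_iter:
        z = z * z + c
        n += 1
    return n

def render(width, height):
    # CPU stand-in for the launch grid: on a GPU, each (ix, iy) pair
    # would be handled by its own thread rather than by this loop.
    return [[pixel_count(ix, iy, width, height) for ix in range(width)]
            for iy in range(height)]

image = render(6, 4)
print(image[2][4])
```

With this shape of function, only the output counts would need to travel back from the device; the inputs are a handful of scalars.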

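Besides passing large amounts of data back and forth between the CPU and the GPU, another reason the code is slow is the use of complex128. GPUs sometimes do not have "fast" double precision; Nvidia in particular tends to throttle double-precision performance on consumer GPUs, to the point where doubles can run at 1/32 the speed of single-precision floats. This is relevant because complex128 is really 2 doubles glued together. complex64 will probably give better speed. You will probably not run into problems from the lower precision in this experiment; precision errors show up when you zoom into the Mandelbrot set. There are techniques that work around this by seamlessly "wrapping" the function that computes the Mandelbrot set to prevent those artifacts. However, that is beyond the scope of this question.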
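To make the precision trade-off concrete, the sketch below checks whether two neighbouring pixel coordinates remain distinct once rounded to float32, which is the precision of each half of a complex64. The center point -0.75, the zoomed span of 3.0e-4 and the helper names are assumptions chosen for illustration; only the 15360-pixel width comes from the question.

```python
import struct

def to_float32(x):
    """Round a Python float (float64) to the nearest float32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]

def adjacent_pixels_distinct(center, span, width):
    """Report whether two neighbouring pixel coordinates stay distinct."""
    step = span / width              # distance between neighbouring pixels
    a, b = center, center + step
    return {"float64": a != b, "float32": to_float32(a) != to_float32(b)}

# Full plot window from the question: 3 units wide across 15360 pixels
full_view = adjacent_pixels_distinct(-0.75, 3.0, 15360)
# Hypothetical zoom: the same pixel count across a window 10000 times narrower
deep_zoom = adjacent_pixels_distinct(-0.75, 3.0e-4, 15360)
print(full_view)
print(deep_zoom)
```

At the full window both precisions keep neighbouring pixels apart; in the zoomed case the float32 coordinates collapse onto the same value while float64 still separates them.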

Finally, when I ran the modified code, it ran fine. In other words, I did not get the

  File "D:\anaconda\lib\site-packages\numba\cuda\cudadrv\nvvm.py",line 345,in get_arch_option
    return 'compute_%d%d' % arch
TypeError: not enough arguments for format string

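error. If you still get this error when running my modified version, then you have some other configuration error which, because of the lack of research, is too broad and out of scope for this question. For example, it could be related to "did you install CUDA", but without a more focused question we cannot know. Here is the output I generated (smaller, so that it fits SO's size requirements). Note that I did not replace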

@vectorize(['complex128(complex128)'],target='cuda')

with

@vectorize(['int32(complex128)'],target='cuda')

and that this is not the right solution for your problem. This again points to some user-specific configuration error.
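One way to narrow down a configuration problem of this kind is to ask numba whether it can see a CUDA device at all. The sketch below wraps numba.cuda.is_available() so that it also reports a missing or broken numba install instead of raising; the function name cuda_status and the wording of the returned messages are assumptions made for the example.

```python
def cuda_status():
    """Return a short description of whether numba can see a CUDA device."""
    try:
        from numba import cuda
    except Exception as exc:          # numba missing or broken install
        return f"numba could not be imported: {exc}"
    try:
        available = cuda.is_available()
    except Exception as exc:          # driver or toolkit problems surface here
        return f"CUDA check failed: {exc}"
    return "CUDA device available" if available else "no usable CUDA device"

status = cuda_status()
print(status)
```

If the check does not report an available device, the problem lies in the installation rather than in the decorated function.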

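The problem was solved by replacing

@vectorize(['complex128(complex128)'],target='cuda')

with

@vectorize(['int32(complex128)'],target='cuda')

That does not mean the performance is better: it is worse. I think this is because the program is not parallelizable. The only thing that makes the performance better is using

@njit(nogil=True)

The real problem was that I did not have cudatoolkit installed. I am using Anaconda, so it was a simple fix.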
