Problem description
I'm trying to add random vectors with Numba's vectorize on the GPU versus the CPU.
Here is my example:
import numpy as np
from timeit import default_timer as timer
from numba import vectorize
TARGET = 'cpu'
#TARGET = 'cuda'
@vectorize(["float64(float64,float64)"],target=TARGET)
def VectorAdd(a, b):
    return a + b

def main():
    N = 32_000_000
    A = np.random.randn(N)
    B = np.random.randn(N)
    C = np.zeros(N, dtype=np.float64)

    print("Target unit: {},number: {}".format(TARGET, N))

    start = timer()
    C = VectorAdd(A, B)
    vADD_time = timer() - start

    print("C[:5] = " + str(C[:5]))
    print("C[-5:] = " + str(C[-5:]))
    print("Time: {}".format(vADD_time))

if __name__ == "__main__":
    main()
The CPU is about 30× faster than CUDA. What am I doing wrong? I expected CUDA to be faster.
Target unit: cuda,number: 32000000
C[:5] = [ 1.90362553 -2.6426849 -1.84243752 -0.00806387 0.63785922]
C[-5:] = [ 0.93794028 0.98118905 0.80945834 0.64350251 -1.62342203]
Time: 17.02285827000003
Target unit: cpu,number: 32000000
C[:5] = [ 0.77441334 0.35994057 -0.15359408 -0.20547891 -2.04108084]
C[-5:] = [1.47338646 3.01013048 0.71417303 1.62773266 2.80878941]
Time: 0.5268858470000168
Solution
The operation you are performing is too simple to exploit the parallelism a GPU offers; instead, you only lose performance to the overhead of transferring memory between host and device.
Try running the following code, which moves the data to the device manually so that the time spent on data transfer is not included in the measurement.
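To get a feel for how large that overhead is, here is a back-of-envelope estimate of the data volume a single call has to move across the PCIe bus. The ~6 GB/s effective bandwidth figure is an assumption (roughly PCIe 3.0); substitute the value for your own system.

```python
# Rough estimate of host<->device transfer cost for this workload.
N = 32_000_000
bytes_per_float64 = 8

# Two input arrays are copied to the device and one result array comes back.
total_bytes = 3 * N * bytes_per_float64

pcie_bandwidth = 6e9  # bytes/s -- assumed effective PCIe 3.0 bandwidth
transfer_time = total_bytes / pcie_bandwidth

print(f"Data moved: {total_bytes / 1e9:.2f} GB")
print(f"Estimated transfer time: {transfer_time * 1000:.0f} ms")
```

For a kernel that does a single addition per element, this transfer time alone is significant compared to the CPU doing the whole job in memory, which is why timing the transfers hides any GPU advantage.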
import numpy as np
from timeit import default_timer as timer
from numba import (vectorize,cuda)
# TARGET = 'cpu'
TARGET = 'cuda'
@vectorize(["float64(float64,float64)"],target=TARGET)
def VectorAdd(a, b):
    return a + b

def main():
    N = 32_000_000
    A = np.random.randn(N)
    B = np.random.randn(N)
    C = np.zeros(N, dtype=np.float64)

    # Copy the inputs to the GPU before the timer starts.
    A = cuda.to_device(A)
    B = cuda.to_device(B)

    print("Target unit: {},number: {}".format(TARGET, N))

    start = timer()
    C = VectorAdd(A, B)
    vADD_time = timer() - start

    # Copy the result back to the host after the timer stops.
    C = C.copy_to_host()

    print("C[:5] = " + str(C[:5]))
    print("C[-5:] = " + str(C[-5:]))
    print("Time: {}".format(vADD_time))

if __name__ == "__main__":
    main()
Also, I suggest increasing the amount of work per call (or the number of iterations) to see the GPU speedup.