Why is my GPU slower than my CPU in matrix operations?

Problem description

CPU: i7-9750 @ 2.6GHz (with 16G DDR4 RAM); GPU: Nvidia Geforce GTX 1600 TI (6G); OS: Windows 10-64bit

I wanted to see how much faster the GPU is than the CPU at basic matrix operations, and I basically followed https://towardsdatascience.com/heres-how-to-use-cupy-to-make-numpy-700x-faster-4b920dda1f56. Here is my super simple code:

import numpy as np
import cupy as cp
import time

### Numpy and cpu
s = time.time()
A = np.random.random([10000,10000]); B = np.random.random([10000,10000])
cpu = np.matmul(A,B); cpu *= 5
e = time.time()
print(f'cpu time: {e - s: .2f}')

### CuPy and GPU
s = time.time()
C= cp.random.random([10000,10000]); D = cp.random.random([10000,10000])
GPU = cp.matmul(C,D); GPU *= 5
cp.cuda.Stream.null.synchronize()  
# to let the code finish executing on the GPU before calculating the time
e = time.time()
print(f'GPU time: {e - s: .2f}')

Ironically, it shows cpu time: 11.74, GPU time: 12.56

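This really confuses me. How can a GPU be slower than a CPU on large matrix operations? Note that I have not even applied parallel computing (I am a beginner, and I am not sure whether the system turns it on for me). I did check similar questions such as "Why is my CPU doing matrix operations faster than GPU instead?", but here I am using cupy rather than mxnet (cupy is newer and designed for GPU computing).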

Can someone help? I would really appreciate it!

Solution

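np.random.random and cp.random.random both generate 64-bit (double-precision) floats by default, so the code above runs entirely in float64. Consumer GeForce GPUs execute double-precision arithmetic at a small fraction of their single-precision rate, which is why the GPU shows no advantage here. To see what the GPU can do, pass dtype=cp.float32 when generating the random numbers on the GPU, as follows (for a strict apples-to-apples comparison, cast the NumPy arrays to float32 as well):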

C = cp.random.random([10000,10000], dtype=cp.float32)
D = cp.random.random([10000,10000], dtype=cp.float32)
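The default dtypes can be checked directly rather than taken on trust. This is a minimal sketch, assuming NumPy is installed; the CuPy part only runs if cupy imports and a CUDA device is visible, and the 4x4 shape is an arbitrary choice for illustration:

```python
import numpy as np

# Default dtype produced by NumPy's random generator.
a = np.random.random((4, 4))
print("numpy default:", a.dtype)

# Cast to single precision for a like-for-like comparison with a float32 GPU run.
a32 = a.astype(np.float32)
print("numpy cast:   ", a32.dtype)

# CuPy is optional here: skip the GPU part if it is missing or no device is usable.
try:
    import cupy as cp
    cp.cuda.runtime.getDeviceCount()
except Exception:
    cp = None

if cp is not None:
    c = cp.random.random((4, 4))                      # default dtype
    c32 = cp.random.random((4, 4), dtype=cp.float32)  # explicit single precision
    print("cupy default: ", c.dtype, "| cupy float32:", c32.dtype)
```

Printing the dtype on both sides before timing anything is a cheap way to confirm that the two runs are doing the same amount of work per element.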

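My hardware (both CPU and GPU) is different from yours, but once that change is made, the GPU version is about 12x faster than the CPU version. Generating both random ndarrays, the matrix multiplication, and the scalar multiplication take less than one second in total with cupy.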
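To reproduce this kind of comparison on other hardware, it helps to time float32 and float64 separately and to keep one-time setup cost out of the measurement. The following is a rough sketch, not the code used for the numbers above: time_matmul is a hypothetical helper, n is kept small so it runs quickly (the question used 10000), and unlike the question's code it times only the multiplication and scaling, not the random number generation.

```python
import time
import numpy as np


def time_matmul(xp, n, dtype, sync=lambda: None, repeats=3):
    """Return the best wall-clock time in seconds for an n x n matmul plus scaling.

    xp is the array module (numpy or cupy); sync blocks until queued work is done.
    """
    a = xp.random.random((n, n)).astype(dtype)
    b = xp.random.random((n, n)).astype(dtype)
    xp.matmul(a, b)  # warm-up run, so one-time setup cost is not timed
    sync()
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        out = xp.matmul(a, b)
        out *= 5
        sync()  # GPU work is asynchronous: wait before stopping the clock
        best = min(best, time.perf_counter() - start)
    return best


n = 512  # assumption: small size for a quick run; increase to match the question
results = {}
for dtype in (np.float32, np.float64):
    results[("numpy", np.dtype(dtype).name)] = time_matmul(np, n, dtype)

# CuPy is optional here: skip the GPU part if it is missing or no device is usable.
try:
    import cupy as cp
    cp.cuda.runtime.getDeviceCount()
except Exception:
    cp = None

if cp is not None:
    gpu_sync = cp.cuda.Stream.null.synchronize
    for dtype in (np.float32, np.float64):
        results[("cupy", np.dtype(dtype).name)] = time_matmul(cp, n, dtype, gpu_sync)

for (lib, name), seconds in results.items():
    print(f"{lib:5s} {name}: {seconds:.4f} s")
```

Comparing the two cupy rows shows how much of the GPU's slowness comes from double precision, and comparing rows with the same dtype gives the like-for-like CPU versus GPU figure.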
