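How can I make parallel cudaMalloc calls fast?

Problem description

When allocating a large amount of memory on 4 different NVIDIA V100 GPUs, I observe the following behavior when parallelizing with OpenMP:

With the #pragma omp parallel for directive, so that the cudaMalloc calls are issued in parallel, one per GPU, the performance is the same as running them fully serially. I tested this on two HPC systems and saw the same effect on both: IBM Power AC922 and AWS EC2 p3dn.24xlarge. (The numbers below were obtained on the Power machine.)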

./test 4000000000

# serial

GPU 0: 0.472018550
GPU 1: 0.325776811
GPU 2: 0.334342752
GPU 3: 0.337432169
total: 1.469773541

# parallel

GPU 0: 1.199741600
GPU 2: 1.200597044
GPU 3: 1.200619017
GPU 1: 1.482700315
total: 1.493352924

How can I make the parallel version faster?

Here is my code:

#include <chrono>
#include <iomanip>
#include <iostream>
#include <string>

#include <cuda_runtime.h>

int main(int argc, char* argv[]) {
  if (argc < 2) {
    std::cerr << "usage: " << argv[0] << " <num_elements>" << std::endl;
    return 1;
  }
  // Number of int elements to allocate on each GPU.
  size_t num_elements = std::stoull(argv[1]);

  auto t0s = std::chrono::high_resolution_clock::now();
  // One loop iteration per GPU, distributed across OpenMP threads.
  #pragma omp parallel for
  for (int i = 0; i < 4; ++i)
  {
    auto t0is = std::chrono::high_resolution_clock::now();

    // Select GPU i for this thread, then allocate on it.
    cudaSetDevice(i);
    int* ptr;
    cudaMalloc((void**)&ptr, sizeof(int) * num_elements);

    auto t1is = std::chrono::high_resolution_clock::now();
    std::cout << "GPU " << i << ": " << std::fixed << std::setprecision(9)
              << std::chrono::duration<double>(t1is - t0is).count() << std::endl;
  }

  auto t1s = std::chrono::high_resolution_clock::now();
  std::cout << "total: " << std::fixed << std::setprecision(9)
            << std::chrono::duration<double>(t1s - t0s).count() << std::endl;

  return 0;
}

You can compile the microbenchmark with:

nvcc -std=c++11 -Xcompiler -fopenmp -O3 test.cu -o test

I also tried std::thread instead of OpenMP and got the same results.
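For reference, the std::thread variant mentioned above might look roughly like the sketch below. It is not the original code, which was not posted. The helper name time_per_device is hypothetical, and the per-device work is passed in as a callable so that the sketch compiles without the CUDA toolkit. In the actual benchmark the callable would presumably call cudaSetDevice(i) followed by cudaMalloc.

```cpp
#include <chrono>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical helper (not from the original post): runs task(device) on one
// std::thread per device and returns the elapsed seconds for each device.
std::vector<double> time_per_device(int num_devices,
                                    const std::function<void(int)>& task) {
  std::vector<double> seconds(num_devices, 0.0);
  std::vector<std::thread> workers;
  workers.reserve(num_devices);

  for (int i = 0; i < num_devices; ++i) {
    // Each thread writes only to its own slot, so no locking is needed.
    workers.emplace_back([i, &task, &seconds] {
      auto start = std::chrono::steady_clock::now();
      task(i);  // assumed stand-in for: cudaSetDevice(i); cudaMalloc(...);
      auto stop = std::chrono::steady_clock::now();
      seconds[i] = std::chrono::duration<double>(stop - start).count();
    });
  }

  // Wait for all per-device threads before returning the timings.
  for (auto& worker : workers) {
    worker.join();
  }
  return seconds;
}
```

Calling time_per_device(4, ...) with a lambda that performs the allocation would give one timing per GPU, comparable to the per-GPU lines printed by the OpenMP version.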


Tags: c++, cuda, openmp