带有 2 个 GPU 汽车 NVIDIA A6000 的 N 体 OpenCL 代码它们之间通过 NVLink 连接

问题描述

我想运行一个使用 OpenCL 的旧 N-body。

我有 2 块带有 NVLink 的 NVIDIA A6000 卡，这是一个从硬件（可能还有软件？）角度绑定这 2 块 GPU 卡的组件。

但是在执行时，我得到以下结果：

这是使用的内核代码（我已经写了我估计对 NVIDIA 卡有用的编译指示）：

#pragma OPENCL EXTENSION cl_khr_fp64 : enable

__kernel
void
nbody_sim(
    __global double4* pos,__global double4* vel,int numBodies,double deltaTime,double epsSqr,__local double4* localPos,__global double4* newPosition,__global double4* newVeLocity)
{
    unsigned int tid = get_local_id(0);
    unsigned int gid = get_global_id(0);
    unsigned int localSize = get_local_size(0);

    // Gravitational constant
    double G_constant = 227.17085e-74;

    // Number of tiles we need to iterate
    unsigned int numTiles = numBodies / localSize;

    // position of this work-item
    double4 myPos = pos[gid];
    double4 acc = (double4) (0.0f,0.0f,0.0f);

    for(int i = 0; i < numTiles; ++i)
    {
        // load one tile into local memory
        int idx = i * localSize + tid;
        localPos[tid] = pos[idx];

        // Synchronize to make sure data is available for processing
        barrier(CLK_LOCAL_MEM_FENCE);

        // Calculate acceleration effect due to each body
        // a[i->j] = m[j] * r[i->j] / (r^2 + epsSqr)^(3/2)
        for(int j = 0; j < localSize; ++j)
        {
            // Calculate acceleration caused by particle j on particle i
            double4 r = localPos[j] - myPos;
            double distSqr = r.x * r.x  +  r.y * r.y  +  r.z * r.z;
            double invdist = 1.0f / sqrt(distSqr + epsSqr);
            double invdistCube = invdist * invdist * invdist;
            double s = G_constant * localPos[j].w * invdistCube;

            // accumulate effect of all particles
            acc += s * r;
        }

        // Synchronize so that next tile can be loaded
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    double4 oldVel = vel[gid];

    // updated position and veLocity
    double4 newPos = myPos + oldVel * deltaTime + acc * 0.5f * deltaTime * deltaTime;
    newPos.w = myPos.w;
    double4 newVel = oldVel + acc * deltaTime;

    // write to global memory
    newPosition[gid] = newPos;
    newVeLocity[gid] = newVel;
}

设置内核代码的部分代码如下：

int NBody::setupCL()
{
  cl_int status = CL_SUCCESS;
  cl_event writeEvt1,writeEvt2;

  // The block is to move the declaration of prop closer to its use
  cl_command_queue_properties prop = 0;
  commandQueue = clCreateCommandQueue(
      context,devices[current_device],prop,&status);
  CHECK_OPENCL_ERROR( status,"clCreateCommandQueue Failed.");

    ...

// create a CL program using the kernel source
  const char *kernelName = "NBody_Kernels.cl";
  FILE *fp = fopen(kernelName,"r");
  if (!fp) {
    fprintf(stderr,"Failed to load kernel.\n");
    exit(1);
  }
  char *source = (char*)malloc(10000);
  int sourceSize = fread( source,1,10000,fp);
  fclose(fp);

  // Create a program from the kernel source
  program = clCreateProgramWithSource(context,(const char **)&source,(const size_t *)&sourceSize,&status);

  // Build the program
  status = clBuildProgram(program,devices,NULL,NULL);

  // get a kernel object handle for a kernel with the given name
  kernel = clCreateKernel(
      program,"nbody_sim",&status);
  CHECK_OPENCL_ERROR(status,"clCreateKernel Failed.");

  status = waitForEventAndRelease(&writeEvt1);
  CHECK_ERROR(status,NBODY_SUCCESS,"WaitForEventAndRelease(writeEvt1) Failed");

  status = waitForEventAndRelease(&writeEvt2);
  CHECK_ERROR(status,"WaitForEventAndRelease(writeEvt2) Failed");

  return NBODY_SUCCESS;
}

因此，在创建内核代码时会发生错误。有没有办法将 the 2 GPU 视为具有 NVLINK component 的独特 GPU？我的意思是从软件的角度来看？

如何修复创建内核代码时出现的这个错误？

更新 1：

我）通过修改下面的这个循环，我自愿将 GPU 设备的数量限制为只有一个 GPU（实际上，它只剩下一次迭代）：

  // Print device index and device names
  //for(cl_uint i = 0; i < deviceCount; ++i)
  for(cl_uint i = 0; i < 1; ++i)
  {
    char deviceName[1024];
    status = clGetDeviceInfo(deviceids[i],CL_DEVICE_NAME,sizeof(deviceName),deviceName,NULL);
    CHECK_OPENCL_ERROR(status,"clGetDeviceInfo Failed");

    std::cout << "Device " << i << " : " << deviceName <<" Device ID is "<<deviceids[i]<< std::endl;
  }

  // Set id = 0 for currentDevice with deviceType
  *currentDevice = 0;

  free(deviceids);

  return NBODY_SUCCESS;
}

在经典调用之后做：

 status = clBuildProgram(program,NULL);

但错误仍然存在，在消息下方：

II) 如果我不修改此循环并应用建议的解决方案，即设置 devices[current_device] 而不是 devices，我会收到如下编译错误：

In file included from NBody.hpp:8,from NBody.cpp:1:
/opt/AMDAPPSDK-3.0/include/CL/cl.h:863:16: note:   initializing argument 3 of ‘cl_int clBuildProgram(cl_program,cl_uint,_cl_device_id* const*,const char*,void (*)(cl_program,void*),void*)’
                const cl_device_id * /* device_list */,

我怎样才能绕过这个编译问题？

更新 2：

我已经在代码的这一部分打印了 status 变量的值：

我得到了 status = -44 的值。从 CL/cl.h 开始，它将对应于 CL_INVALID_PROGRAM 错误：

然后，当我执行应用程序时，我得到：

我想知道我是否没有错过在内核代码中添加特殊编译指示，因为我在 NVIDIA 卡上使用 OpenCL，不是吗？

顺便说一下，变量 devices 的类型是什么？我无法正确打印。

更新 3：

我添加了以下几行，但在执行时仍然 -44 error。我没有放置所有相关代码，而是提供以下链接来下载源文件：http://31.207.36.11/NBody.cpp 和用于编译的 Makefile：http://31.207.36.11/Makefile。也许有人会发现一些错误，但我最想知道为什么我会得到这个 error -44 。

解决方法

您的内核代码看起来不错，缓存平铺实现是正确的。仅确保主体数量是局部大小的倍数，或者另外将内部 for 循环限制为全局大小。

OpenCL 允许并行使用多个设备。您需要为每个设备单独创建一个带有队列的线程。您还需要手动处理设备与设备之间的通信和同步。数据传输通过 PCIe 进行（您也可以进行远程直接内存访问）；但是您不能将 NVLink 与 OpenCL 一起使用。这在您的情况下应该不是问题，因为与算术数量相比，您只需要很少的数据传输。

补充几点：

在许多情况下，N-body 需要 FP64 来总结不同长度范围内的力并解析位置。但是在 A6000 上，FP64 的性能很差，就像在 GeForce Ampere 上一样。 FP32 会快很多（~64 倍），但这里的准确性可能不够。要获得高效的 FP64，您需要 A100 或 MI100。
使用 rsqrt 代替 1.0/sqrt。这是硬件支持的，几乎和乘法一样快。
确保一致地使用 FP32 浮点数 (1.0f) 或 FP64 双精度数 (1.0) 文字。将 double 文字与 float 一起使用会触发双重算术并将结果转换回 float，这会慢得多。

编辑：为了帮助您解决错误消息：很可能是 clCreateKernel 处的错误（调用 status 后 clCreateKernel 有什么值？）提示 program是无效的。这可能是因为您给 clBuildProgram 一个包含 2 个设备的向量，但将设备数量设置为仅 1，并且 context 仅用于 1 个设备。试试

status = clBuildProgram(program,1,&devices[current_device],NULL,NULL);

只有一个设备。

要使用多 GPU，请在 CPU 上创建两个线程，为 GPU 0 和 1 独立运行 NBody::setupCL()，然后手动进行同步。

编辑 2：我看不出你在哪里创造了 context。如果没有有效的上下文，program 将无效，因此 clBuildProgram 将抛出错误 -44。打电话

context = clCreateContext(0,NULL);

在您对 context 执行任何操作之前。

c++linux linux nvidia nvlink opencl opencl