terminate called after throwing an instance of 'thrust::system::system_error' what(): parallel_for failed: cudaErrorInvalidValue: invalid argument

Problem description

I am trying to count how many times curand_uniform() returns 1.0. However, I can't seem to get the following code to work:

#include <stdio.h>
#include <stdlib.h>  
#include <thrust/device_vector.h>
#include <cuda.h>
#include <cuda_runtime.h>
#include <curand_kernel.h>
using namespace std;

__global__
void counts(int length,int *sum,curandStatePhilox4_32_10_t*  state) {
  int tempsum = int(0);
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  curandStatePhilox4_32_10_t localState  =  state[i];
  for(; i < length; i += blockDim.x * gridDim.x) {
    double thisnum = curand_uniform( &localState );
    if ( thisnum == 1.0 ){
      tempsum += 1;
    }
  }
  atomicAdd(sum,tempsum);
}

__global__
void curand_setup(curandStatePhilox4_32_10_t *state,long seed) {
    int id = threadIdx.x + blockIdx.x * blockDim.x;
    curand_init(seed,id,0,&state[id]);
}

int main(int argc,char *argv[]) {
  const int N = 1e5;

  int* count_h = 0;
  int* count_d;
  cudaMalloc(&count_d,sizeof(int) );
  cudaMemcpy(count_d,count_h,sizeof(int),cudaMemcpyHostToDevice);

  int threads_per_block = 64;
  int Nblocks = 32*6;

  thrust::device_vector<curandStatePhilox4_32_10_t> d_state(Nblocks*threads_per_block);
  curand_setup<<<Nblocks,threads_per_block>>>(d_state.data().get(),time(0));
  counts<<<Nblocks,threads_per_block>>>(N,count_d,d_state.data().get());

  cudaMemcpy(count_h,count_d,sizeof(int),cudaMemcpyDeviceToHost);

  cout << count_h << endl;

  cudaFree(count_d);
  free(count_h);
}

I get this error in the terminal (on Linux):

terminate called after throwing an instance of 'thrust::system::system_error'
  what():  parallel_for failed: cudaErrorInvalidValue: invalid argument
Aborted (core dumped)

I am compiling like this:

nvcc -Xcompiler "-fopenmp" -o test uniform_one_hit_count.cu

I don't understand this error message.

Solution

This line:

thrust::device_vector<curandStatePhilox4_32_10_t> d_state(Nblocks*threads_per_block);

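is instantiating a new vector on the device. When thrust does that, it calls the constructor of the object type in use, in this case curandStatePhilox4_32_10, a struct whose definition is in /usr/local/cuda/include/curand_philox4x32_x.h (on Linux, anyway). Unfortunately, that struct definition doesn't provide any constructors decorated with __device__, and this is causing trouble for thrust.

A simple workaround is to assemble the vector on the host and copy it to the device: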

thrust::host_vector<curandStatePhilox4_32_10_t> h_state(Nblocks*threads_per_block);
thrust::device_vector<curandStatePhilox4_32_10_t> d_state = h_state;

Or, just use cudaMalloc to allocate the space:

curandStatePhilox4_32_10_t *d_state;
cudaMalloc(&d_state,(Nblocks*threads_per_block)*sizeof(d_state[0]));
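
As a minimal sketch of the cudaMalloc route, the program below allocates one Philox state per thread, seeds the states, and checks every CUDA call so that a failure is reported with a readable message instead of an abort. The helper name make_states, the fixed seed, and the error-reporting style are assumptions made for illustration; they are not part of the original code.

```cuda
#include <cstdio>
#include <cuda_runtime.h>
#include <curand_kernel.h>

// Seed one Philox state per thread; each thread gets its own subsequence.
__global__ void curand_setup(curandStatePhilox4_32_10_t *state,
                             unsigned long long seed) {
  int id = blockIdx.x * blockDim.x + threadIdx.x;
  curand_init(seed, id, 0, &state[id]);
}

// Hypothetical helper: allocates and seeds Nblocks*threads_per_block states.
// Returns a device pointer, or nullptr if any CUDA call fails.
curandStatePhilox4_32_10_t *make_states(int Nblocks, int threads_per_block,
                                        unsigned long long seed) {
  curandStatePhilox4_32_10_t *d_state = nullptr;
  size_t bytes = size_t(Nblocks) * threads_per_block * sizeof(*d_state);

  cudaError_t err = cudaMalloc(&d_state, bytes);
  if (err != cudaSuccess) {
    fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err));
    return nullptr;
  }

  curand_setup<<<Nblocks, threads_per_block>>>(d_state, seed);
  err = cudaGetLastError();                                // launch errors
  if (err == cudaSuccess) err = cudaDeviceSynchronize();   // execution errors
  if (err != cudaSuccess) {
    fprintf(stderr, "curand_setup failed: %s\n", cudaGetErrorString(err));
    cudaFree(d_state);
    return nullptr;
  }
  return d_state;
}

int main() {
  // Same launch shape as in the question: 32*6 blocks of 64 threads.
  curandStatePhilox4_32_10_t *d_state = make_states(32 * 6, 64, 1234ULL);
  if (d_state == nullptr) return 1;
  cudaFree(d_state);
  return 0;
}
```

With a raw pointer there is no d_state.data().get() step; the pointer is passed to the kernels directly, and it has to be released with cudaFree when the work is done.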

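You have at least one other problem as well. This does not actually provide a proper storage allocation for what the pointer should point to: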

int* count_h = 0;

After that, you should do something like this:

count_h = (int *)malloc(sizeof(int));
memset(count_h,0,sizeof(int));

On your printout line, you most likely want this:

cout << count_h[0] << endl;

Another way to address the count_h problem is to start with:

int count_h = 0;

This would require a different set of changes to the code (the cudaMemcpy operations).
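
To make the plain-int variant concrete, here is a sketch of the whole program with count_h as a stack variable, so both cudaMemcpy calls take &count_h. It uses the host_vector workaround described earlier for the state vector. The wrapper function count_uniform_ones, the fixed seed, and the choice to write the local state back to global memory are assumptions added for illustration and do not appear in the original code.

```cuda
#include <cstdio>
#include <cuda_runtime.h>
#include <curand_kernel.h>
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>

// Seed one Philox state per thread; each thread gets its own subsequence.
__global__ void curand_setup(curandStatePhilox4_32_10_t *state,
                             unsigned long long seed) {
  int id = blockIdx.x * blockDim.x + threadIdx.x;
  curand_init(seed, id, 0, &state[id]);
}

// Grid-stride loop: draw `length` samples in total and count exact 1.0f hits.
__global__ void counts(int length, int *sum, curandStatePhilox4_32_10_t *state) {
  int tid = blockIdx.x * blockDim.x + threadIdx.x;
  curandStatePhilox4_32_10_t localState = state[tid];
  int tempsum = 0;
  for (int i = tid; i < length; i += blockDim.x * gridDim.x) {
    // curand_uniform returns a float in (0, 1], so 1.0f is a possible value.
    if (curand_uniform(&localState) == 1.0f) tempsum += 1;
  }
  state[tid] = localState;   // keep the advanced state for later launches
  atomicAdd(sum, tempsum);
}

// Hypothetical wrapper: returns the hit count, or -1 on a CUDA error.
int count_uniform_ones(int n, unsigned long long seed) {
  const int threads_per_block = 64;
  const int Nblocks = 32 * 6;

  // Build the state vector on the host, then copy it to the device.
  thrust::host_vector<curandStatePhilox4_32_10_t> h_state(Nblocks * threads_per_block);
  thrust::device_vector<curandStatePhilox4_32_10_t> d_state = h_state;
  curandStatePhilox4_32_10_t *state_ptr = thrust::raw_pointer_cast(d_state.data());

  int count_h = 0;           // plain int on the host stack
  int *count_d = nullptr;
  if (cudaMalloc(&count_d, sizeof(int)) != cudaSuccess) return -1;
  cudaMemcpy(count_d, &count_h, sizeof(int), cudaMemcpyHostToDevice);

  curand_setup<<<Nblocks, threads_per_block>>>(state_ptr, seed);
  counts<<<Nblocks, threads_per_block>>>(n, count_d, state_ptr);

  // This blocking copy waits for both kernels and reports their errors too.
  cudaError_t err = cudaMemcpy(&count_h, count_d, sizeof(int), cudaMemcpyDeviceToHost);
  cudaFree(count_d);
  return err == cudaSuccess ? count_h : -1;
}

int main() {
  int hits = count_uniform_ones(100000, 1234ULL);
  printf("%d\n", hits);
  return hits < 0 ? 1 : 0;
}
```

Because count_h is no longer a pointer, there is nothing to malloc or free on the host side, and the printout uses the variable directly rather than count_h[0].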
