Matrix multiplication in SYCL using a 2D std::vector

Problem description

I am new to SYCL and C++. This is my kernel for a simple matrix multiplication using a 2D std::vector.


void MatrixMulParallel(queue& q,const std::vector<std::vector<double>>& a_host,const std::vector<std::vector<double>>& b_host,std::vector<std::vector<double>>& c_gpu) {
    /*
        To Multiply: C[M][P] = A[M][N] * B[N][P]
    */
    PROFILE_FUNCTION();
    try {
        size_t M = a_host.size();
        size_t N = a_host[0].size();
        size_t P = b_host[0].size();
        // Create device buffers for A,B,C
        buffer a(a_host.data(),range<2>{M,N});
        buffer b(b_host.data(),range<2>{N,P});
        buffer c(c_gpu.data(),range<2>{M,P});

        PROFILE_SCOPE("Starting Multiply on GPU");
        std::cout << "GPU::Multiplying A and B into C.\n";
        auto e = q.submit([&](handler& h) {

            auto A = a.get_access<access::mode::read>(h);
            auto B = b.get_access<access::mode::read>(h);
            auto C = c.get_access<access::mode::write>(h);

            h.parallel_for(range<2>{M,P},[=](id<2> index) {
                // index[0] allows accessing ROW index,index[1] is column index
                
                int row = index[0];
                int col = index[1];
                auto sum = 0.0;
                for (int i = 0; i < N; i++)
                    sum += A[row][i] * B[i][col]; // Error #1
                C[index] = sum; // Error #2
                });
            });
        e.wait();
    }
    catch (sycl::exception const& e) {
        std::cout << "An exception is caught while multiplying matrices.\n";
        terminate();
    }
}

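I get two errors, as follows:

Error #1: invalid operands to binary expression ('const std::vector<double,std::allocator<double>>' and 'const std::vector<double,std::allocator<double>>')
Error #2: no viable overloaded '='

I tried looking up errors similar to invalid operands for binary expression (...), but none of the results helped me debug my specific case. Perhaps that is because they are not beginner-friendly.

As far as I understand so far, a_host.data() reports a return type of std::vector<double> (shouldn't it be std::vector< std::vector<double> >?).

I tried using std::array with statically known sizes, and that works.

How can I make this work with a 2D std::vector?

Any help would be appreciated.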

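Solution

A 2D std::vector<std::vector<T>> does not store its elements contiguously in memory: each inner vector owns a separate heap allocation. Calling data() on the outer vector therefore returns a pointer to std::vector<double> objects, not to doubles, so the buffer's element type becomes std::vector<double>. That explains both errors: A[row][i] * B[i][col] tries to multiply two vectors (Error #1), and C[index] = sum tries to assign a double to a vector (Error #2).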
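The point about data() can be checked without any SYCL code. The sketch below is only an illustration: the aliases Row and Matrix2D and the helper rows_are_contiguous are hypothetical names, not part of the original post.

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>
#include <utility>
#include <vector>

using Row = std::vector<double>;       // hypothetical alias for one matrix row
using Matrix2D = std::vector<Row>;     // hypothetical alias for the nested layout

// data() on the outer vector points at Row objects, not at doubles.
static_assert(std::is_same_v<decltype(std::declval<const Matrix2D&>().data()),
                             const Row*>,
              "outer data() yields a pointer to rows");

// Returns true only if every row's elements end exactly where the next
// row's elements begin. Nothing in the standard guarantees this for
// separately allocated rows, so for real matrices it is normally false.
bool rows_are_contiguous(const Matrix2D& m) {
    for (std::size_t r = 0; r + 1 < m.size(); ++r) {
        assert(!m[r].empty() && "sketch assumes non-empty rows");
        if (m[r].data() + m[r].size() != m[r + 1].data())
            return false;
    }
    return true;
}
```

A matrix with zero or one row passes the check trivially. With several rows, the result depends on where the allocator happened to place each row, which is exactly why a single pointer cannot describe the whole matrix.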

A better approach is to declare a single std::vector<T> of size M * N, i.e. a linear array, and operate on it as one contiguous block.
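If the data already lives in a vector of vectors, it can be copied into such a linear array once, before the buffers are created. This is a minimal sketch assuming a rectangular matrix stored row by row; flatten_row_major and flat_index are hypothetical helper names, not from the original answer.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: copies a rectangular vector-of-vectors into one
// contiguous row-major block.
template <typename T>
std::vector<T> flatten_row_major(const std::vector<std::vector<T>>& m) {
    const std::size_t rows = m.size();
    const std::size_t cols = rows == 0 ? 0 : m[0].size();
    std::vector<T> flat;
    flat.reserve(rows * cols);
    for (const auto& row : m) {
        assert(row.size() == cols && "all rows must have the same length");
        flat.insert(flat.end(), row.begin(), row.end());
    }
    return flat;
}

// Element (r, c) of a row-major matrix with `cols` columns sits at
// offset r * cols + c in the flat array.
inline std::size_t flat_index(std::size_t r, std::size_t c, std::size_t cols) {
    return r * cols + c;
}
```

For a 2 x 3 matrix, element (1, 0) lands at offset 1 * 3 + 0 = 3, directly after the three elements of row 0.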

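Since the destination matrix C should be 2D, create a kernel that is indexed over both rows and columns. The 2D SYCL index still addresses a linearly accessible block of memory.

Here is how I got it working with std::vector: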

template <typename T>
void MatrixMulParallelNaive(queue& q,const std::vector<T>& a_host,const std::vector<T>& b_host,std::vector<T>& c_gpu,size_t M,size_t N,size_t P) {
    /*
        To Multiply: C[M][P] = A[M][N] * B[N][P]
        All three matrices are flat, row-major vectors. M, N and P are passed
        explicitly because a flat vector no longer carries its shape.
        c_gpu must already hold M * P elements.
    */
    PROFILE_FUNCTION();
    try {

        buffer<T,1> a(a_host.data(),range<1>{a_host.size()}); // 1D
        buffer<T,1> b(b_host.data(),range<1>{b_host.size()}); // 1D
        buffer<T,2> c(c_gpu.data(),range<2>{M,P}); // 2D view over the flat result storage
        PROFILE_SCOPE("Starting Multiply on GPU");
        std::cout << "GPU::Multiplying A and B into C.\n";
        auto e = q.submit([&](handler& h) {

            auto A = a.get_access<access::mode::read>(h);
            auto B = b.get_access<access::mode::read>(h);
            auto C = c.get_access<access::mode::write>(h);

            h.parallel_for(range<2>{M,P},[=](id<2> index) {
                // Each work-item computes ONE element of C.
                size_t row = index[0];
                size_t col = index[1];
                T sum = 0;
                // Row-major stride is the row length: N for A, P for B.
                for (size_t i = 0; i < N; i++)
                    sum += A[row * N + i] * B[i * P + col];
                C[index] = sum;
                });
            });
        e.wait();
    }
    catch (sycl::exception const& e) {
        std::cout << "An exception is caught while multiplying matrices.\n";
        terminate();
    }
}
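One way to gain confidence in a kernel like this is to compare its output with a plain CPU loop over the same flat, row-major layout. The sketch below is an assumption about how such a check could look, not part of the original answer; matmul_reference and nearly_equal are hypothetical names. For row-major storage the stride is the row length, so A is read at row * N + i and B at i * P + col. A non-square test case is useful because a wrong stride often goes unnoticed when M, N and P are all equal.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical CPU reference: C[M][P] = A[M][N] * B[N][P], all flat row-major.
template <typename T>
std::vector<T> matmul_reference(const std::vector<T>& a, const std::vector<T>& b,
                                std::size_t M, std::size_t N, std::size_t P) {
    assert(a.size() == M * N && b.size() == N * P);
    std::vector<T> c(M * P, T{});
    for (std::size_t row = 0; row < M; ++row)
        for (std::size_t col = 0; col < P; ++col) {
            T sum{};
            for (std::size_t i = 0; i < N; ++i)
                sum += a[row * N + i] * b[i * P + col];
            c[row * P + col] = sum;
        }
    return c;
}

// Hypothetical comparison helper: element-wise check within an absolute tolerance.
template <typename T>
bool nearly_equal(const std::vector<T>& x, const std::vector<T>& y, T tol) {
    if (x.size() != y.size()) return false;
    for (std::size_t i = 0; i < x.size(); ++i)
        if (std::abs(x[i] - y[i]) > tol) return false;
    return true;
}
```

After the device run, the result vector can be passed to nearly_equal together with the output of matmul_reference; the tolerance is a judgment call that depends on the element type and matrix size.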

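More generally, avoid non-compact data structures when doing HPC. They are less friendly to the memory hierarchy than contiguous array elements, and they are complicated to initialize. Use something like md_span and md_array instead (basically Fortran arrays on steroids :-)).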
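To illustrate the idea behind such multidimensional views, here is a deliberately small, hand-written stand-in. It is not the md_span or md_array interface itself; MatrixView is a hypothetical class that only shows how 2D indexing can be layered over contiguous storage without owning it.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical, minimal non-owning 2D view over contiguous row-major storage.
template <typename T>
class MatrixView {
public:
    MatrixView(T* data, std::size_t rows, std::size_t cols)
        : data_(data), rows_(rows), cols_(cols) {}

    // 2D access translated to a single offset into the flat block.
    T& operator()(std::size_t r, std::size_t c) const {
        assert(r < rows_ && c < cols_);
        return data_[r * cols_ + c];
    }

    std::size_t rows() const { return rows_; }
    std::size_t cols() const { return cols_; }

private:
    T* data_;            // not owned; the caller keeps the storage alive
    std::size_t rows_;
    std::size_t cols_;
};
```

The storage stays a single std::vector<T>, so it can still be handed to a buffer as one block, while host code reads and writes it with (row, column) syntax.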
