Vectorizing matrix multiplication in ArrayFire

Problem description

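I am using the ArrayFire library for signal processing and would like to know how to make my code more efficient. I read the vectorization guide in the documentation, but I still ended up using the gfor construct. Is it possible to improve the efficiency? Is there a better way to vectorize this? (I hope there is :)

Note: I am targeting CUDA performance.

This is the code I am trying to improve: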

#include <arrayfire.h>
#include <cstdio>
#include <af/util.h>

static const int proc_size = 1024;
static const int fft_size  = proc_size * 4;
static const int staves    = 288;
static const int beams     = 256;

// Globals so that fn() matches the void(*)() signature af::timeit expects.
static af::array S;  // fft_size x staves
static af::array B;  // staves x beams x fft_size
static af::array R;  // fft_size x beams

// For each FFT bin i: R(i, :) = S(i, :) * B(:, :, i)
void fn()
{
    gfor ( af::seq i,fft_size )
        R( i,af::span ) = matmul( S( i,af::span ),B( af::span,af::span,i ) );
}

int main(int,char **)
{
    try
    {
        af::setDevice( 0 );
        af::info();

        // Random complex test data; randn already fills every element,
        // so no extra per-row initialization loop is needed.
        S = af::randn( fft_size,staves,c32 );
        B = af::randn( staves,beams,fft_size,c32 );
        // R needs one row per FFT bin, so it is fft_size x beams.
        R = af::constant( af::cfloat { 0,0 },fft_size,beams );

        double time = af::timeit(fn);

        printf( "Took %f secs.\n",time );
    }
    catch (const af::exception &ex)
    {
        fprintf(stderr,"%s\n",ex.what());
        return 1;
    }

    return 0;
}
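
To make the intended computation explicit, here is a minimal reference sketch in plain C++ (no ArrayFire) of what the loop in fn() is meant to produce: for every FFT bin i, row i of R is row i of S multiplied by slice i of B. The function name batched_row_matmul and the flat std::vector buffers are assumptions made for illustration only; the buffers are column-major, which is the layout ArrayFire uses. It is meant for checking results on small sizes, not as a fast implementation.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <vector>

using cfloat = std::complex<float>;

// Hypothetical reference helper (not part of ArrayFire).
// For every bin i:  R(i, b) = sum over s of S(i, s) * B(s, b, i)
// All buffers are column-major (first index varies fastest):
//   S: bins x staves,  B: staves x beams x bins,  R: bins x beams.
std::vector<cfloat> batched_row_matmul(const std::vector<cfloat>& S,
                                       const std::vector<cfloat>& B,
                                       std::size_t bins,
                                       std::size_t staves,
                                       std::size_t beams)
{
    assert(S.size() == bins * staves);
    assert(B.size() == staves * beams * bins);

    std::vector<cfloat> R(bins * beams, cfloat(0.0f, 0.0f));
    for (std::size_t i = 0; i < bins; ++i) {
        for (std::size_t b = 0; b < beams; ++b) {
            cfloat acc(0.0f, 0.0f);
            for (std::size_t s = 0; s < staves; ++s) {
                const cfloat s_is  = S[i + bins * s];                  // S(i, s)
                const cfloat b_sbi = B[s + staves * (b + beams * i)];  // B(s, b, i)
                acc += s_is * b_sbi;
            }
            R[i + bins * b] = acc;                                     // R(i, b)
        }
    }
    return R;
}
```

Comparing the output of a vectorized version against this loop on a few small inputs is one way to confirm that a rewrite still computes the same R.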
