More aggressive optimization for FMA operations

Problem description

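I want to build a data type that holds multiple (say N) values of an arithmetic type and, through operator overloading, offers the same interface as that arithmetic type, so that I end up with something similar to Agner Fog's vectorclass.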

Have a look at the following example: Godbolt

#include <array>

using std::size_t;

template<class T,size_t S>
class LoopSIMD : std::array<T,S>
{
public:
    friend LoopSIMD operator*(const T a,const LoopSIMD& x){
        LoopSIMD result;
        for(size_t i=0;i<S;++i)
            result[i] = a*x[i];
        return result;
    }

    LoopSIMD& operator +=(const LoopSIMD& x){
        for(size_t i=0;i<S;++i){
            (*this)[i] += x[i];
        }
        return *this;
    }
};

constexpr size_t N = 7;
typedef LoopSIMD<double,N> SIMD;

SIMD foo(double a,SIMD x,SIMD y){
    x += a*y;
    return x;
}

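This seems to work well up to a certain number of elements: 6 for gcc-10 and 27 for clang-11. For a larger number of elements the compilers no longer use FMA operations (e.g. vfmadd213pd). Instead they emit separate multiplications (e.g. vmulpd) and additions (e.g. vaddpd).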

Questions:

Is there a good reason for this behaviour?
Is there any compiler flag that raises the limits mentioned above (6 for gcc, 27 for clang)?

Thanks!

Solution

With gcc 10.2 I did the following and got some pretty good results, using the same flags as your Godbolt link: -Ofast -march=skylake -ffast-math.

// std::transform needs #include <algorithm>
friend LoopSIMD operator*(const T a,const LoopSIMD& x) {
    LoopSIMD result;
    std::transform(x.cbegin(),x.cend(),result.begin(),[a](auto const& i) { return a * i; });
    return result;
}

LoopSIMD& operator+=(const LoopSIMD& x) {
    std::transform(this->cbegin(),this->cend(),x.cbegin(),this->begin(),[](auto const& a,auto const& b) { return a + b; });
    return *this;
}

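std::transform has some crazy overloads, so I think I need to explain.

The first call uses the unary overload: the lambda captures a, multiplies each element of x by it, and the products are written to result starting at result.begin().

The second call uses the binary overload, which acts like a zip: it adds the corresponding elements of this and x and stores the sums back into this.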
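To make the two overloads easier to compare, here is a minimal sketch on plain std::array, outside the class. The helper names scale, add_into and transform_demo are made up for this illustration and do not appear in the code above.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>

// Unary overload: one input range, one output range.
// Writes a * x[i] into result[i].
template <class T, std::size_t S>
std::array<T, S> scale(T a, const std::array<T, S>& x) {
    std::array<T, S> result{};
    std::transform(x.cbegin(), x.cend(), result.begin(),
                   [a](const T& v) { return a * v; });
    return result;
}

// Binary overload: two input ranges walked in lockstep (like a zip).
// Writes acc[i] + x[i] back into acc[i].
template <class T, std::size_t S>
void add_into(std::array<T, S>& acc, const std::array<T, S>& x) {
    std::transform(acc.cbegin(), acc.cend(), x.cbegin(), acc.begin(),
                   [](const T& l, const T& r) { return l + r; });
}

// Usage: the equivalent of x += a * y for a = 2.
inline void transform_demo() {
    std::array<double, 3> x{1.0, 2.0, 3.0};
    const std::array<double, 3> y{10.0, 20.0, 30.0};
    add_into(x, scale(2.0, y));
    assert(x[0] == 21.0 && x[1] == 42.0 && x[2] == 63.0);
}
```

Whether the compiler fuses the two loops into FMA instructions still depends on the optimization flags; the sketch only shows what each overload computes.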

If you are not married to operator+= and operator*, you can create your own fma like this:

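    // Needs #include <numeric> and #include <functional> (C++17).
    // Writes the running sum of a * x[i] into *this.
    LoopSIMD& fma(const LoopSIMD& x,double a ){
        std::transform_inclusive_scan(
            x.cbegin(),x.cend(),this->begin(),
            std::plus{},[a](auto const& i){return i * a;},0.0);
        return *this;
    }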

This requires C++17, but the resulting loop keeps the FMA instruction in it:

foo(double, LoopSIMD<double, 40ul>&, LoopSIMD<double, 40ul> const&):
        xor     eax,eax
        vxorpd  xmm1,xmm1,xmm1
.L2:
        vfmadd231sd     xmm1,xmm0,QWORD PTR [rsi+rax]
        vmovsd  QWORD PTR [rdi+rax],xmm1
        add     rax,8
        cmp     rax,320
        jne     .L2
        ret
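The loop above carries xmm1 from one iteration to the next, which matches what std::transform_inclusive_scan computes: a running sum of the transformed inputs rather than an independent result per element. A small sketch of that behaviour, with the hypothetical names scaled_prefix_sum and scan_demo and a fixed size of 4 chosen only for the example:

```cpp
#include <array>
#include <cassert>
#include <functional>
#include <numeric>

// out[i] = a*x[0] + a*x[1] + ... + a*x[i]
inline std::array<double, 4> scaled_prefix_sum(double a,
                                               const std::array<double, 4>& x) {
    std::array<double, 4> out{};
    std::transform_inclusive_scan(
        x.cbegin(), x.cend(),             // input range
        out.begin(),                      // destination
        std::plus<>{},                    // combines the partial results
        [a](double v) { return a * v; },  // applied to each input first
        0.0);                             // initial value of the running sum
    return out;
}

inline void scan_demo() {
    const std::array<double, 4> x{1.0, 2.0, 3.0, 4.0};
    const std::array<double, 4> out = scaled_prefix_sum(2.0, x);
    // Running sums of 2, 4, 6, 8.
    assert(out[0] == 2.0 && out[1] == 6.0 && out[2] == 12.0 && out[3] == 20.0);
}
```

Note that this is not the same as the element-wise x[i] += a * y[i] from the question, so it is only a drop-in replacement if a running sum is what you want.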

You could also simply write your own fma function:

#include <cmath>  // std::fma

template<class T,size_t S>
class LoopSIMD : std::array<T,S>
{
public:
    friend LoopSIMD fma(const LoopSIMD& x,const T y,const LoopSIMD& z) {
        LoopSIMD result;
        for (size_t i = 0; i < S; ++i) {
            result[i] = std::fma(x[i],y,z[i]);
        }
        return result;
    }
    friend LoopSIMD fma(const T y,const LoopSIMD& x,const LoopSIMD& z) {
        LoopSIMD result;
        for (size_t i = 0; i < S; ++i) {
            result[i] = std::fma(y,x[i],z[i]);
        }
        return result;
    }
    // And more variants, e.g. taking `const LoopSIMD&, const LoopSIMD&, const T`, or starting with `const LoopSIMD&, const T`, etc.
};

SIMD foo(double a,SIMD x,SIMD y){
    return fma(a,y,x);  // a*y + x, the same result as x += a*y
}
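For reference, a self-contained version of the scalar-times-vector variant that can be compiled and checked on its own. Two things here are assumptions of the sketch and not part of the snippet above: the class inherits publicly so that the check can read elements from outside, and fma_demo is a made-up name.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>

// Sketch only: public inheritance is used so elements are readable
// from outside the class; the snippet above inherits privately.
template <class T, std::size_t S>
class LoopSIMD : public std::array<T, S> {
public:
    // result[i] = a * x[i] + z[i], each element computed by std::fma
    friend LoopSIMD fma(const T a, const LoopSIMD& x, const LoopSIMD& z) {
        LoopSIMD result;
        for (std::size_t i = 0; i < S; ++i) {
            result[i] = std::fma(a, x[i], z[i]);
        }
        return result;
    }
};

using SIMD = LoopSIMD<double, 7>;

// Same contract as the original foo: returns x + a * y.
inline SIMD foo(double a, SIMD x, SIMD y) {
    return fma(a, y, x);
}

inline void fma_demo() {
    SIMD x{}, y{};
    for (std::size_t i = 0; i < 7; ++i) {
        x[i] = 1.0;
        y[i] = static_cast<double>(i);
    }
    const SIMD r = foo(2.0, x, y);
    for (std::size_t i = 0; i < 7; ++i) {
        assert(r[i] == 1.0 + 2.0 * static_cast<double>(i));
    }
}
```

Calling std::fma explicitly asks for a fused multiply-add per element, so the result no longer depends on the compiler deciding to contract a separate multiply and add.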

But first of all, for better optimization, you should align your array. If you do that, your original code optimizes well:

constexpr size_t next_power_of_2_not_less_than(size_t n) {
    size_t pow = 1;
    while (pow < n) pow *= 2;
    return pow;
}

template<class T,size_t S>
class LoopSIMD : std::array<T,S>
{
public:
    // operators
} __attribute__((aligned(next_power_of_2_not_less_than(sizeof(T[S])))));

// Or with a c++11 attribute
/*
template<class T,size_t S>
class [[gnu::aligned(next_power_of_2_not_less_than(sizeof(T[S])))]] LoopSIMD : std::array<T,S>
{
public:
    // operators
};
*/

SIMD foo(double a,SIMD x,SIMD y){
    x += a * y;
    return x;
}
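If you would rather avoid the GNU-specific attribute, roughly the same over-alignment can be requested with standard alignas. The following is a sketch under a few assumptions: alignas is swapped in for __attribute__((aligned(...))), the class inherits publicly so the check can read elements, and aligned_demo is a made-up name. It does not by itself show what code the compiler generates.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

constexpr std::size_t next_power_of_2_not_less_than(std::size_t n) {
    std::size_t pow = 1;
    while (pow < n) pow *= 2;
    return pow;
}

// Standard alignas in place of the GNU aligned attribute (assumption
// of this sketch). For 7 doubles of 8 bytes each this requests 64.
template <class T, std::size_t S>
class alignas(next_power_of_2_not_less_than(sizeof(T[S]))) LoopSIMD
    : public std::array<T, S> {
public:
    friend LoopSIMD operator*(const T a, const LoopSIMD& x) {
        LoopSIMD result;
        for (std::size_t i = 0; i < S; ++i) result[i] = a * x[i];
        return result;
    }
    LoopSIMD& operator+=(const LoopSIMD& x) {
        for (std::size_t i = 0; i < S; ++i) (*this)[i] += x[i];
        return *this;
    }
};

using SIMD = LoopSIMD<double, 7>;

// The alignment is the array size rounded up to a power of two,
// and the object size is padded to a multiple of that alignment.
static_assert(alignof(SIMD) == next_power_of_2_not_less_than(sizeof(double[7])),
              "unexpected alignment");
static_assert(sizeof(SIMD) % alignof(SIMD) == 0, "size must be padded");

inline SIMD foo(double a, SIMD x, SIMD y) {
    x += a * y;
    return x;
}

inline void aligned_demo() {
    SIMD x{}, y{};
    for (std::size_t i = 0; i < 7; ++i) {
        x[i] = 1.0;
        y[i] = static_cast<double>(i);
    }
    const SIMD r = foo(2.0, x, y);
    for (std::size_t i = 0; i < 7; ++i) {
        assert(r[i] == 1.0 + 2.0 * static_cast<double>(i));
    }
}
```

The padding means a 7-element vector occupies the space of 8 elements, which is the price paid for letting the compiler treat it as whole vector registers.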
