When and how should floating point conversion be done in a SIMD routine?

Problem description

I am computing the bi-directional (horizontal and vertical) prefix sums (scans) of 2 images, yielding the sum of the pixels, the sum of their squares, and the cross product of the two images. All of the computation is done in 32-bit integers, followed by a final pass that needs to convert the 32-bit integers to doubles in order to compute the means, variances and covariance of the two images inside a windowing function.

First - is this the best way to do it? I could build the entire prefix-sum array as doubles instead and have no conversion step at all.

Second - if this is the right approach, is there much benefit to using packed double SIMD operations? As far as I can tell I would only get 2 lanes at a time.

Third - should I pack the data elements together or leave them in the planar format I am currently using? [Planar format is a format where the pixels are broken apart by "component". If you get 32-bit RGBA input, i.e. 8-bit R, 8-bit G, 8-bit B and 8-bit A, the packed format would be RGBARGBA, while the planar format would be RRRRRRRRRRRR....GGGGGGGGGGGG....BBBBB.....AAAAA... and so on.]

Below are the three functions I have written so far that relate to this topic. The first two are scalar versions, so it is easier to read and understand what is going on. The third is the current SIMD implementation of the first function. The fourth function (missing, not yet written) is the subject of this question, and will most likely be a SIMD implementation of the second function.

std::unique_ptr<uint32_t[],boost::alignment::aligned_delete> computeSumMatrixForwardScalar2PassAll(uint8_t const* pImgData1,uint8_t const* pImgData2,unsigned width,unsigned height)
{
  using namespace simdpp;
  std::unique_ptr<uint32_t[],boost::alignment::aligned_delete> sumArray((uint32_t*)boost::alignment::aligned_alloc(64,5*width*height*sizeof(uint32_t)));
  auto pSumArray = sumArray.get();
  BOOST_ALIGN_ASSUME_ALIGNED(pImgData1,64);
  BOOST_ALIGN_ASSUME_ALIGNED(pImgData2,64);
  BOOST_ALIGN_ASSUME_ALIGNED(pSumArray,64);
//#pragma omp parallel for private(h) shared(pImgData,pSumArray,w )

#pragma omp for simd
  for (unsigned h = 0; h < height; ++h)
  {
    uint32_t lastValX = 0;
    uint32_t lastValY = 0;
    uint32_t lastValXX = 0;
    uint32_t lastValYY = 0;
    uint32_t lastValXY = 0;
    for (unsigned w = 0; w < width; ++w)
    {
      uint32_t imgValX      = pImgData1[h * width + w];
      uint32_t newValX      = lastValX + imgValX;
      uint32_t newValXX = lastValXX + imgValX*imgValX;
      uint32_t imgValY = pImgData2[h*width + w];
      uint32_t newValY = lastValY + imgValY;
      uint32_t newValYY = lastValYY + imgValY*imgValY;
      uint32_t newValXY = lastValXY + imgValX*imgValY;
      pSumArray[h*width + w]= newValX;
      pSumArray[width*height+h*width + w] = newValY;
      pSumArray[2*width*height+ h*width + w] = newValXX;

      pSumArray[3*width*height+h*width + w] = newValYY;
      pSumArray[4*width*height+h*width + w] = newValXY;
      lastValX              = newValX;
      lastValXX = newValXX;
      lastValY = newValY;
      lastValYY = newValYY;
      lastValXY = newValXY;
    }
  }
  for (unsigned i = 0; i < 5; ++i) {
    for (unsigned h = 0; h+1 < height; ++h)
    {
      for (unsigned w = 0; w < width; ++w) {
        uint32_t above = pSumArray[i*width*height + h * width + w];
        uint32_t current = pSumArray[i*width*height+ (h+1) *width +w];
        pSumArray[i*width*height + (h+1) * width +w]= above+current;
      }
    }
  }

  return sumArray;
}

Second: the SSIM transform - note the different language - I have not finished the C++ implementation of it yet. Note that it calls weberSumMatrix, which is the same as the function above.

export function weberSsim(
  pixels1: ImageMatrix,pixels2: ImageMatrix,options: Options
): MSSIMMatrix {
  // console.time("weber ssim");
  const { bitDepth,k1,k2,windowSize} = options
  const L = (1 << bitDepth) - 1
  const c1 = k1 * L * (k1 * L)
  const c2 = k2 * L * (k2 * L)
  const windowSquared = windowSize * windowSize
  const pixels1Data = pixels1.data;
  const pixels2Data = pixels2.data;
  const width = pixels1.width;
  const height = pixels1.height;
  // Produces exactly the same output as the C++ prefix sum above.
  const sumMatrix = weberSumMatrix(pixels1Data,pixels2Data,width,height);
  const windowHeight = height-windowSize;
  const windowWidth = width-windowSize;
  const imageSize = width*height;
  const ssims = new Array(windowHeight*windowWidth);


  // lets handle w = 0 h = 0 first and initialize mssim

  let cumulativeSsim;
  const reciprocalWindowSquared =  1 / windowSquared;
  {
    const windowOffset = windowSize - 1;
    let bottomOffset = windowOffset*width;
    {
      const meanx = (sumMatrix[bottomOffset+ windowOffset]) * reciprocalWindowSquared;
      const meany = (
        sumMatrix[imageSize + bottomOffset+ windowOffset]) * reciprocalWindowSquared;
      const varx = (
        sumMatrix[2*imageSize + bottomOffset+ windowOffset]) * reciprocalWindowSquared - meanx*meanx ;
      const vary = (
        sumMatrix[3*imageSize + bottomOffset+ windowOffset])  * reciprocalWindowSquared - meany*meany;
      const cov = (
        sumMatrix[4*imageSize + bottomOffset+ windowOffset])  * reciprocalWindowSquared - meanx*meany;
      const na = 2 * meanx * meany + c1
      const nb = 2 * cov + c2
      const da = meanx * meanx + meany * meany + c1
      const db = varx + vary + c2
      const ssim = (na * nb) / (da * db)
      ssims[0] = ssim
      // mssim = ssim
      cumulativeSsim = ssim;
    }



    // next handle all of the h = 0,w > 0 cases first
    for (let w = 1; w <  windowWidth; ++w) {
      // in h =0 cases,there is no top left or top right
      let leftOffset = w - 1;
      const rightx = sumMatrix[bottomOffset+leftOffset];
      const leftx = sumMatrix[bottomOffset+(windowOffset+w)];
      const meanx = (leftx-rightx)* reciprocalWindowSquared;
      const righty= sumMatrix[imageSize + bottomOffset+ leftOffset];
      const lefty = sumMatrix[imageSize + bottomOffset+ (windowOffset+w)];
      const meany = (lefty-righty) * reciprocalWindowSquared;
      const rightxx = sumMatrix[2*imageSize + bottomOffset+leftOffset];
      const leftxx = sumMatrix[2*imageSize + bottomOffset+ (windowOffset+w)];
      const varx = (leftxx-rightxx) * reciprocalWindowSquared - meanx*meanx ;
      const rightyy = sumMatrix[3*imageSize + bottomOffset+leftOffset];
      const leftyy = sumMatrix[3*imageSize + bottomOffset+ (windowOffset+w)]
      const vary = (leftyy - rightyy)  * reciprocalWindowSquared - meany*meany;
      const rightxy = sumMatrix[4*imageSize + bottomOffset+leftOffset];
      const leftxy = sumMatrix[4*imageSize + bottomOffset+ (windowOffset+w)];
      const cov = (leftxy-rightxy)  * reciprocalWindowSquared - meanx*meany;
      const na = 2 * meanx * meany + c1
      const nb = 2 * cov + c2
      const da = meanx * meanx + meany * meany + c1
      const db = varx + vary + c2
      const ssim = (na * nb) / (da *db)
      ssims[w] = ssim
      // mssim = mssim + (ssim - mssim) / (i + 1)
      cumulativeSsim += ssim;
    }
  }

  const windowOffset = windowSize - 1;
  // There will be lots of branch misses if we don't split the w==0 and h==0 cases
  for (let h = 1; h < windowHeight; ++h) {
    // Now the w=0 on each line
    let bottomOffset = (h+windowSize-1)*width;
    let topOffset = (h-1)*width;
    {
      // since there is no left side we can skip two operations
      const topx = sumMatrix[topOffset+ windowOffset];
      const bottomx = sumMatrix[bottomOffset+ windowOffset];
      const meanx = (bottomx - topx) * reciprocalWindowSquared;
      const topy = sumMatrix[imageSize + topOffset+ windowOffset];
      const bottomy = sumMatrix[imageSize + bottomOffset+ windowOffset];
      const meany = (bottomy - topy) * reciprocalWindowSquared;
      const topxx = sumMatrix[2*imageSize + topOffset+ windowOffset];
      const bottomxx = sumMatrix[2*imageSize + bottomOffset+ windowOffset];
      const varx = (bottomxx-topxx)  * reciprocalWindowSquared - meanx*meanx ;
      const topyy = sumMatrix[3*imageSize + topOffset+ windowOffset];
      const bottomyy = sumMatrix[3*imageSize + bottomOffset+ windowOffset];
      const vary = (bottomyy-topyy)  * reciprocalWindowSquared - meany*meany;
      const topxy = sumMatrix[4*imageSize + topOffset+ windowOffset];
      const bottomxy = sumMatrix[4*imageSize + bottomOffset+ windowOffset];
      const cov = (bottomxy-topxy)  * reciprocalWindowSquared - meanx*meany;
      const na = 2 * meanx * meany + c1
      const nb = 2 * cov + c2
      const da = meanx * meanx + meany * meany + c1
      const db = varx + vary + c2
      const ssim = (na * nb) / (da *db)
      ssims[h*windowWidth] = ssim
      // mssim = mssim + (ssim - mssim) / (i + 1)
      cumulativeSsim += ssim;
    }


    for (let w = 1; w < windowWidth; ++w) {
      // add top left sub top right sub bottom left add bottom right
      const rightOffset = w + windowSize - 1;
      const leftOffset = w - 1;
      const meanx = (sumMatrix[topOffset + leftOffset]
        - sumMatrix[topOffset+ rightOffset]
        - sumMatrix[bottomOffset+leftOffset]
        + sumMatrix[bottomOffset+ rightOffset]) * reciprocalWindowSquared;
      const meany = (sumMatrix[imageSize+ topOffset + leftOffset]
        - sumMatrix[imageSize + topOffset+ rightOffset]
        - sumMatrix[imageSize + bottomOffset+leftOffset]
        + sumMatrix[imageSize + bottomOffset+ rightOffset]) * reciprocalWindowSquared;
      const varx = (sumMatrix[2*imageSize+ topOffset + leftOffset]
        - sumMatrix[2*imageSize + topOffset+ rightOffset]
        - sumMatrix[2*imageSize + bottomOffset+leftOffset]
        + sumMatrix[2*imageSize + bottomOffset+ rightOffset]) * reciprocalWindowSquared - meanx*meanx ;
      const vary = (sumMatrix[3*imageSize+ topOffset + leftOffset]
        - sumMatrix[3*imageSize + topOffset+ rightOffset]
        - sumMatrix[3*imageSize + bottomOffset+leftOffset]
        + sumMatrix[3*imageSize + bottomOffset+ rightOffset])  * reciprocalWindowSquared - meany*meany;
      const cov = (sumMatrix[4*imageSize+ topOffset + leftOffset]
        - sumMatrix[4*imageSize + topOffset+ rightOffset]
        - sumMatrix[4*imageSize + bottomOffset+leftOffset]
        + sumMatrix[4*imageSize + bottomOffset+ rightOffset])  * reciprocalWindowSquared - meanx*meany;
      const na = 2 * meanx * meany + c1
      const nb = 2 * cov + c2
      const da = meanx * meanx + meany * meany + c1
      const db = varx + vary + c2
      const ssim = (na * nb) / (da * db)
      ssims[h*windowWidth+w] = ssim
      cumulativeSsim += ssim;
      // mssim = mssim + (ssim - mssim) / (i + 1)
    }
  }
  const mssim = cumulativeSsim / (windowHeight*windowWidth);


  return { data: ssims,height,mssim }
}

Third: the current SIMD prefix sum.

std::unique_ptr<uint32_t[],boost::alignment::aligned_delete> computeSumMatrixForwardSimd2PassAll(uint8_t const* pImgData1,uint8_t const* pImgData2,unsigned width,unsigned height)
{
  using namespace simdpp;
  // assumed alias: 16 lanes processed as 4 SSE sub-vectors (consistent with the static_asserts below)
  using uint32v4 = uint32<16>;
  std::unique_ptr<uint32_t[],boost::alignment::aligned_delete> sumArray((uint32_t*)boost::alignment::aligned_alloc(64,5*width*height*sizeof(uint32_t)));
  auto pSumArray = sumArray.get();
  BOOST_ALIGN_ASSUME_ALIGNED(pImgData1,64);
  BOOST_ALIGN_ASSUME_ALIGNED(pImgData2,64);
  BOOST_ALIGN_ASSUME_ALIGNED(pSumArray,64);

  uint32x4 zero = make_zero();
  for (unsigned h = 0; h < height; ++h)
  {
    uint32x4 lastValSplatX = zero;
    uint32x4 lastValSplatY = zero;
    uint32x4 lastValSplatXX = zero;
    uint32x4 lastValSplatYY = zero;
    uint32x4 lastValSplatXY = zero;
    for (unsigned w = 0; w < width; w += 16)
    {
      // starting left value
      // previous line values..
      prefetch_read(pImgData1+(w+1)*64);
      prefetch_read(pImgData2+(w+1)*64);
      uint32v4 imgdatax = to_uint32(uint8x16(load(pImgData1 + h * width + w)));
      uint32v4 imgDataY = to_uint32(uint8x16(load(pImgData2 + h * width + w)));
      static_assert(uint32v4::vec_length == 4);
      static_assert(sizeof(uint32v4::base_vector_type::native_type) == 16);
      for (unsigned i = 0 ; i < uint32v4::vec_length; ++i) {
        // a_0 a_1 a_2 a_3
        uint32v4::base_vector_type   x = imgdatax.vec(i);
        uint32v4::base_vector_type  y = imgDataY.vec(i);
        uint32v4::base_vector_type xx = mul_lo(x,x);
        uint32v4::base_vector_type yy = mul_lo(y,y);
        uint32v4::base_vector_type xy = mul_lo(x,y);
        // a_0 a_0+a_1 a_1+a_2 a_2+a_3
        x = add(x,move4_r<1>(x));
        x = add(x,move4_r<2>(x));
        x = add(x,lastValSplatX);
        lastValSplatX = permute4<3,3,3,3>(x);
        store(pSumArray+h*width+w+i*4,x);
        y = add(y,move4_r<1>(y));
        y = add(y,move4_r<2>(y));
        y = add(y,lastValSplatY);
        lastValSplatY = permute4<3,3,3,3>(y);
        store(width*height+pSumArray+h*width+w+i*4,y);
        xx = add(xx,move4_r<1>(xx));
        xx = add(xx,move4_r<2>(xx));
        xx = add(xx,lastValSplatXX);
        lastValSplatXX = permute4<3,3,3,3>(xx);
        store(2*width*height+pSumArray+h*width+w+i*4,xx);
        yy = add(yy,move4_r<1>(yy));
        yy = add(yy,move4_r<2>(yy));
        yy = add(yy,lastValSplatYY);
        lastValSplatYY = permute4<3,3,3,3>(yy);
        store(3*width*height+pSumArray+h*width+w+i*4,yy);
        xy = add(xy,move4_r<1>(xy));
        xy = add(xy,move4_r<2>(xy));
        xy = add(xy,lastValSplatXY);
        lastValSplatXY = permute4<3,3,3,3>(xy);
        store(4*width*height+pSumArray+h*width+w+i*4,xy);
      }
    }
  }
  // 16 bit 8s for grins...
  // a_0 a_1 a_2 a_3 a_4 a_5 a_6 a_7
  // a_0 a_0+a_1 a_1+a_2 a_2+a_3 a_3+a_4 a_4+a_5 a_5+a_6 a_6+a_7 (>>1)
  // d    d        -a_0  -a_0+a_1 -a_0+a_1+a_2 -a_0+a_1+a_2+a_3 -a_0+a_1+a_2+a_3+a_4 - -a_0+a_1+a_2+a_3+a_4+a_5
  // d    d        shuffle shuffle shuffle+add    shuffle+add     shuffle+add+shuffle+add  shuffle+add+shuffle+add


  // a_0 a_1 a_2 a_3     a_4         a_5           a_6                    a_7
  //         a_0 a_1     a_2         a_3           a_4                    a_5
  //         a_1 a_2+a_0 a_3+a_1-a_0 a_4+a_2-a1-a0 a_5+a_3-a_2-a_0        a_6+a_4...
  for (unsigned i = 0; i < 5; ++i) {
    for (unsigned h = 0; h+1 < height; ++h)
    {
      for (unsigned w = 0; w < width; w += 16) {
        uint32v4 above = load(i*width*height +pSumArray + h * width + w);
        uint32v4 current = load(i*width*height+pSumArray +(h+1) * width +w);
        store(i*width*height+pSumArray +(h+1) * width +w,add(above,current));
      }
    }
  }

  return sumArray;
}

Answers

First - is this the best way to do it? I could build the entire prefix-sum array as doubles and have no conversion step.

Generally, integer computations are noticeably faster than floating-point computations with the same number of elements per vector. For example, on Skylake paddd has a latency of 1 and a reciprocal throughput of 0.33 cycles, while for addps the numbers are 4 and 0.5. Integer computations do sometimes incur overhead when multiplications are involved, because you have to convert between different element sizes or combine the lower and upper halves of the products. But the result is usually still faster than FP, and the overhead can be reduced further if you are able to leverage the pmaddwd or pmaddubsw instructions (which are essentially integer versions of FMA), or if you can simply discard the lower or upper half of the product.
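
As a small illustration of the pmaddwd point (my own sketch with SSE2 intrinsics, not code from the answer): _mm_madd_epi16 multiplies signed 16-bit lanes pairwise and adds adjacent products into 32-bit lanes, so the multiply and part of the reduction happen in one instruction.

#include <emmintrin.h> // SSE2

// Multiply eight signed 16-bit lanes of a16 and b16 pairwise, add adjacent
// products into four 32-bit lanes (pmaddwd), then accumulate into acc32.
inline __m128i dot16Accumulate( __m128i a16, __m128i b16, __m128i acc32 )
{
    const __m128i prod = _mm_madd_epi16( a16, b16 );
    return _mm_add_epi32( acc32, prod );
}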

Speaking of FMA, FP does have the advantage that AVX2 brings FMA instructions, which give you a free addition with every multiplication. Whether that is beneficial in your case depends on your algorithm and input data, but with integer input I would still prefer to do the computation on it in integers wherever possible.
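
For completeness, a minimal sketch of what that looks like with intrinsics (assuming an FMA-capable target such as Haswell or later, compiled with -mfma; this snippet is mine, not from the answer):

#include <immintrin.h>

// One instruction computes a*b + c on 8 packed floats (vfmadd*ps); pd
// variants exist for doubles. Without FMA this is a separate mul and add.
inline __m256 mulAdd( __m256 a, __m256 b, __m256 c )
{
    return _mm256_fmadd_ps( a, b, c );
}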

Another advantage of integer computations over FP is that you can work on smaller elements, which means you process more data per instruction. Of course this is only possible when your input and your algorithm allow it, but in image processing in particular that is often the case.
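
A small sketch of that point (again only an illustration, not from the answer): psadbw against zero reduces 16 input bytes to partial sums in a single SSE2 instruction, something no FP format of the same register width can match.

#include <emmintrin.h> // SSE2

// Sum 16 unsigned 8-bit pixels in one instruction: psadbw against zero
// places the sum of each 8-byte half into the two 64-bit lanes of the result.
inline __m128i sumSixteenBytes( __m128i pixels )
{
    return _mm_sad_epu8( pixels, _mm_setzero_si128() );
}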

The last point on this part is that you should take into account how much data each stage of the algorithm has to process. Converting INT to FP is not free, so the less data you have to convert, the better. If your intermediate data is smaller than your input, deferring the conversion to that later stage is a win.

Second - if this is the right approach, is there much benefit to using packed double SIMD operations? As far as I can tell I would only get 2 lanes at a time.

Well, two doubles per instruction is still better than one, so in my book it is worth it. You can also perform more complex operations on vectors than on scalars, such as FMA, masking, blending and min/max (although the compiler may generate some of these for you even in scalar code). And if AVX is detected at runtime, the throughput can be doubled again.

Beyond that, you should also consider whether 32-bit floating-point precision is sufficient for you. You may not actually need double-precision results, and with some FP techniques you can reduce the error that 32-bit FP math accumulates while still keeping better performance than 64-bit computations.
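
To make the conversion trade-off concrete, here is a small sketch of the relevant SSE2 conversion intrinsics (my own addition, not part of the answer): a 128-bit register converts to only two doubles at a time, but to four floats.

#include <emmintrin.h> // SSE2

// cvtdq2pd converts only the two low 32-bit integers of v to two doubles;
// cvtdq2ps converts all four lanes to floats in one go. (With AVX,
// _mm256_cvtepi32_pd widens four integers to four doubles at once.)
inline __m128d lowTwoLanesToDouble( __m128i v ) { return _mm_cvtepi32_pd( v ); }
inline __m128  fourLanesToFloat( __m128i v )    { return _mm_cvtepi32_ps( v ); }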

Third - should I pack the data elements together or leave them in the planar format I am currently using?

Generally, you should prefer planar input data. SIMD in general, and SSE/AVX in particular, is better suited to vertical operations (i.e. when an operation is performed between corresponding elements of different vectors) than to horizontal ones (when an operation is performed between elements of the same vector). With packed input you would likely have to unpack and shuffle the data first, which adds overhead. Modern CPUs can track multiple read or write streams to and from memory, so the hardware prefetcher should cope with linear memory accesses over the different data planes.
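
As a rough illustration of the unpacking overhead packed data would force on you (a sketch assuming SSSE3; none of this is needed with planar input), even pulling one channel out of packed RGBA already costs a byte shuffle per register:

#include <tmmintrin.h> // SSSE3

// Gather the R bytes of 4 packed RGBA pixels into the low 4 bytes of the
// result; control bytes with the high bit set (-1) produce zeros (pshufb).
inline __m128i extractRFromRgba( __m128i rgba4px )
{
    const __m128i idx = _mm_setr_epi8( 0, 4, 8, 12,
                                       -1, -1, -1, -1,
                                       -1, -1, -1, -1,
                                       -1, -1, -1, -1 );
    return _mm_shuffle_epi8( rgba4px, idx );
}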

Another answer:

Part 3 - at the very least, the Intel® 64 and IA-32 Architectures Optimization Reference Manual names planar as the generally preferable format, so that one I can settle right away. It also looks like I can use float values rather than doubles for windows up to 16x16 (255 * 256 ** 2), since that stays exactly representable before overflow. Seen from that angle I benefit from 4 floats per vector instead of 2 doubles. It also looks like I can carry the sums for the mean computation as four floats and do the division at the very end. So if I bound the function by the maximum window size, I can decide when to use float and when to use double; the implementations are not very different.
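
A quick sanity check of that overflow reasoning (my own sketch, assuming 8-bit input so the worst per-pixel square is 255*255; a float represents integers exactly up to 2^24):

#include <cstdint>
#include <cstdio>

int main()
{
    const uint64_t floatExactLimit = 1ull << 24; // 16,777,216
    for (unsigned windowSize = 8; windowSize <= 32; windowSize *= 2)
    {
        // Worst-case sum of squares inside a windowSize x windowSize window.
        const uint64_t worstCase = uint64_t(windowSize) * windowSize * 255u * 255u;
        std::printf("window %2u: worst case %llu -> %s\n",
                    windowSize, (unsigned long long)worstCase,
                    worstCase <= floatExactLimit ? "float stays exact" : "needs double");
    }
}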

I suspect I will not really know how it performs until I try it and see.

Another answer:

Regarding part 1: depending on how many integers you have and their range, it may be better to use 64-bit integers instead of doubles. They are generally slightly faster, especially in latency, and more precise as well. Here is some SSE2 code for that.

#include <emmintrin.h> // SSE2
#include <climits>     // UINT_MAX
#include <cstdint>

// Return acc64 += a32. The a32 is viewed as uint32_t lanes, the accumulator is uint64_t
inline __m128i integerAdd( __m128i a32, __m128i acc64 )
{
    const __m128i low = _mm_and_si128( a32,_mm_set1_epi64x( UINT_MAX ) );
    acc64 = _mm_add_epi64( acc64,low );
    const __m128i high = _mm_srli_epi64( a32,32 );
    acc64 = _mm_add_epi64( acc64,high );
    return acc64;
}

// Compute a32 * b32,add to the accumulator. The two inputs are viewed as uint32_t lanes,the accumulator is uint64_t
inline __m128i integerFma( __m128i a32,__m128i b32,__m128i acc64 )
{
    const __m128i low = _mm_mul_epu32( a32,b32 );
    a32 = _mm_srli_si128( a32,4 );
    b32 = _mm_srli_si128( b32,4 );
    const __m128i high = _mm_mul_epu32( a32,b32 );
    acc64 = _mm_add_epi64( acc64,low );
    acc64 = _mm_add_epi64( acc64,high );
    return acc64;
}

// Add both 64-bit lanes of the accumulator,convert to double
inline double accumulatedValue( __m128i acc64 )
{
    acc64 = _mm_add_epi64( acc64,_mm_unpackhi_epi64( acc64,acc64 ) );
    const uint64_t v = (uint64_t)_mm_cvtsi128_si64( acc64 );
    return (double)v;
}

Update: the mean is sum(x) / count. You do not need fractions to compute a sum of integers, you only need 64-bit integers so that it does not overflow (if you have many values). Only the very last instruction needs non-integer math.

For the variance you need sum((x-mean)^2). That can be rewritten as sum(x^2) + mean * (mean * count - 2 * sum(x)). Now it is obvious that you only need a single pass over the input data, computing both the sum of squares and the sum of the values. Computing the sum of squares can overflow a 64-bit integer if you have too many inputs that are too large, but for many practical applications it will not: e.g. if your input is in the range [0 .. 65535], a 64-bit integer only overflows after about 4 billion samples, and you probably do not have that many.
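
To show how the three helpers above fit together, here is a hedged usage sketch (the function name meanAndVariance, the loop bounds, and the assumption that count is a multiple of 4 are mine, not part of the answer): one integer pass accumulates sum(x) and sum(x^2), and only the final few operations use doubles.

#include <emmintrin.h>
#include <cstdint>
#include <cstddef>

// Single pass over count uint32_t samples (count assumed to be a multiple of 4),
// accumulating sum(x) and sum(x^2) with the helpers above, then deriving the
// mean and variance in double only at the very end.
inline void meanAndVariance( const uint32_t* values, size_t count,
                             double& mean, double& variance )
{
    __m128i sum   = _mm_setzero_si128();
    __m128i sumSq = _mm_setzero_si128();
    for (size_t i = 0; i < count; i += 4)
    {
        const __m128i x = _mm_loadu_si128( (const __m128i*)( values + i ) );
        sum   = integerAdd( x, sum );        // sum(x)
        sumSq = integerFma( x, x, sumSq );   // sum(x^2)
    }
    const double s  = accumulatedValue( sum );
    const double ss = accumulatedValue( sumSq );
    mean = s / (double)count;
    // sum((x-mean)^2) = sum(x^2) + mean*(mean*count - 2*sum(x))
    variance = ( ss + mean * ( mean * (double)count - 2.0 * s ) ) / (double)count;
}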