uint8 to float using SIMD Neon intrinsics

Problem description

I am trying to optimize code that converts grayscale images to float images and runs on Neon A64 / v8.

The current implementation, which uses OpenCV's convertTo() (compiled for Android), is fairly fast, but it is still our bottleneck.

So I came up with the following code and would like to hear about possible improvements.

If it helps, the image height and width are multiples of 16.

I am running this in a for loop:

#include <arm_neon.h>

// Converts 8 grayscale bytes to 8 floats, preserving element order.
static void u8_2_f(const unsigned char* in, float* out)
{
    // 1: u8x8 -> u16x8
    uint8x8_t u8x8src = vld1_u8(in);
    uint16x8_t u16x8src = vmovl_u8(u8x8src);

    // 2: u16x8 -> u32x4 low, u32x4 high
    uint32x4_t u32x4srcl = vmovl_u16(vget_low_u16(u16x8src));
    uint32x4_t u32x4srch = vmovl_u16(vget_high_u16(u16x8src));

    // 3: u32x4 -> f32x4; the low half (in[0..3]) goes to out[0..3],
    //    the high half (in[4..7]) goes to out[4..7]
    vst1q_f32(out, vcvtq_f32_u32(u32x4srcl));
    vst1q_f32(out + 4, vcvtq_f32_u32(u32x4srch));
}
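
For illustration, the surrounding loop might look like the sketch below. It is not the actual loop from the question: `convertU8ToF32` and its parameters are hypothetical names, the buffer is assumed to be contiguous, and 16 pixels are handled per iteration on the assumption that the pixel count is a multiple of 16. A scalar path covers non-NEON builds and any leftover elements.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

// Hypothetical driver: converts `count` grayscale bytes to floats, in order.
inline void convertU8ToF32(const std::uint8_t* in, float* out, std::size_t count)
{
    assert(count == 0 || (in != nullptr && out != nullptr));
    std::size_t i = 0;
#if defined(__ARM_NEON)
    // NEON path: load 16 bytes, widen u8 -> u16 -> u32, convert, store 16 floats.
    for (; i + 16 <= count; i += 16)
    {
        const uint8x16_t u8 = vld1q_u8(in + i);
        const uint16x8_t lo16 = vmovl_u8(vget_low_u8(u8));   // bytes 0..7
        const uint16x8_t hi16 = vmovl_u8(vget_high_u8(u8));  // bytes 8..15
        vst1q_f32(out + i,      vcvtq_f32_u32(vmovl_u16(vget_low_u16(lo16))));
        vst1q_f32(out + i + 4,  vcvtq_f32_u32(vmovl_u16(vget_high_u16(lo16))));
        vst1q_f32(out + i + 8,  vcvtq_f32_u32(vmovl_u16(vget_low_u16(hi16))));
        vst1q_f32(out + i + 12, vcvtq_f32_u32(vmovl_u16(vget_high_u16(hi16))));
    }
#endif
    // Scalar tail; this is also the whole conversion on targets without NEON.
    for (; i < count; ++i)
        out[i] = static_cast<float>(in[i]);
}
```

Loading a full 128-bit register per iteration halves the number of loop iterations compared with an 8-byte kernel; whether that helps in practice depends on the CPU and on memory bandwidth, so it is worth measuring.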

Solution

As a possible improvement, try replacing vcvtq_f32_u32 with this function. It takes 2 instructions instead of 1, but on some CPUs they may be faster.

#include <arm_neon.h>
#include <cstdint>

// Convert bytes to float, assuming the input is within the [ 0 .. 0xFF ] interval
inline float32x4_t byteToFloat( uint32x4_t u32 )
{
    // Floats have 23 bits of mantissa.
    // We want the least significant 8 bits to map to [ 0 .. 255 ], therefore we need to add 2^23
    // See this page for details: https://www.h-schmidt.net/FloatConverter/IEEE754.html
    // If you want output floats in the [ 0 .. 255.0 / 256.0 ] interval, change it to 2^15 = 0x47000000
    constexpr uint32_t offsetValue = 0x4b000000;
    // Check the disassembly and verify your compiler has moved this initialization outside the loop
    const uint32x4_t offsetInt = vdupq_n_u32( offsetValue );
    // Bitwise OR is probably slightly faster than addition and delivers the same results for our input
    u32 = vorrq_u32( u32, offsetInt );
    // The only FP operation required is subtraction, hopefully faster than UCVTF
    return vsubq_f32( vreinterpretq_f32_u32( u32 ), vreinterpretq_f32_u32( offsetInt ) );
}
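
To see why the OR-then-subtract trick is exact, here is a minimal scalar sketch that can be run on any platform. `byteToFloatScalar` and `trickMatchesCastForAllBytes` are hypothetical names introduced for this illustration; the sketch models a single lane of the vector function, with `std::memcpy` standing in for `vreinterpretq_f32_u32`.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Scalar model of one lane of the trick (hypothetical helper, not part of the answer).
inline float byteToFloatScalar(std::uint32_t value)
{
    // Bytes always fit in the low mantissa bits; larger inputs are out of scope here.
    assert(value <= 0xFFu);
    constexpr std::uint32_t offsetBits = 0x4b000000u;  // bit pattern of 2^23 = 8388608.0f
    const std::uint32_t bits = value | offsetBits;     // bit pattern of 2^23 + value
    float biased, offset;
    std::memcpy(&biased, &bits, sizeof biased);        // reinterpret bits as float
    std::memcpy(&offset, &offsetBits, sizeof offset);
    return biased - offset;  // exact: both operands are integers below 2^24
}

// Returns true if the trick agrees with a plain cast for every byte value.
inline bool trickMatchesCastForAllBytes()
{
    for (std::uint32_t v = 0; v <= 0xFFu; ++v)
        if (byteToFloatScalar(v) != static_cast<float>(v))
            return false;
    return true;
}
```

For example, an input of 200 gives the bit pattern 0x4b0000c8, which is 8388808.0f as a float; subtracting 8388608.0f leaves exactly 200.0f.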