Problem description
Suppose I have a lot of uint32 values stored in aligned memory uint32 *p. How can I convert them to uint8 values using SIMD?
I see there is _mm256_cvtepi32_epi8 / vpmovdb, but it belongs to AVX-512, which my CPU doesn't support.
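For reference, here is a minimal scalar sketch of the conversion in question, assuming plain truncation (keeping the lowest byte of each value) is the desired behavior; the name truncateScalar is only for illustration:

void truncateScalar( const uint32_t* source, uint8_t* dest, size_t count )
{
    // Keep only the lowest byte of each 32-bit value
    for( size_t i = 0; i < count; i++ )
        dest[ i ] = (uint8_t)source[ i ];
}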
Solution
If you really have a lot of them, I would do something like this (untested).
The main loop handles 64 bytes, i.e. 16 uint32_t values, per iteration: it shuffles the bytes around to implement the truncation, merges the result into a single register, and writes 16 bytes with a vector store instruction.
#include <immintrin.h>
#include <cstdint>
#include <cstddef>

void convertToBytes( const uint32_t* source, uint8_t* dest, size_t count )
{
    // 4 bytes of the shuffle mask to fetch bytes 0, 4, 8 and 12 from a 16-byte source vector
    constexpr int shuffleScalar = 0x0C080400;
    // Mask to shuffle the first 8 values of the batch, making the first 8 bytes of the result
    const __m256i shuffMaskLow = _mm256_setr_epi32( shuffleScalar, -1, -1, -1, -1, shuffleScalar, -1, -1 );
    // Mask to shuffle the last 8 values of the batch, making the last 8 bytes of the result
    const __m256i shuffMaskHigh = _mm256_setr_epi32( -1, -1, shuffleScalar, -1, -1, -1, -1, shuffleScalar );
    // Indices for the final _mm256_permutevar8x32_epi32
    const __m256i finalPermute = _mm256_setr_epi32( 0, 5, 2, 7, 0, 5, 2, 7 );

    const uint32_t* const sourceEnd = source + count;
    // Vectorized portion, each iteration handles 16 values.
    // Round down the count making it a multiple of 16.
    const size_t countRounded = count & ~( (size_t)15 );
    const uint32_t* const sourceEndAligned = source + countRounded;
    while( source < sourceEndAligned )
    {
        // Load 16 inputs into 2 vector registers
        const __m256i s1 = _mm256_load_si256( ( const __m256i* )source );
        const __m256i s2 = _mm256_load_si256( ( const __m256i* )( source + 8 ) );
        source += 16;
        // Shuffle bytes into correct positions; this zeroes out the rest of the bytes.
        const __m256i low = _mm256_shuffle_epi8( s1, shuffMaskLow );
        const __m256i high = _mm256_shuffle_epi8( s2, shuffMaskHigh );
        // Unused bytes were zeroed out, using bitwise OR to merge, very fast.
        const __m256i res32 = _mm256_or_si256( low, high );
        // Final shuffle of the 32-bit values into correct positions
        const __m256i res16 = _mm256_permutevar8x32_epi32( res32, finalPermute );
        // Store lower 16 bytes of the result
        _mm_storeu_si128( ( __m128i* )dest, _mm256_castsi256_si128( res16 ) );
        dest += 16;
    }
    // Deal with the remainder
    while( source < sourceEnd )
    {
        *dest = (uint8_t)( *source );
        source++;
        dest++;
    }
}
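As a rough usage sketch (untested, assuming the convertToBytes above): allocate the source with 32-byte alignment, because _mm256_load_si256 requires an aligned pointer, and check the output against plain scalar truncation. The test values and buffer size are arbitrary.

#include <immintrin.h>
#include <cstdint>
#include <cstdio>
#include <vector>

int main()
{
    // Deliberately not a multiple of 16, so the scalar remainder loop also runs
    const size_t count = 1000;
    // _mm256_load_si256 requires 32-byte alignment; _mm_malloc provides it
    uint32_t* source = (uint32_t*)_mm_malloc( count * sizeof( uint32_t ), 32 );
    std::vector<uint8_t> dest( count );
    for( size_t i = 0; i < count; i++ )
        source[ i ] = (uint32_t)( i * 2654435761u );  // arbitrary test values
    convertToBytes( source, dest.data(), count );
    // Verify against plain scalar truncation
    size_t mismatches = 0;
    for( size_t i = 0; i < count; i++ )
        if( dest[ i ] != (uint8_t)source[ i ] )
            mismatches++;
    printf( "%zu mismatches\n", mismatches );
    _mm_free( source );
    return mismatches == 0 ? 0 : 1;
}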