Problem description
Using const on the mask[] array in this particular code causes a performance drop. (The code is taken from "Is there an intrinsic function to zero out the last n bytes of a __m128i vector?")
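For readers unfamiliar with the trick, the listing in this question builds its byte mask by loading 16 bytes from &mask[16 - n]. The following is a minimal scalar sketch of that sliding-window idea, assuming the table holds 16 zero bytes followed by 16 0xFF bytes. The names maskWindow and zeroLowestNBytesScalar are hypothetical and exist only for illustration; they are not part of the original code.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical helper: returns the 16-byte window that starts at offset 16 - n
// of a 32-byte table laid out as 16 x 0x00 followed by 16 x 0xFF (assumed layout).
std::array<uint8_t, 16> maskWindow(unsigned n) {
    assert(n <= 16);
    uint8_t table[32] = {0};
    std::memset(table + 16, 0xFF, 16);
    std::array<uint8_t, 16> window{};
    std::memcpy(window.data(), table + (16 - n), 16);
    return window;  // n zero bytes, then (16 - n) bytes of 0xFF
}

// Scalar model of ANDing a 16-byte vector with that window:
// bytes 0 .. n-1 are cleared, the remaining bytes are unchanged.
std::array<uint8_t, 16> zeroLowestNBytesScalar(std::array<uint8_t, 16> x, unsigned n) {
    const std::array<uint8_t, 16> m = maskWindow(n);
    for (int i = 0; i < 16; ++i) x[i] &= m[i];
    return x;
}
```

With n = 6 the window starts at offset 10, so it contains the last six zero bytes of the table followed by ten 0xFF bytes. This is why a single unaligned load can stand in for a per-n mask constant.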
#include <benchmark/benchmark.h>
#include <immintrin.h>
#include <stdalign.h>
#include <stdint.h>
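// 16 zero bytes followed by 16 0xFF bytes, so a 16-byte load at mask[16 - n]
// yields n zero bytes followed by (16 - n) 0xFF bytes.
alignas(32) const char mask[] = {0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
                                 -1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1};
// Zeroes the n lowest bytes of x (n must be in 0..16).
inline __m128i zeroLowestNBytes(__m128i x, uint32_t n) {
    __m128i m = _mm_loadu_si128((const __m128i*)&mask[16 - n]);
    return _mm_and_si128(x, m);
}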
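// Uses mask array
static void probe_sse(benchmark::State& state) {
    uint16_t hash[16] = {3,4,100,52,22,12,53,32,45,67,23,66};
    uint8_t ele = 100;
    // Assumed byte values: the 12 entries of hash[] followed by four zeros
    // (_mm_set_epi8 takes exactly 16 arguments, highest byte first).
    __m128i _hash = _mm_set_epi8(3,4,100,52,22,12,53,32,45,67,23,66,0,0,0,0);
    __m128i _ele = _mm_set1_epi8(ele);
    volatile uint16_t tmp;
    uint16_t match;
    int i;
    for (auto _ : state) {
        // Compare all 16 bytes, zero the lowest 6 byte lanes, then collect one bit per byte.
        match = _mm_movemask_epi8(zeroLowestNBytes(_mm_cmpeq_epi8(_ele,_hash),6));
        tmp = match;
        // Visit each set bit: take the index of the lowest set bit, then clear it.
        while (tmp) {
            i = _tzcnt_u32(tmp);
            tmp = _blsr_u32(tmp);
        }
    }
}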
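// ------------------------------------------------------------------
// Doesn't use mask array
static void probe_bitwise_and(benchmark::State& state) {
    uint16_t hash[16] = {3,4,100,52,22,12,53,32,45,67,23,66};
    uint8_t ele = 100;
    // Assumed to mirror probe_sse: the 12 entries of hash[] followed by four zeros
    // (_mm_set_epi8 takes exactly 16 arguments, highest byte first).
    __m128i _hash = _mm_set_epi8(3,4,100,52,22,12,53,32,45,67,23,66,0,0,0,0);
    __m128i _ele = _mm_set1_epi8(ele);
    volatile uint16_t tmp;
    uint16_t match;
    int i;
    for (auto _ : state) {
        match = _mm_movemask_epi8(_mm_cmpeq_epi8(_ele,_hash));
        // Clear the lowest 6 bits of the bitmap instead of masking the vector.
        tmp = match & 0b1111111111000000;
        while (tmp) {
            i = _tzcnt_u32(tmp);
            tmp = _blsr_u32(tmp);
        }
    }
}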
// ------------------ Run the benchmark -----------------------
BENCHMARK(probe_sse);
BENCHMARK(probe_bitwise_and);
BENCHMARK_MAIN();
clang++ bm.cc -std=c++11 -isystem benchmark/include -Lbenchmark/build/src -lbenchmark -lpthread -o bm -march=native -O3 && ./bm
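The two benchmarks are meant to compute the same match bitmap in two ways: probe_sse masks the comparison vector before _mm_movemask_epi8, while probe_bitwise_and masks the resulting bitmap. The sketch below checks that equivalence outside the benchmark harness. It assumes an x86 target with SSE2 and the 16-zero/16-0xFF table layout; kMask, matchBitsVectorMask, matchBitsScalarMask and setBitPositions are hypothetical names introduced here, and __builtin_ctz (a GCC/Clang builtin) is used as a stand-in for _tzcnt_u32 so that no BMI flags are required.

```cpp
#include <cassert>
#include <cstdint>
#include <emmintrin.h>  // SSE2 intrinsics
#include <vector>

// Assumed table layout: 16 zero bytes followed by 16 0xFF bytes.
alignas(32) const signed char kMask[32] = {
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1};

// Variant 1: zero the lowest n byte lanes of the compare result, then movemask.
inline uint16_t matchBitsVectorMask(__m128i haystack, uint8_t needle, unsigned n) {
    assert(n <= 16);
    const __m128i eq = _mm_cmpeq_epi8(_mm_set1_epi8(static_cast<char>(needle)), haystack);
    const __m128i m = _mm_loadu_si128(reinterpret_cast<const __m128i*>(&kMask[16 - n]));
    return static_cast<uint16_t>(_mm_movemask_epi8(_mm_and_si128(eq, m)));
}

// Variant 2: movemask first, then clear the lowest n bits of the 16-bit bitmap.
inline uint16_t matchBitsScalarMask(__m128i haystack, uint8_t needle, unsigned n) {
    assert(n <= 16);
    const __m128i eq = _mm_cmpeq_epi8(_mm_set1_epi8(static_cast<char>(needle)), haystack);
    const uint32_t bits = static_cast<uint32_t>(_mm_movemask_epi8(eq));
    return static_cast<uint16_t>(bits & (0xFFFFu << n));
}

// Portable stand-in for the _tzcnt_u32 / _blsr_u32 loop: lists set-bit positions.
inline std::vector<int> setBitPositions(uint32_t bits) {
    std::vector<int> out;
    while (bits) {
        out.push_back(__builtin_ctz(bits));  // index of the lowest set bit
        bits &= bits - 1;                    // clear the lowest set bit
    }
    return out;
}
```

For n = 6 the scalar mask (0xFFFFu << 6) truncated to 16 bits is 0xFFC0, which is the 0b1111111111000000 constant used in probe_bitwise_and. Note that the 0b binary-literal syntax is standard only from C++14 on; clang accepts it as an extension when compiling with -std=c++11.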
Benchmark results for the code above on an Intel i5-5250U (Broadwell) CPU in a 2015 13-inch MacBook Air:
Run on (4 X 1600 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x2)
  L1 Instruction 32 KiB (x2)
  L2 Unified 256 KiB (x2)
  L3 Unified 3072 KiB (x1)
Load Average: 1.41, 1.74, 2.23
***WARNING*** Library was built as DEBUG. Timings may be affected.
------------------------------------------------------------
Benchmark                  Time             CPU   Iterations
------------------------------------------------------------
probe_sse               4.71 ns         4.67 ns    147819348
probe_bitwise_and       4.83 ns         4.80 ns    148097059
However, removing the probe_bitwise_and function makes the problem go away. Benchmark after removing the probe_bitwise_and function from my code:
Run on (4 X 1600 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x2)
  L1 Instruction 32 KiB (x2)
  L2 Unified 256 KiB (x2)
  L3 Unified 3072 KiB (x1)
Load Average: 1.94, 1.80, 2.05
***WARNING*** Library was built as DEBUG. Timings may be affected.
-----------------------------------------------------
Benchmark           Time             CPU   Iterations
-----------------------------------------------------
probe_sse        3.75 ns         3.67 ns    202104777
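I ran the benchmark repeatedly and the results are consistent. const should not affect run-time performance, but in this particular case it does. Can someone explain why this happens?
(Editor's note: the code under test appears to have been compiled with clang++ -O3 -march=native, but the benchmark library itself is an unoptimized(?) debug build.)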