Using "const" on a vector lookup table indexed by another constant causes performance degradation

Problem description

Using const on the mask[] array in this particular code causes a performance degradation. (The code is taken from "Is there an intrinsic function to zero out the last n bytes of a __m128i vector?")

#include <benchmark/benchmark.h>
#include <immintrin.h>
#include <stdalign.h>
#include <stdint.h>

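// 16 zero bytes followed by 16 all-ones bytes, so that a 16-byte load from
// &mask[16 - n] yields n zero bytes followed by (16 - n) 0xFF bytes.
alignas(32) const char mask[32] = {0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
                                   -1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1};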

inline __m128i zeroLowestNBytes(__m128i x,uint32_t n) {
  // Unaligned load of a 16-byte window from the mask table (requires n <= 16).
  __m128i m = _mm_loadu_si128((const __m128i*)&mask[16 - n]);
  return _mm_and_si128(x,m);
}

// Uses mask array
static void probe_sse(benchmark::State& state) {
  uint16_t hash[16] = {3,4,100,52,22,12,53,32,45,67,23,66};

  uint8_t ele = 100;
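  // NOTE: the argument list is truncated in the source ("_mm_set_epi8(3,66)").
  // The 16 values below mirror hash[] padded with zeros; they are an assumption.
  __m128i _hash = _mm_set_epi8(3,4,100,52,22,12,53,32,45,67,23,66,0,0,0,0);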
  __m128i _ele = _mm_set1_epi8(ele);
  volatile uint16_t tmp;
  uint16_t match;
  int i;

  for (auto _ : state) {
    match = _mm_movemask_epi8(zeroLowestNBytes(_mm_cmpeq_epi8(_ele,_hash),6));
    tmp = match;

    while (tmp) {
      i = _tzcnt_u32(tmp);
      tmp = _blsr_u32(tmp);
    }
  }
}

// ------------------------------------------------------------------
// Doesn't use mask array
static void probe_bitwise_and(benchmark::State& state) {
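  uint16_t hash[16] = {3,4,100,52,22,12,53,32,45,67,23,66};

  uint8_t ele = 100;
  // NOTE: these lines are garbled in the source ("{3,66);"). They are restored
  // to match probe_sse; the _mm_set_epi8 values mirror hash[] padded with
  // zeros and are an assumption.
  __m128i _hash = _mm_set_epi8(3,4,100,52,22,12,53,32,45,67,23,66,0,0,0,0);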
  __m128i _ele = _mm_set1_epi8(ele);
  volatile uint16_t tmp;
  uint16_t match;
  int i;

  for (auto _ : state) {
    match = _mm_movemask_epi8(_mm_cmpeq_epi8(_ele,_hash));
    tmp = match & 0b1111111111000000;

    while (tmp) {
      i = _tzcnt_u32(tmp);
      tmp = _blsr_u32(tmp);
    }
  }
}

// ------------------ Run the benchmark -----------------------
BENCHMARK(probe_sse);
BENCHMARK(probe_bitwise_and);
BENCHMARK_MAIN();

I compile and run the code with the following command:

clang++ bm.cc -std=c++11 -isystem benchmark/include -Lbenchmark/build/src -lbenchmark -lpthread -o bm -march=native -O3 && ./bm

The benchmark results below are for the code above on an Intel i5-5250U Broadwell CPU (2015 13-inch MacBook Air):

Run on (4 X 1600 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x2)
  L1 Instruction 32 KiB (x2)
  L2 Unified 256 KiB (x2)
  L3 Unified 3072 KiB (x1)
Load Average: 1.41,1.74,2.23
***WARNING*** Library was built as DEBUG. Timings may be affected.
------------------------------------------------------------
Benchmark                  Time             CPU   Iterations
------------------------------------------------------------
probe_sse               4.71 ns         4.67 ns    147819348
probe_bitwise_and       4.83 ns         4.80 ns    148097059

However, removing the probe_bitwise_and function makes the problem go away. Benchmark results after removing probe_bitwise_and from my code:

Run on (4 X 1600 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x2)
  L1 Instruction 32 KiB (x2)
  L2 Unified 256 KiB (x2)
  L3 Unified 3072 KiB (x1)
Load Average: 1.94,1.80,2.05
***WARNING*** Library was built as DEBUG. Timings may be affected.
-----------------------------------------------------
Benchmark           Time             CPU   Iterations
-----------------------------------------------------
probe_sse        3.75 ns         3.67 ns    202104777

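I ran the benchmark repeatedly and the results are consistent. const should not affect runtime performance, but in this particular case it does. Can someone explain why this happens?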

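(Editor's note: it appears that the code under test was compiled with clang++ -O3 -march=native, but the benchmark library itself is an unoptimized (?) debug build.)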
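
For reference, the two benchmarked functions are meant to produce the same match bitmask: probe_sse clears the lowest byte lanes with a sliding window over the mask table before _mm_movemask_epi8, while probe_bitwise_and clears the low bits of the scalar result afterwards. The following is a minimal sketch of a check that both agree. It assumes the table is 16 zero bytes followed by 16 all-ones bytes; the names kMask, maskViaTable, maskViaBits and approachesAgree are hypothetical helpers, not part of the original code.

```cpp
#include <cassert>
#include <cstdint>
#include <emmintrin.h>  // SSE2 intrinsics

// Assumed table layout: 16 zero bytes followed by 16 all-ones bytes.
alignas(32) static const int8_t kMask[32] = {
    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
    -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1};

// Clears the lowest n bytes of x using an unaligned load from the table.
static inline __m128i zeroLowestNBytes(__m128i x, uint32_t n) {
  assert(n <= 16);  // a larger n would read before the start of kMask
  __m128i m =
      _mm_loadu_si128(reinterpret_cast<const __m128i*>(&kMask[16 - n]));
  return _mm_and_si128(x, m);
}

// Hypothetical helper: mask the comparison result in the vector domain.
static inline uint16_t maskViaTable(__m128i hash, uint8_t ele, uint32_t n) {
  __m128i eq = _mm_cmpeq_epi8(_mm_set1_epi8(static_cast<char>(ele)), hash);
  return static_cast<uint16_t>(_mm_movemask_epi8(zeroLowestNBytes(eq, n)));
}

// Hypothetical helper: mask the 16-bit movemask result in the scalar domain.
static inline uint16_t maskViaBits(__m128i hash, uint8_t ele, uint32_t n) {
  __m128i eq = _mm_cmpeq_epi8(_mm_set1_epi8(static_cast<char>(ele)), hash);
  uint32_t bits = static_cast<uint32_t>(_mm_movemask_epi8(eq));
  return static_cast<uint16_t>(bits & (0xFFFFu << n));  // valid for n in [0, 16]
}

// Returns true if both approaches give the same bitmask for every n in [0, 16].
static bool approachesAgree(__m128i hash, uint8_t ele) {
  for (uint32_t n = 0; n <= 16; ++n) {
    if (maskViaTable(hash, ele, n) != maskViaBits(hash, ele, n)) return false;
  }
  return true;
}
```

With n = 6, the scalar version corresponds to the constant 0b1111111111000000 used in probe_bitwise_and. Since both variants compute the same value from compile-time constants, any timing difference between them would have to come from code generation or from the measurement setup rather than from the result itself.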
