_mm_stream_si128 is 2000% slower than _mm_store_si128

Problem description

I am writing some C code for a random number generator and am using the following code:

// header
#include <stdint.h>

typedef struct {
    // _mm_load_si128 / _mm_store_si128 require a 16-byte aligned address
    _Alignas(16) uint64_t values[2];
} fy5z_state_t;

fy5z_state_t fy5z_seed(uint64_t seed_value);
uint64_t fy5z_generate(fy5z_state_t* state);

// source
#include <emmintrin.h> // SSE2 intrinsics

fy5z_state_t fy5z_seed(uint64_t seed_value)
{
    fy5z_state_t state;
    state.values[0] = (seed_value & 0xFFFFFFFF);
    // Note: with a 32-bit int, 0xFFFFFFFF << 31 is computed as unsigned int and yields 0x80000000
    state.values[1] = (seed_value & (0xFFFFFFFF << 31)) + seed_value;
    return state;
}

uint64_t fy5z_generate(fy5z_state_t* state)
{
    // Load the 128-bit state, permute its 32-bit lanes, add bytewise, write it back
    __m128i got_data = _mm_load_si128((__m128i const*)(state->values));
    __m128i shuffled = _mm_shuffle_epi32(got_data, 0x8d);
    __m128i final_add = _mm_add_epi8(shuffled, got_data);
    _mm_store_si128((__m128i*)(state->values), final_add);
    return state->values[0];
}
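
The immediate 0x8d passed to _mm_shuffle_epi32 packs four 2-bit lane selectors, which is easy to misread. As an illustration only, the sketch below applies the same mask to four known 32-bit lanes so the permutation can be seen directly. The helper name shuffle_lanes_0x8d is hypothetical and not part of the original code.

```c
#include <emmintrin.h>
#include <stdint.h>

/* Hypothetical helper: apply the 0x8d shuffle to four 32-bit lanes. */
static void shuffle_lanes_0x8d(const uint32_t in[4], uint32_t out[4])
{
    /* Unaligned load/store so the caller's arrays need no special alignment. */
    __m128i v = _mm_loadu_si128((const __m128i *)in);

    /* 0x8d is binary 10 00 11 01, read from the low bits upward:
       out[0] = in[1], out[1] = in[3], out[2] = in[0], out[3] = in[2]. */
    __m128i s = _mm_shuffle_epi32(v, 0x8d);

    _mm_storeu_si128((__m128i *)out, s);
}
```

With input lanes {0, 1, 2, 3} the output lanes are {1, 3, 0, 2}.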

In addition, the following code is used to time the performance:

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>

#ifndef TRIAL_COUNT
#define TRIAL_COUNT (1024 * 1024 * 10)
#endif

static void print_time_us(const char* name, void (*fn)(void))
{
    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC_RAW, &start);
    fn();
    clock_gettime(CLOCK_MONOTONIC_RAW, &end);
    // Elapsed time in microseconds
    uint64_t delta_us = (end.tv_sec - start.tv_sec) * 1000000 + (end.tv_nsec - start.tv_nsec) / 1000;
    printf("Running: '%s' took %" PRIu64 " u/s\n", name, delta_us);
}

static void test_fy5z(void) {
    fy5z_state_t fseed = fy5z_seed(0xfab5381);
    unsigned long total = 0;
    for (int i = 0; i < TRIAL_COUNT; ++i)
    {
        total += fy5z_generate(&fseed);
    }
}
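
The elapsed-time arithmetic in print_time_us can be checked in isolation. The following is a minimal sketch of that calculation pulled out into its own function. The name timespec_delta_us is hypothetical and does not appear in the original code, and the signed 64-bit return type is a choice made here so that a negative nanosecond difference is handled visibly.

```c
#include <stdint.h>
#include <time.h>

/* Hypothetical helper: microseconds elapsed between two timespec readings.
   The nanosecond difference may be negative when end.tv_nsec < start.tv_nsec;
   the seconds term compensates for it. */
static int64_t timespec_delta_us(struct timespec start, struct timespec end)
{
    int64_t sec  = (int64_t)end.tv_sec  - (int64_t)start.tv_sec;
    int64_t nsec = (int64_t)end.tv_nsec - (int64_t)start.tv_nsec;
    return sec * 1000000 + nsec / 1000;
}
```

For example, a start of 1.9 s and an end of 2.1 s gives 1000000 + (-800000) = 200000 microseconds.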

I found that when the generator function uses _mm_store_si128, I get: Running: 'fy5z' took 113328 u/s, but if I swap it for _mm_stream_si128, I get: Running: 'fy5z' took 1956792 u/s.

This is on macOS, on a 2.7 GHz quad-core Intel Core i7.

Why is store so much faster than stream in this use case?
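
For reference, the swap described in the question changes a single intrinsic. The sketch below places the two variants side by side. It assumes a stand-in state type (demo_state_t) with explicit 16-byte alignment, and the demo_ names are hypothetical. _mm_stream_si128 is a non-temporal store, a hint that the data will not be read again soon, whereas this generator reads the same 16 bytes back immediately, both in the return statement and in the next call's load. That mismatch is a plausible contributor to the slowdown, but the source does not confirm the cause.

```c
#include <emmintrin.h>
#include <stdint.h>

/* Hypothetical stand-in for the question's state type. */
typedef struct {
    _Alignas(16) uint64_t values[2];
} demo_state_t;

/* Variant 1: ordinary aligned store; the written line stays in cache. */
static uint64_t demo_generate_store(demo_state_t *state)
{
    __m128i got_data  = _mm_load_si128((const __m128i *)state->values);
    __m128i shuffled  = _mm_shuffle_epi32(got_data, 0x8d);
    __m128i final_add = _mm_add_epi8(shuffled, got_data);
    _mm_store_si128((__m128i *)state->values, final_add);
    return state->values[0];
}

/* Variant 2: non-temporal store; same result, different memory behavior. */
static uint64_t demo_generate_stream(demo_state_t *state)
{
    __m128i got_data  = _mm_load_si128((const __m128i *)state->values);
    __m128i shuffled  = _mm_shuffle_epi32(got_data, 0x8d);
    __m128i final_add = _mm_add_epi8(shuffled, got_data);
    _mm_stream_si128((__m128i *)state->values, final_add);
    return state->values[0];
}
```

Both variants compute the same sequence of values; only the way the result reaches memory differs, so any timing gap comes from the store path rather than the arithmetic.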
