Why does a slower function run faster when it is surrounded by other functions?

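Problem description

This is just a small piece of C++ code; I have confirmed the same behavior in Java.

Below is sample code that reproduces the behavior when compiled with Visual Studio 2019, Release x64. I get: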

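611ms for just incrementing elements.

631ms for incrementing elements with the cache, so the extra overhead is 20ms.

But when I add a heavy operation before each increment (I chose random number generation), I get:

2073ms for just incrementing elements.

1432ms for incrementing elements with the cache.

I have an Intel 10700K CPU and 3200 RAM, if that matters.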

#include <iostream>
#include <random>
#include <chrono>
#include <cstdlib>


#define ARR_SIZE (256 * 256 * 256)
#define ACCESS_SIZE (256 * 256)
#define CACHE_SIZE 1024
#define IteraTIONS 1000

using namespace std;
using chrono::high_resolution_clock;
using chrono::duration_cast;
using chrono::milliseconds;

int* arr;
int* cache;
int counter = 0;

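// Apply the buffered increments to arr in one tight loop, then reset the buffer.
// Note: this always replays all CACHE_SIZE slots, even if fewer are pending.
void flushCache() {
    for (int j = 0; j < CACHE_SIZE; ++j)
    {
        ++arr[cache[j]];
    }
    counter = 0;
}

// Record the index in the small buffer; touch arr only when the buffer is full.
void incWithCache(int i) {
    cache[counter] = i;
    ++counter;
    if (counter == CACHE_SIZE) {
        flushCache();
    }
}

// Increment arr directly at a (random) index.
void incWithoutCache(int i) {
    ++arr[i];
}

// Stand-in for unrelated work performed between increments.
int heavyOp() {
    return rand() % 107;
}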

int main()
{
    arr = new int[ARR_SIZE];
    cache = new int[CACHE_SIZE];
    int* access = new int[ACCESS_SIZE];

    random_device rd;
    mt19937 gen(rd());

    for (int i = 0; i < ACCESS_SIZE; ++i) {
        access[i] = gen() % (ARR_SIZE);
    }
    for (int i = 0; i < ARR_SIZE; ++i) {
        arr[i] = 0;
    }


    auto t1 = high_resolution_clock::now();
    for (int iter = 0; iter < IteraTIONS; ++iter) {
        for (int i = 0; i < ACCESS_SIZE; ++i) {
            incWithoutCache(access[i]);
        }
    }
    auto t2 = high_resolution_clock::now();
    auto ms_int = duration_cast<milliseconds>(t2 - t1);
    cout << "Time without cache " << ms_int.count() << "ms\n";

    t1 = high_resolution_clock::now();
    for (int iter = 0; iter < IteraTIONS; ++iter) {
        for (int i = 0; i < ACCESS_SIZE; ++i) {
            incWithCache(access[i]);
        }
        flushCache();
    }
    t2 = high_resolution_clock::now();
    ms_int = duration_cast<milliseconds>(t2 - t1);
    cout << "Time with cache " << ms_int.count() << "ms\n";


    t1 = high_resolution_clock::now();
    for (int iter = 0; iter < IteraTIONS; ++iter) {
        for (int i = 0; i < ACCESS_SIZE; ++i) {
            heavyOp();
            incWithoutCache(access[i]);
        }
    }
    t2 = high_resolution_clock::now();
    ms_int = duration_cast<milliseconds>(t2 - t1);
    cout << "Time without cache and time between " << ms_int.count() << "ms\n";

    t1 = high_resolution_clock::now();
    for (int iter = 0; iter < IteraTIONS; ++iter) {
        for (int i = 0; i < ACCESS_SIZE; ++i) {
            heavyOp();
            incWithCache(access[i]);
        }
        flushCache();
    }
    t2 = high_resolution_clock::now();
    ms_int = duration_cast<milliseconds>(t2 - t1);
    cout << "Time with cache and time between " << ms_int.count() << "ms\n";
}

Solution

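I think questions like this are extremely hard to answer: optimizing compilers, instruction reordering, and caching all make them difficult to analyze. But I do have a hypothesis.

First, the difference between incWithoutCache and incWithCache when there is no heavyOp seems reasonable: the second one simply does more work.

It gets interesting once you introduce heavyOp.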

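heavyOp + incWithoutCache: incWithoutCache needs a fetch from memory before it can write to arr. Only when that fetch completes can it do the addition. Because of pipelining, the processor may start the next heavyOp before the increment has completed.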

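heavyOp + incWithCache: incWithCache does not need a fetch from memory on every iteration, because it only has to write out a value. The processor can queue that write to the memory controller and carry on. It does perform ++counter, but that always touches the same location, so I assume it stays in cache far better than the ++arr[i] in incWithoutCache. When the cache is flushed, the flush loop can be heavily pipelined: each iteration of the flush loop is independent, so many iterations will be in flight at once.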
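The buffering pattern described above can be sketched in isolation. This is a minimal sketch, not the code from the question: BatchedIncrementer and its member names are hypothetical, and it uses std::vector instead of the raw global arrays. It only shows that deferring the increments gives the same result as applying them directly; it does not prove anything about timing.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not from the original post): buffers indices and
// applies the increments later, in one tight loop.
class BatchedIncrementer {
public:
    BatchedIncrementer(std::vector<int>& target, std::size_t capacity)
        : target_(target), capacity_(capacity) {
        assert(capacity > 0);
        pending_.reserve(capacity);
    }

    // Cheap step: a sequential store into a small buffer.
    void increment(std::size_t index) {
        assert(index < target_.size());
        pending_.push_back(index);
        if (pending_.size() == capacity_) {
            flush();
        }
    }

    // Bulk step: only the indices actually pending are replayed. No iteration
    // needs the result of the previous one (unless indices repeat), which is
    // what would let the core keep several cache misses in flight.
    void flush() {
        for (std::size_t index : pending_) {
            ++target_[index];
        }
        pending_.clear();
    }

private:
    std::vector<int>& target_;
    std::vector<std::size_t> pending_;
    std::size_t capacity_;
};
```

To use it, construct it over the target vector, call increment(i) in the hot loop, and call flush() once at the end so that a partially filled buffer is not lost.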

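So I think the big difference is that, without the cache, the actual writes to arr cannot be pipelined efficiently, because heavyOp is trashing your pipeline and possibly your cache. heavyOp takes the same time in both cases, but in heavyOp + incWithoutCache the amortized cost of a write to arr is higher because it is not overlapped with other writes to arr, as can happen with heavyOp + incWithCache.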
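One way to probe this hypothesis, which the original post did not try, is to keep the direct increment but ask for each cache line a few iterations early. The sketch below makes several assumptions: prefetchHint and incWithPrefetch are hypothetical names, std::rand() stands in for heavyOp, and the prefetch distance is a guess that would need tuning. It uses _mm_prefetch on MSVC for x86/x64 and __builtin_prefetch on GCC and Clang; both are only hints to the CPU.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <vector>
#if defined(_MSC_VER) && (defined(_M_X64) || defined(_M_IX86))
#include <xmmintrin.h>
#endif

// Hypothetical wrapper: asks the CPU to start loading the cache line at p.
// It is only a hint; on other toolchains it compiles to nothing.
inline void prefetchHint(const void* p) {
#if defined(_MSC_VER) && (defined(_M_X64) || defined(_M_IX86))
    _mm_prefetch(static_cast<const char*>(p), _MM_HINT_T0);
#elif defined(__GNUC__) || defined(__clang__)
    __builtin_prefetch(p);
#else
    (void)p;
#endif
}

// Same work as the "heavyOp + incWithoutCache" loop, plus a prefetch hint
// issued `distance` iterations before the increment that needs the line.
void incWithPrefetch(std::vector<int>& arr,
                     const std::vector<std::size_t>& access,
                     std::size_t distance) {
    const std::size_t n = access.size();
    for (std::size_t i = 0; i < n; ++i) {
        if (i + distance < n) {
            assert(access[i + distance] < arr.size());
            prefetchHint(&arr[access[i + distance]]);
        }
        (void)(std::rand() % 107);  // stand-in for heavyOp() from the question
        ++arr[access[i]];
    }
}
```

If the explanation above is right, timing this loop against the plain heavyOp + incWithoutCache loop should show the gap to the cached version shrinking. If the timings do not move, that would count as evidence against the hypothesis.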

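I think vectorization could in theory be applied to the flush operation, but I did not see it on Compiler Explorer, so it is probably not what causes the difference. If vectorization were in play, it could explain a speed difference like this one.

I will say that I am not an expert in this and could easily be completely wrong about all of it... but it makes sense to me.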
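A possible reason the flush loop is hard to vectorize (this is an assumption, not something confirmed from the compiler's output) is that the buffered indices may repeat. The sketch below emulates a naive gather, add, scatter over four lanes in plain C++, with no SIMD intrinsics; kLanes, flushScalar and flushNaiveLanes are hypothetical names. When the same index appears twice in one group, the emulated vector version loses increments, so a compiler cannot simply make that transformation.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t kLanes = 4;  // assumed vector width, for illustration only

// Scalar reference: the same thing the flush loop in the question does.
void flushScalar(std::vector<int>& arr, const std::vector<std::size_t>& idx) {
    for (std::size_t i : idx) {
        ++arr[i];
    }
}

// Emulates a naive gather / add / scatter over kLanes indices at a time.
void flushNaiveLanes(std::vector<int>& arr, const std::vector<std::size_t>& idx) {
    assert(idx.size() % kLanes == 0);
    for (std::size_t base = 0; base < idx.size(); base += kLanes) {
        std::array<int, kLanes> lane{};
        for (std::size_t l = 0; l < kLanes; ++l) {
            lane[l] = arr[idx[base + l]];   // gather
        }
        for (std::size_t l = 0; l < kLanes; ++l) {
            ++lane[l];                      // add
        }
        for (std::size_t l = 0; l < kLanes; ++l) {
            arr[idx[base + l]] = lane[l];   // scatter: duplicates overwrite each other
        }
    }
}
```

With four distinct indices the two functions agree. With the index 5 repeated four times, flushScalar leaves arr[5] at 4 while flushNaiveLanes leaves it at 1.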

c++ cpu-cache performance windows