The right way to use _mm_clflush to flush a large struct

Problem description

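I am starting to use functions like _mm_clflush, _mm_clflushopt, and _mm_clwb.

Say I have defined a struct named mystruct whose size is 256 bytes. My cache line size is 64 bytes. Now I want to flush the cache lines that contain the mystruct variable. Which of the following ways is correct?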

_mm_clflush(&mystruct)

or

for (int i = 0; i < sizeof(mystruct)/64; i++) {
     _mm_clflush( ((char *)&mystruct) + i*64);
}

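Solution

The clflush CPU instruction doesn't know the size of your struct; it flushes exactly one cache line, the one containing the byte that the pointer operand points to. (The C intrinsic exposes this as a const void*, but char* would also make sense, especially given that the asm documentation describes it as an 8-bit memory operand.)

You need 4 flushes spaced 64 bytes apart, or possibly 5 if your struct isn't alignas(64), because it could then have parts in 5 different lines. (You could unconditionally flush the last byte of the struct instead of using more complicated logic to check whether it's in a cache line you haven't flushed yet; which is better depends on the relative cost of clflush vs. more logic and a possible branch mispredict.)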
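To make the "4 or possibly 5" count concrete, here is a minimal sketch that computes how many cache lines a byte range touches. The helper name lines_spanned and the LINESIZE constant are illustrative assumptions, not part of any library; it assumes 64-byte lines and a nonzero size.

```c
#include <stddef.h>
#include <stdint.h>

enum { LINESIZE = 64 };  /* assumed cache-line size in bytes */

/* Hypothetical helper: number of cache lines touched by the `size` bytes
 * starting at address `addr`. Assumes size > 0. */
static size_t lines_spanned(uintptr_t addr, size_t size)
{
    uintptr_t first = addr / LINESIZE;               /* line index of the first byte */
    uintptr_t last  = (addr + size - 1) / LINESIZE;  /* line index of the last byte  */
    return (size_t)(last - first + 1);
}
```

For a 256-byte struct this gives 4 when the start address is a multiple of 64 and 5 for any other start address.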

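Your original loop did 4 flushes of 4 adjacent bytes at the start of your struct, which is what happens when the pointer advances by i instead of i*64: all four addresses fall in the same line or two. The i*64 version shown above covers the whole struct only if it is 64-byte aligned.
It's probably easiest to use pointer increments so the casting isn't mixed up with the critical logic.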

// first attempt, a bit clunky:
// (needs <emmintrin.h> for _mm_clflush and <stdint.h> for uintptr_t)
    const int LINESIZE = 64;
    const char *lastbyte = (const char *)(&mystruct+1) - 1;
    const char *p;   // declared outside the loop so it is still in scope for the check below
    for (p = (const char *)&mystruct; p <= lastbyte ; p+=LINESIZE) {
         _mm_clflush( p );
    }
    // if mystruct is guaranteed aligned by 64, you're done.  Otherwise not:

    // check if next line to maybe flush contains the last byte of the struct; if not then it was already flushed.
    // the outer parentheses are required: == binds tighter than & in C
    if( (((uintptr_t)p ^ (uintptr_t)lastbyte) & -LINESIZE) == 0 )
        _mm_clflush( lastbyte );

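x^y is 1 in the bit positions where x and y differ. x & -LINESIZE discards the offset-within-line bits of the address, keeping only the line-number bits. So we can see whether 2 addresses are in the same cache line with just XOR and TEST instructions. (Or clang optimizes that to a shorter cmp instruction.)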
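The XOR-and-mask test can be isolated in a small helper, sketched here under the assumption of 64-byte lines. The name same_cache_line is hypothetical. Note that in C the == operator binds tighter than &, so the masked value must be parenthesized before comparing with 0.

```c
#include <stdbool.h>
#include <stdint.h>

enum { LINESIZE = 64 };  /* assumed cache-line size in bytes */

/* Hypothetical helper: true if both addresses fall in the same cache line. */
static bool same_cache_line(const void *a, const void *b)
{
    uintptr_t diff = (uintptr_t)a ^ (uintptr_t)b;     /* 1 bits where the addresses differ */
    uintptr_t line_bits = ~(uintptr_t)(LINESIZE - 1); /* same mask as -LINESIZE */
    return (diff & line_bits) == 0;                   /* no difference above the offset bits */
}
```

The pointers are only compared, never dereferenced, so the helper can be called on one-past-the-end addresses as well.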

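Or rewrite the flush loop as a single loop, using that if logic as the termination condition:

I used a C++ struct foo &var reference so I could follow your &var syntax but still see how it compiles for a function taking a pointer arg. Adapting to C is straightforward.

Looping over every cache line of an arbitrary-size unaligned struct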

/* I think this version is best: 
  * compact setup / small code-size
  * with no extra latency for the initial pointer
  * doesn't need to peel a final iteration
*/
inline
void flush_structfoo(struct foo &mystruct) {
    const int LINESIZE = 64;
    const char *p = (const char *)&mystruct;
    uintptr_t endline = ((uintptr_t)&mystruct + sizeof(mystruct) - 1) | (LINESIZE-1);
    // set the offset-within-line address bits to get the last byte 
    // of the cacheline containing the end of the struct.

    do {   // flush while p is in a cache line that contains any of the struct
         _mm_clflush( p );
          p += LINESIZE;
    } while(p <= (const char*)endline);
}

With GCC10.2 -O3 for x86-64, this compiles nicely (Godbolt). The listing labels the function flush_v3, and the +255 reflects sizeof(foo) being 256:

flush_v3(foo&):
        lea     rax,[rdi+255]
        or      rax,63
.L11:
        clflush [rdi]
        add     rdi,64
        cmp     rdi,rax
        jbe     .L11
        ret
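As a rough illustration of the C adaptation, the same do/while loop can take a pointer and a byte count instead of a C++ reference. This is a sketch, not the answer's own code: the name flush_range, the size parameter, and the returned flush count (added only to make the behavior easy to check) are assumptions. It assumes an x86 target with SSE2, 64-byte lines, and a nonzero size.

```c
#include <emmintrin.h>  /* _mm_clflush (SSE2) */
#include <stddef.h>
#include <stdint.h>

enum { LINESIZE = 64 };  /* assumed cache-line size in bytes */

/* Hypothetical C version: flush every cache line overlapping
 * [ptr, ptr + size). Requires size > 0. Returns the number of
 * clflush instructions issued. */
static size_t flush_range(const void *ptr, size_t size)
{
    const char *p = (const char *)ptr;
    /* last byte of the cache line that contains the last byte of the range */
    uintptr_t endline = ((uintptr_t)p + size - 1) | (LINESIZE - 1);
    size_t count = 0;

    do {  /* flush while p is in a line that contains part of the range */
        _mm_clflush(p);
        p += LINESIZE;
        count++;
    } while ((uintptr_t)p <= endline);
    return count;
}
```

A call such as flush_range(&mystruct, sizeof mystruct) then issues 4 flushes for a line-aligned 256-byte object and 5 for a misaligned one.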

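Unfortunately, GCC doesn't unroll the loop, and doesn't optimize any better if you use alignas(64) struct foo{...};. You might use if (alignof(mystruct) >= 64) { ... } to check whether special handling is needed so GCC can optimize better; in the aligned case just use end = p + sizeof(mystruct); or end = (const char*)(&mystruct+1) - 1; or similar.

(In C, #include <stdalign.h> to get alignas() and alignof() as macros, like in C++, instead of writing the ISO C11 _Alignas and _Alignof keywords.)

Another option for the loop is the following, but it's clunkier and takes more setup work.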
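The aligned case might be sketched in C as follows. The struct layout, the function name flush_aligned_foo, and the returned count are hypothetical, chosen only to illustrate the alignof check; the code assumes x86 with SSE2 and 64-byte lines, and makes no claim about whether a given compiler unrolls the loop.

```c
#include <emmintrin.h>  /* _mm_clflush (SSE2) */
#include <stdalign.h>   /* alignas, alignof */
#include <stddef.h>

enum { LINESIZE = 64 };  /* assumed cache-line size in bytes */

/* Hypothetical 256-byte struct, forced to cache-line alignment. */
struct foo {
    alignas(64) char bytes[256];
};

/* Flush a line-aligned struct foo: with the alignment known at compile
 * time, the trip count is simply sizeof / LINESIZE and no end-of-struct
 * check is needed. Returns the number of flushes issued. */
static size_t flush_aligned_foo(const struct foo *s)
{
    _Static_assert(alignof(struct foo) >= LINESIZE,
                   "fast path assumes a cache-line-aligned struct");
    const char *p = (const char *)s;
    size_t count = 0;

    for (size_t off = 0; off < sizeof *s; off += LINESIZE) {
        _mm_clflush(p + off);
        count++;
    }
    return count;
}
```

If the static assertion fails for some other struct, the general loop that handles arbitrary alignment is the fallback.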

    const int LINESIZE = 64;
    uintptr_t line = (uintptr_t)&mystruct & -LINESIZE;      // start of the first line
    uintptr_t lastline = ((uintptr_t)&mystruct + sizeof(mystruct) - 1) & -LINESIZE;  // start of the last line
    do {               // always at least one flush; works on small structs
         _mm_clflush( (void*)line );
          line += LINESIZE;
    } while(line <= lastline);   // <= so the line containing the last byte is flushed too

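A struct that was 257 bytes would always touch exactly 5 cache lines, no checking needed. The same goes for a 260-byte struct that's known to be aligned by 4. IDK if we can get GCC to optimize away the checks based on that.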
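The 257-byte and 260-byte claims can be checked by brute force over every start offset the alignment allows. This is a small sketch; the function name fixed_line_count is hypothetical, and it assumes 64-byte lines, a nonzero size, and an alignment that is a nonzero power of 2.

```c
#include <stdbool.h>
#include <stddef.h>

enum { LINESIZE = 64 };  /* assumed cache-line size in bytes */

/* Hypothetical check: does an object of `size` bytes with alignment `align`
 * touch the same number of cache lines at every permitted start offset
 * within a line? If so, store that count in *lines_out and return true. */
static bool fixed_line_count(size_t size, size_t align, size_t *lines_out)
{
    size_t expected = (size - 1) / LINESIZE + 1;     /* count at offset 0 */

    for (size_t off = 0; off < LINESIZE; off += align) {
        size_t n = (off + size - 1) / LINESIZE + 1;  /* lines touched at this offset */
        if (n != expected)
            return false;
    }
    *lines_out = expected;
    return true;
}
```

For (257, 1) and (260, 4) it reports a fixed count of 5, while (256, 1) has no fixed count because the result is 4 or 5 depending on the start offset.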

c clflush cpu-cache sse2
