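Subtracting char * in an implementation of strlen

Problem description

I was looking at the implementation of the strlen() function in C. I need to understand how it works for one of my assignments.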

#define ALIGN (sizeof(size_t))
#define ONES ((size_t)-1/UCHAR_MAX)
#define HIGHS (ONES * (UCHAR_MAX/2+1))
#define HASZERO(x) ((x)-ONES & ~(x) & HIGHS)

size_t strlen(const char *s)
{
    const char *a = s;
    const size_t *w;
    for (; (uintptr_t)s % ALIGN; s++) if (!*s) return s-a;
    for (w = (const void *)s; !HASZERO(*w); w++);
    for (s = (const void *)w; *s; s++);
    return s-a;
}

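In the statement "return s-a", I don't understand what the subtraction of char * values means.

This is musl's implementation. glibc's strlen() implementation also uses this char * subtraction.

Solution

The code explained with comments: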

size_t strlen(const char *s)
{
    const char *a = s;      // store a copy pointing at the start of the original
    const size_t *w;
    for (; (uintptr_t)s % ALIGN; s++) // in case of misalignment, look for the first aligned address
      if (!*s) return s-a; // if we encounter \0 while doing so, return the string length
    for (w = (const void *)s; !HASZERO(*w); w++); // work with word-sized chunks and do lookup
    for (s = (const void *)w; *s; s++); // find the exact location of \0 in the final word
    return s-a; // end minus beginning = length
}
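To make the word-sized lookup less opaque, here is a minimal sketch of what HASZERO tests. The macros are the ones from the question (with one extra pair of parentheses); the helper names word_has_zero_byte and repeat_byte are hypothetical and only serve the illustration. The sketch assumes 8-bit bytes.

```c
#include <limits.h>
#include <stddef.h>

#define ONES ((size_t)-1/UCHAR_MAX)      /* every byte is 0x01 */
#define HIGHS (ONES * (UCHAR_MAX/2+1))   /* every byte is 0x80 */
#define HASZERO(x) (((x)-ONES) & ~(x) & HIGHS)

/* Hypothetical helper: returns 1 if at least one byte of x is zero, else 0.
   Subtracting 0x01 from a zero byte borrows and sets its high bit, while
   ~(x) keeps that high bit only if it was clear in x to begin with. */
int word_has_zero_byte(size_t x)
{
    return HASZERO(x) != 0;
}

/* Hypothetical helper: builds a word in which every byte equals b. */
size_t repeat_byte(unsigned char b)
{
    return ONES * b;
}
```

For example, word_has_zero_byte(repeat_byte('A')) should be 0, and clearing any single byte of that word should make it 1. This is why the second loop can skip a whole size_t at a time until the word containing the terminator is found.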

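Notes regarding C language compliance:

w = (const void *)s relies on non-standard extensions, and *w invokes undefined behavior (a strict aliasing violation, because char objects are read through a size_t lvalue). This is library code, so it may be compiled with a specific setting such as -fno-strict-aliasing.
s-a is actually of type ptrdiff_t, not size_t. A cast might therefore be needed to silence compiler warnings.
size_t is not necessarily the largest aligned type of the implementation; a larger one could exist. I think the most suitable type for 32-bit systems and beyond would be uint_fast32_t. The compiler/library should make this type 32 or 64 bits wide, depending on which is actually fastest on a 32/64-bit CPU.
Library implementations like this one sometimes read word-sized chunks beyond the end of the passed string. The assumption is that if the string does not end at an aligned address, harmless padding bytes exist at that address and can be accessed. This is by no means guaranteed by the C standard (for arrays, doing so is guaranteed to be an out-of-bounds access and therefore UB), but it may be guaranteed by the local implementation.

It should be possible to unwrap this code into something more readable and self-documenting without affecting performance, and to fix some of the issues mentioned above along the way. Possibly something like the following (untested and not benchmarked):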
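As an illustration of how the aliasing and out-of-bounds concerns above could be avoided in strictly conforming C, here is a minimal sketch. It is not what musl or glibc do: it reads words with memcpy instead of through a size_t pointer, and it takes an explicit buffer capacity (similar in spirit to strnlen) so that it never reads past the caller's array. The names load_word and bounded_strlen, and the cap parameter, are hypothetical.

```c
#include <limits.h>
#include <stddef.h>
#include <string.h>

#define ONES ((size_t)-1/UCHAR_MAX)
#define HIGHS (ONES * (UCHAR_MAX/2+1))
#define HASZERO(x) (((x)-ONES) & ~(x) & HIGHS)

/* Hypothetical helper: copies sizeof(size_t) bytes starting at p into a
   size_t. memcpy avoids the strict aliasing problem and has no alignment
   requirement. The caller must ensure all those bytes are readable. */
static size_t load_word(const char *p)
{
    size_t w;
    memcpy(&w, p, sizeof w);
    return w;
}

/* Hypothetical bounded variant: returns the index of the first '\0' in
   s[0..cap-1], or cap if there is none. Never reads beyond s[cap-1]. */
size_t bounded_strlen(const char *s, size_t cap)
{
    size_t i = 0;

    /* Take word-sized steps while a whole word still fits inside cap. */
    while (cap - i >= sizeof(size_t) && !HASZERO(load_word(s + i)))
        i += sizeof(size_t);

    /* Finish byte by byte to find the exact position of the terminator. */
    while (i < cap && s[i] != '\0')
        i++;

    return i;
}
```

Whether a compiler turns the memcpy into a single load, and whether this performs comparably to the pointer-cast version, is an assumption that would need to be checked for a given toolchain.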

#include <stddef.h>
#include <stdint.h>
#include <limits.h>

#define ONES ((uint_fast32_t)-1/UCHAR_MAX)
#define HIGHS (ONES * (UCHAR_MAX/2+1))
#define HASZERO(x) ((x)-ONES & ~(x) & HIGHS)

size_t strlen (const char* s)
{
  const char* begin = s;
  const char* end   = s;

  for (; (uintptr_t)end % _Alignof(uint_fast32_t); end++)
  {
    if (*end == '\0') 
    {
      return (size_t)(end - begin);
    }
  }
  
  const uint_fast32_t* word;
  for (word = (const void*)end; !HASZERO(*word); word++)
  {}
  
  for (end = (const void*)word; *end != '\0'; end++)
  {}
  
  return (size_t)(end - begin);
}

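Let's say you have the string "Hello world". It is stored in the computer's memory as an array and ends with the special "null" character ('\0').

The array looks like this:

+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+
| 'H' | 'e' | 'l' | 'l' | 'o' | ' ' | 'w' | 'o' | 'r' | 'l' | 'd' | '\0' |
+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+

When the function is called (as in strlen("Hello world")), s points to the first character in the array. The initialization of a makes it point to the first character of the array as well.

The three loops modify s so that it ends up pointing to the terminating null character.

If we show the array again, now with the pointers, it looks like this:

+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+
| 'H' | 'e' | 'l' | 'l' | 'o' | ' ' | 'w' | 'o' | 'r' | 'l' | 'd' | '\0' |
+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+
^                                                                 ^
|                                                                 |
a                                                                 s

What s - a does is calculate the difference between the two pointers s and a, measured in array elements. Because the elements are char, that is also the number of bytes. The difference will be 11, which is the length of the string (not counting the null terminator).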
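The same pointer arithmetic can be seen in isolation in a minimal sketch that drops the word-sized optimization and walks the string one byte at a time. The name naive_strlen is hypothetical; this is not the musl code, only the "end minus beginning" idea on its own.

```c
#include <stddef.h>

/* Hypothetical byte-at-a-time version: advance s to the terminator,
   then subtract the saved start pointer. */
size_t naive_strlen(const char *s)
{
    const char *a = s;          /* remember where the string starts */

    while (*s != '\0')          /* walk forward until the null terminator */
        s++;

    /* s - a has type ptrdiff_t; both pointers refer to the same array and
       s >= a, so the conversion to size_t is safe. */
    return (size_t)(s - a);
}
```

Calling naive_strlen("Hello world") returns 11, and naive_strlen("") returns 0 because s never moves away from a.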

c char libraries strlen subtraction
