Why does malloc() allocate 2 more bytes than expected?

Problem description

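I'm writing a C compiler. Flex recognizes my string tokens and sends them to a function that stores them in a struct{} holding information about the token, but first the string needs its escape characters, i.e. '\', removed. Here is my code: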

char* removeEscapeChars(char* svalue)
{
    char* processedString; //will be the string with escape characters removed
    int svalLen = strlen(svalue);
    printf("svalLen (size of string passed in): %d\n",svalLen);
    printf("svalue (string passed in): %s\n",svalue);
    int foundEscapedChars = 0;
    for (int i = 0; i < svalLen;) 
    {
        if (svalue[i] == '\\') {
            //Found escaped character
            if (svalue[i+1] == 'n') {
                //Found newline character
                svalue[i] = int('\n');
            }
            else if (svalue[i+1] == '0') {
                //Found null character
                svalue[i] = int('\0');
            }
            else {
                //Any other character
                svalue[i] = svalue[i+1];
            }
            i++;
            foundEscapedChars++;
            for (int j = i; j < svalLen + 1; j++) {
                svalue[j] = svalue[j+1];
            }
        }
        else {
            i++;
        }
    }
    int newSize = svalLen - foundEscapedChars;
    processedString = (char*) malloc(newSize * sizeof(char));
    memcpy(processedString,svalue,newSize * sizeof(char));
    printf("newSize: %d\n",newSize);
    printf("processedString: %s\n",processedString);
    printf("processedString Size: %d\n",strlen(processedString));
    
    free(svalue);
    return processedString;
}

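It works 99% of the time, but when it is tested on this particular string (or similar ones of 40 characters), "-//W3C//DTD XHTML 1.0 Transitional//EN", malloc() seems to allocate memory for a string that is 2 bytes too large. The output is below. Note that I call malloc() with int newSize, which it reports as 40, and then strlen() returns 42. sizeof(char) is also == 1. The main problem is that garbage characters get appended to the end of the string. What gives?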

"-//W3C//DTD XHTML 1.0 Transitional//EN"
svalLen (size of string passed in): 40
svalue (string passed in) "-//W3C//DTD XHTML 1.0 Transitional//EN"
newSize: 40
processedString: "-//W3C//DTD XHTML 1.0 Transitional//EN"Z
processedString Size: 42
Line 47 Token: STRINGCONST Value: "-//W3C//DTD XHTML 1.0 Transitional//EN"Z Len: 40 Input: "-//W3C//DTD XHTML 1.0 Transitional//EN"

Solution

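The code has at least this problem: it tries to print a "string" that is not a string, because it lacks the terminating null character and the space to store it.

This results in undefined behavior, and that UB can show up as extra characters being printed. malloc() did not over-allocate: strlen() and printf("%s") simply keep reading past the 40 copied bytes until they happen to hit a zero byte.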

// processedString = (char*) malloc(newSize * sizeof(char));
// memcpy(processedString,svalue,newSize * sizeof(char));
processedString = malloc(newSize + 1);      // one extra byte for the terminator
memcpy(processedString, svalue, newSize);   // copy the characters only
processedString[newSize] = 0;               // append the null character
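
The fix above can also be wrapped in a small helper so the "+ 1 and terminate" step is not forgotten at other call sites. This is only a minimal sketch: the name copyTerminated and its parameters are hypothetical and do not come from the original code.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: copy the first `len` bytes of `src` into a newly
   allocated, null-terminated buffer. Returns NULL if allocation fails.
   The caller is responsible for calling free() on the result. */
char *copyTerminated(const char *src, size_t len)
{
    assert(src != NULL);

    char *dst = malloc(len + 1);   /* +1 for the terminating null character */
    if (dst == NULL) {
        return NULL;
    }
    memcpy(dst, src, len);         /* copies exactly len bytes, no terminator */
    dst[len] = '\0';               /* now strlen() and %s are well defined */
    return dst;
}
```

In the question's function this would presumably be called as copyTerminated(svalue, newSize), after which strlen() on the result should report newSize rather than a larger, arbitrary number.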

Other problems may exist.

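Here is a modification of your code that takes a different, more conventional approach to string handling. Start with a function that counts the escape characters, since that is useful in the next step: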

int escapeCount(char* str) {
    int c = 0;

    // Can just increment and work through the string using the given pointer
    while (*str) {
        // Backslash something here
        if (*str == '\\') {
            ++str;
            ++c;
        }

        // Advance past the current character, unless an unmatched \
        // has already brought us to the end of the string
        if (*str) {
            ++str;
        }
    }

    return c;
}

Now, with that information, you can allocate the right buffer size:

char* removeEscapeChars(char* str)
{
    // IMPORTANT: Allocate strlen() + 1 for the NUL byte not counted
    char* result = malloc(strlen(str) - escapeCount(str) + 1);
    if (!result) {
        return NULL;
    }
    char* r = result;

    while (*str) {
        if (*str == '\\') {
            ++str;

            if (!*str) {
                // Unmatched \ at end of string: nothing left to copy
                break;
            }

            switch (*str) {
                case 'n':
                    *r = '\n';
                    break;
                case 'r':
                    *r = '\r';
                    break;
                case 't':
                    *r = '\t';
                    break;
                default:
                    *r = *str;
                    break;
            }
        }
        else {
            *r = *str;
        }

        ++str;
        ++r;
    }

    // The loop never copies the NUL byte, so terminate the result here
    *r = '\0';

    return result;
}
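
Because the unescaped text is never longer than the input, another option is to rewrite the buffer in place and skip the allocation entirely. The following is a minimal sketch of that idea, not the approach taken in the answer above. The name unescapeInPlace is hypothetical, and it assumes the caller owns a writable buffer, which appears to be the case in the question, where svalue is modified and later freed.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: unescape `s` in place and return the new length.
   Only \n, \r and \t are translated, mirroring the switch above; any other
   escaped character is kept literally, and a lone trailing backslash is
   dropped. */
size_t unescapeInPlace(char *s)
{
    assert(s != NULL);

    const char *read = s;   /* next character to examine */
    char *write = s;        /* next position to fill; never ahead of read */

    while (*read) {
        if (*read == '\\') {
            ++read;
            if (!*read) {
                break;      /* unmatched backslash at end of string */
            }
            switch (*read) {
                case 'n': *write = '\n'; break;
                case 'r': *write = '\r'; break;
                case 't': *write = '\t'; break;
                default:  *write = *read; break;
            }
        } else {
            *write = *read;
        }
        ++read;
        ++write;
    }

    *write = '\0';          /* always terminate the shortened string */
    return (size_t)(write - s);
}
```

The returned length can be stored alongside the token, so later code does not need to call strlen() again. A string without backslashes, such as the 40-character one from the question, is left unchanged.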

c compiler-construction flex-lexer parsing