How can a 3-byte-wide UTF-8 character use only a single UTF-16 code unit?

Problem description

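From this reference:

https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/length#Description

It states that the value of string.length is actually the number of UTF-16 code units, not the number of characters.

I naively assumed that any UTF-8 character that is 3 or 4 bytes wide would have to occupy 2 UTF-16 code units. Here is what I mean:

I ran some experiments on the string ?œ´®†¥¨ˆøπ¬˚∆˙©ƒ∂ßåΩ≈ç√∫˜µ≤ユーザーコードa, which contains a mix of characters that are 1, 2, 3 and 4 bytes wide. I got some surprising results.

Each UTF-16 code unit is 2 bytes wide. The string contains 35 characters. Its string.length is 36, which means that only one character takes 2 UTF-16 code units, even though several of the characters are 3 or 4 bytes wide in UTF-8.

Using the code below, I examined each character, the number of UTF-8 bytes it uses, and its string.length. What I found interesting is that every 3-byte UTF-8 character uses only one UTF-16 code unit. The only character that needs 2 code units is the 4-byte-wide emoji.

Can someone explain how this is possible? Thanks!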

Code:

function iterateCharacters(str) {
  let te = new TextEncoder();
  let totalBytes = 0;
  let totalCodeUnits1 = 0;
  let totalCodeUnits2 = 0;

  let arr = [...str];
  for (let i = 0; i < arr.length; i++) {
    let bytes = te.encode(arr[i]).length;
    let length = arr[i].length;
    totalBytes += bytes;
    console.log("    i: " + i + "    char: " + arr[i] + "    bytes: " + bytes + "    length: " + length);
    // Erroneous assumption that more than 2 utf8 bytes would occupy 2 UTF-16 code units:
    totalCodeUnits1 += bytes < 3 ? 1 : 2;
    totalCodeUnits2 += length;
  }
  console.log("    total UTF-16 code units (erroneous calculation):  " + totalCodeUnits1);
  console.log("    total UTF-16 code units (correct calculation):    " + totalCodeUnits2);
  console.log("    total characters:                                 " + arr.length);
  console.log("    total UTF-8 bytes:                                " + totalBytes);
}

var sample = "?œ´®†¥¨ˆøπ¬˚∆˙©ƒ∂ßåΩ≈ç√∫˜µ≤ユーザーコードa";

iterateCharacters(sample);
console.log("total number of UTF-16 code units:  " + sample.length);
console.log("total number of characters:         " + [...sample].length);
console.log("total number of UTF-8 bytes:        " + (new TextEncoder().encode(sample)).length);

Output:

    i: 0    char: ?    bytes: 4    length: 2
    i: 1    char: œ    bytes: 2    length: 1
    i: 2    char: ´    bytes: 2    length: 1
    i: 3    char: ®    bytes: 2    length: 1
    i: 4    char: †    bytes: 3    length: 1
    i: 5    char: ¥    bytes: 2    length: 1
    i: 6    char: ¨    bytes: 2    length: 1
    i: 7    char: ˆ    bytes: 2    length: 1
    i: 8    char: ø    bytes: 2    length: 1
    i: 9    char: π    bytes: 2    length: 1
    i: 10    char: ¬    bytes: 2    length: 1
    i: 11    char: ˚    bytes: 2    length: 1
    i: 12    char: ∆    bytes: 3    length: 1
    i: 13    char: ˙    bytes: 2    length: 1
    i: 14    char: ©    bytes: 2    length: 1
    i: 15    char: ƒ    bytes: 2    length: 1
    i: 16    char: ∂    bytes: 3    length: 1
    i: 17    char: ß    bytes: 2    length: 1
    i: 18    char: å    bytes: 2    length: 1
    i: 19    char: Ω    bytes: 2    length: 1
    i: 20    char: ≈    bytes: 3    length: 1
    i: 21    char: ç    bytes: 2    length: 1
    i: 22    char: √    bytes: 3    length: 1
    i: 23    char: ∫    bytes: 3    length: 1
    i: 24    char: ˜    bytes: 2    length: 1
    i: 25    char: µ    bytes: 2    length: 1
    i: 26    char: ≤    bytes: 3    length: 1
    i: 27    char: ユ    bytes: 3    length: 1
    i: 28    char: ー    bytes: 3    length: 1
    i: 29    char: ザ    bytes: 3    length: 1
    i: 30    char: ー    bytes: 3    length: 1
    i: 31    char: コ    bytes: 3    length: 1
    i: 32    char: ー    bytes: 3    length: 1
    i: 33    char: ド    bytes: 3    length: 1
    i: 34    char: a    bytes: 1    length: 1
    total UTF-16 code units (erroneous calculation):  50
    total UTF-16 code units (correct calculation):    36
    total characters:                                 35
    total UTF-8 bytes:                                85
total number of UTF-16 code units:  36
total number of characters:         35
total number of UTF-8 bytes:        85

(See also the JSFiddle: https://jsfiddle.net/Allasso/o5zpmrc9/)

Solution

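UTF-16 is not the same as UTF-8. Every Unicode code point below 0x10000 can be represented by a single UTF-16 code unit. At or above that value, you spill over into 2 UTF-16 code units (a surrogate pair). A 3-byte UTF-8 sequence encodes code points from U+0800 to U+FFFF, which means that every 3-byte UTF-8 encoding fits in a single UTF-16 code unit.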
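The thresholds can be checked directly. The following is a minimal sketch, not code from the question: utf8ByteCount and utf16UnitCount are hypothetical helpers that predict the encoded size from the numeric code point alone, and the loop compares those predictions with what TextEncoder and String.prototype.length actually report. It assumes valid scalar values (lone surrogates are not handled).

```javascript
// Hypothetical helpers (not from the original post): predict the encoded size
// of a single code point from its numeric value alone.
function utf8ByteCount(codePoint) {
  if (codePoint < 0x80) return 1;     // U+0000..U+007F
  if (codePoint < 0x800) return 2;    // U+0080..U+07FF
  if (codePoint < 0x10000) return 3;  // U+0800..U+FFFF
  return 4;                           // U+10000..U+10FFFF
}

function utf16UnitCount(codePoint) {
  // Only code points at or above 0x10000 need a surrogate pair.
  return codePoint < 0x10000 ? 1 : 2;
}

const encoder = new TextEncoder();
// a, π, †, ユ, and one code point above 0xFFFF chosen for illustration.
for (const cp of [0x61, 0x3c0, 0x2020, 0x30e6, 0x1f600]) {
  const ch = String.fromCodePoint(cp);
  console.log(
    "U+" + cp.toString(16).toUpperCase().padStart(4, "0"),
    "utf8 bytes:", utf8ByteCount(cp), "(actual " + encoder.encode(ch).length + ")",
    "utf16 units:", utf16UnitCount(cp), "(actual " + ch.length + ")"
  );
}
```

The 3-byte row and the 1-unit row share the same upper bound (0xFFFF), which is why the two counts never disagree in the way the question assumed.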

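Keep in mind that the 3 bytes of a UTF-8 character are not all used for the actual code point (the number). Some of the bits are taken up by "marker" bits, which tell the decoding software where a sequence starts and continues. In a 3-byte sequence, 8 of the 24 bits are markers, leaving 16 bits for the code point. UTF-16 does the same thing when it needs 2 code units, but the scheme (the marker bit pattern) is different.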
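To make the marker bits concrete, here is a rough sketch that strips them by hand. decodeThreeByteUtf8 and decodeSurrogatePair are hypothetical helpers written for illustration, and they assume well-formed input (no validation of the marker bits themselves).

```javascript
// Hypothetical helper: decode one 3-byte UTF-8 sequence by masking off markers.
// Layout: 1110xxxx 10xxxxxx 10xxxxxx -> 4 + 6 + 6 = 16 payload bits.
function decodeThreeByteUtf8(bytes) {
  const [b0, b1, b2] = bytes;
  return ((b0 & 0x0f) << 12) | ((b1 & 0x3f) << 6) | (b2 & 0x3f);
}

// Hypothetical helper: combine a UTF-16 surrogate pair into one code point.
// Layout: 110110xxxxxxxxxx 110111xxxxxxxxxx -> 10 + 10 = 20 payload bits,
// offset by 0x10000.
function decodeSurrogatePair(high, low) {
  return 0x10000 + (((high & 0x3ff) << 10) | (low & 0x3ff));
}

// "ユ" is U+30E6; its UTF-8 bytes are 0xE3 0x83 0xA6.
const threeBytes = new TextEncoder().encode("ユ");
const fromUtf8 = decodeThreeByteUtf8(threeBytes);
console.log(fromUtf8.toString(16));            // 30e6
console.log(fromUtf8 === "ユ".charCodeAt(0));   // true: one UTF-16 code unit holds it

// A code point above 0xFFFF, chosen for illustration, needs two code units.
const astral = String.fromCodePoint(0x1f600);
const fromUtf16 = decodeSurrogatePair(astral.charCodeAt(0), astral.charCodeAt(1));
console.log(fromUtf16 === astral.codePointAt(0)); // true
```

Since only 16 payload bits survive in a 3-byte UTF-8 sequence, the decoded value always fits in one 16-bit UTF-16 code unit.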
