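Problem description
From this reference:
It states that the value of string.length is actually the number of UTF-16 code units, not the number of characters.
I naively assumed that any character that is 3 or 4 bytes wide in UTF-8 must take up 2 UTF-16 code units. Here is what I mean:
I ran some experiments on the string ?œ´®†¥¨ˆøπ¬˚∆˙©ƒ∂ßåΩ≈ç√∫˜µ≤ユーザーコードa, which contains a mix of characters that are 1, 2, 3, and 4 bytes wide in UTF-8. I got some surprising results.
Each UTF-16 code unit is 2 bytes wide. The string contains 35 characters. Its string.length is 36, which means that only one character takes 2 UTF-16 code units, yet several of the characters are 3 or 4 bytes wide in UTF-8.
Using the code below, I checked each character, the number of UTF-8 bytes it uses, and its string.length. What I find interesting is that every 3-byte UTF-8 character uses only one UTF-16 code unit. The only character that needs 2 code units is the 4-byte-wide emoji.
Can someone explain how this is possible? Thanks!
Code: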
function iterateCharacters(str) {
    let te = new TextEncoder();
    let totalBytes = 0;
    let totalCodeUnits1 = 0;
    let totalCodeUnits2 = 0;
    let arr = [...str];
    for (let i = 0; i < arr.length; i++) {
        let bytes = te.encode(arr[i]).length;
        let length = arr[i].length;
        totalBytes += bytes;
        console.log(" i: " + i + " char: " + arr[i] + " bytes: " + bytes + " length: " + length);
        // Erroneous assumption that more than 2 utf8 bytes would occupy 2 UTF-16 code units:
        totalCodeUnits1 += bytes < 3 ? 1 : 2;
        totalCodeUnits2 += length;
    }
    console.log(" total UTF-16 code units (erroneous calculation): " + totalCodeUnits1);
    console.log(" total UTF-16 code units (correct calculation): " + totalCodeUnits2);
    console.log(" total characters: " + arr.length);
    console.log(" total UTF-8 bytes: " + totalBytes);
}
var sample = "?œ´®†¥¨ˆøπ¬˚∆˙©ƒ∂ßåΩ≈ç√∫˜µ≤ユーザーコードa";
iterateCharacters(sample);
console.log("total number of UTF-16 code units: " + sample.length);
console.log("total number of characters: " + [...sample].length);
console.log("total number of UTF-8 bytes: " + (new TextEncoder().encode(sample)).length);
Results:
i: 0 char: ? bytes: 4 length: 2
i: 1 char: œ bytes: 2 length: 1
i: 2 char: ´ bytes: 2 length: 1
i: 3 char: ® bytes: 2 length: 1
i: 4 char: † bytes: 3 length: 1
i: 5 char: ¥ bytes: 2 length: 1
i: 6 char: ¨ bytes: 2 length: 1
i: 7 char: ˆ bytes: 2 length: 1
i: 8 char: ø bytes: 2 length: 1
i: 9 char: π bytes: 2 length: 1
i: 10 char: ¬ bytes: 2 length: 1
i: 11 char: ˚ bytes: 2 length: 1
i: 12 char: ∆ bytes: 3 length: 1
i: 13 char: ˙ bytes: 2 length: 1
i: 14 char: © bytes: 2 length: 1
i: 15 char: ƒ bytes: 2 length: 1
i: 16 char: ∂ bytes: 3 length: 1
i: 17 char: ß bytes: 2 length: 1
i: 18 char: å bytes: 2 length: 1
i: 19 char: Ω bytes: 2 length: 1
i: 20 char: ≈ bytes: 3 length: 1
i: 21 char: ç bytes: 2 length: 1
i: 22 char: √ bytes: 3 length: 1
i: 23 char: ∫ bytes: 3 length: 1
i: 24 char: ˜ bytes: 2 length: 1
i: 25 char: µ bytes: 2 length: 1
i: 26 char: ≤ bytes: 3 length: 1
i: 27 char: ユ bytes: 3 length: 1
i: 28 char: ー bytes: 3 length: 1
i: 29 char: ザ bytes: 3 length: 1
i: 30 char: ー bytes: 3 length: 1
i: 31 char: コ bytes: 3 length: 1
i: 32 char: ー bytes: 3 length: 1
i: 33 char: ド bytes: 3 length: 1
i: 34 char: a bytes: 1 length: 1
total UTF-16 code units (erroneous calculation): 50
total UTF-16 code units (correct calculation): 36
total characters: 35
total UTF-8 bytes: 85
total number of UTF-16 code units: 36
total number of characters: 35
total number of UTF-8 bytes: 85
(See also the JSFiddle: https://jsfiddle.net/Allasso/o5zpmrc9/)
Solution
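UTF-16 is not the same as UTF-8. Every Unicode code point below 0x10000 (hex) can be represented by a single UTF-16 code unit. Above that, you spill over into 2 UTF-16 code units (a surrogate pair). A 3-byte UTF-8 sequence covers the code points U+0800 through U+FFFF, all of which are below 0x10000, so every 3-byte UTF-8 encoding fits in a single UTF-16 code unit. Only code points that need 4 bytes in UTF-8 need 2 code units in UTF-16.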
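The thresholds above can be checked from the code point alone, without encoding anything. What follows is only a minimal sketch: the helper names utf16Units and utf8Bytes are made up for illustration, and U+1F600 stands in for the emoji in the question, which is an assumption since the original emoji is not known.

```javascript
// Number of UTF-16 code units needed for one Unicode code point.
function utf16Units(codePoint) {
  return codePoint < 0x10000 ? 1 : 2;
}

// Number of UTF-8 bytes needed for one Unicode code point.
function utf8Bytes(codePoint) {
  if (codePoint < 0x80) return 1;
  if (codePoint < 0x800) return 2;
  if (codePoint < 0x10000) return 3;
  return 4;
}

// "\u{1F600}" is a stand-in emoji (assumption, not the one from the question).
for (const ch of ["a", "œ", "†", "ユ", "\u{1F600}"]) {
  const cp = ch.codePointAt(0);
  console.log(
    "U+" + cp.toString(16).toUpperCase().padStart(4, "0"),
    "UTF-8 bytes:", utf8Bytes(cp),
    "UTF-16 code units:", utf16Units(cp)
  );
}
```

The 3-byte and 1-code-unit ranges both end at U+FFFF, which is why the "more than 2 bytes means 2 code units" guess in the question overcounts.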
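Remember that the 3 bytes of a UTF-8 character are not used entirely for the actual code point (the number). Some of the bits are taken up by "marker" bits, which tell the decoding software where a sequence starts and which bytes continue it. A 3-byte sequence has the form 1110xxxx 10xxxxxx 10xxxxxx, which leaves only 16 bits for the code point, exactly the width of one UTF-16 code unit. UTF-16 uses marker bits too, but only in surrogate pairs, and the scheme (the marker bit pattern) is different.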
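To make the marker bits visible, here is a rough sketch that builds both encodings by hand and prints them in binary. The function names encodeUtf8ThreeBytes, toSurrogatePair, and toBinary are hypothetical helpers, and U+1F600 is again just an assumed stand-in for the emoji.

```javascript
// Encode a code point in U+0800..U+FFFF (excluding surrogates) as 3 UTF-8 bytes.
// Layout: 1110xxxx 10xxxxxx 10xxxxxx, i.e. 4 + 6 + 6 = 16 payload bits.
function encodeUtf8ThreeBytes(cp) {
  return [
    0xe0 | (cp >> 12),
    0x80 | ((cp >> 6) & 0x3f),
    0x80 | (cp & 0x3f),
  ];
}

// Split a code point in U+10000..U+10FFFF into a UTF-16 surrogate pair.
// Layout: 110110xxxxxxxxxx 110111xxxxxxxxxx, i.e. 10 + 10 = 20 payload bits.
function toSurrogatePair(cp) {
  const v = cp - 0x10000;
  return [0xd800 | (v >> 10), 0xdc00 | (v & 0x3ff)];
}

// Zero-padded binary string of a given width.
const toBinary = (n, width) => n.toString(2).padStart(width, "0");

// U+2020 (†): three UTF-8 bytes, but its 16 payload bits fit one UTF-16 code unit.
console.log(encodeUtf8ThreeBytes(0x2020).map((b) => toBinary(b, 8)).join(" "));
// U+1F600 (assumed stand-in emoji): needs two UTF-16 code units.
console.log(toSurrogatePair(0x1f600).map((u) => toBinary(u, 16)).join(" "));
```

In the first line of output, the leading 1110 and the two leading 10 prefixes are markers. Stripping them leaves the 16 bits of the code point itself.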