Problem description
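I am using PowerShell 7 to process the Wikipedia enwik9 file, a 1 GB UTF-8 text file. I have no experience with Unicode or UTF-8. I have captured the offsets and values into a dict, and when I use the code below and increment $i++, the non-ASCII values seem to come in groups of 2, 4, and 6.
- Is $line.Length valid for this string?
- If $i points at a multi-byte character, is it still valid when the loop moves to the next iteration?
- How do I know how many "chars" a given code point contains? Is it Substring($i,1), Substring($i,2), or Substring($i,6)?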
$text = (Get-Content 'enwik9.txt' -Raw)
$line = $text.Substring($i,10000000)
for ($i = 0; $i -lt $line.Length; $i++) {
    $total_cnt++
    $s = $line.Substring($i,1)
    $n = [int][CHAR]$s #I wanted [byte][char] here
    if ($n -ge 128) {
        # Now $n is not what I want because it is not ASCII and > 255 a Unicode\multibyte character
    }
}
Solution
I was able to answer my own question and found a working solution based on the information on this page: "Ã © and other codes"
Clear-Host
Write-Host 'Loading enwik9.txt'
# Note: the loop below treats every [char] as one raw byte (0-255). That only holds if the
# file was decoded with a single-byte encoding; PowerShell 7 decodes as UTF-8 by default.
$text = (Get-Content 'enwik9.txt' -Raw)
Write-Host 'Load complete - processing...'
$line = $text.Substring(0, 10000000)    # work on the first 10,000,000 chars
$total_cnt = 0
for ($i = 0; $i -lt $line.Length; $i++)
{
    $total_cnt++
    $s = $line.Substring($i, 1)
    $n = [int][char]$s
    if ($n -ge 128)
    {
        # How many continuation bytes follow this lead byte?
        # The prefixes are tested longest first, because in a 110xxxxx lead byte
        # the bits below the prefix are payload, not control bits.
        $ns = 0
        if     (($n -band 0xF8) -eq 0xF0) { $ns = 3; $n = $n -band 0x07 }   # 11110xxx
        elseif (($n -band 0xF0) -eq 0xE0) { $ns = 2; $n = $n -band 0x0F }   # 1110xxxx
        elseif (($n -band 0xE0) -eq 0xC0) { $ns = 1; $n = $n -band 0x1F }   # 110xxxxx
        $t = [convert]::ToString($n, 16).PadLeft(2, '0')    # convert int to hex
        $bin = [convert]::ToString($n, 2)                   # payload bits of the lead byte
        Write-Host 'Found a Unicode start byte $ns='$ns ' $n='$n
        for ($c = 1; $c -le $ns; $c++)
        {
            $i++; $total_cnt++              # remember to advance the main loop index into $line
            $s = $line.Substring($i, 1)     # read the next string char
            $n = [int][char]$s              # convert to int
            if (($n -band 0xC0) -ne 0x80)   # every continuation byte must look like 10xxxxxx
            {
                Write-Host 'NOT A CONTINUATION BYTE $ns='$ns
            }
            $n = $n -band 0x3F              # remove the two control bits, keep the 6 payload bits
            $t = [convert]::ToString($n, 16).PadLeft(2, '0')            # convert int to hex
            $bin = $bin + [convert]::ToString($n, 2).PadLeft(6, '0')    # append exactly 6 bits
            $number = [convert]::ToInt32($bin, 2)                       # convert to int
            $hex = [convert]::ToString($number, 16).PadLeft(4, '0')
            Write-Host '$s='$s ' $n='$n ' $t='$t ' $bin='$bin ' $hex='$hex
        }
        $uc = ''
        if ($ns -eq 0) { Write-Host 'SINGLE BYTE'; Read-Host 'ENTER' }
        else { $uc = [char]::ConvertFromUtf32($number) }    # also handles code points above U+FFFF
        Write-Host 'FINAL: Unicode is: '$uc
        Read-Host 'Press ENTER to find and process the next Unicode character'
    }
}
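As a cross-check on the manual bit handling above, the sketch below lets .NET do the decoding instead. It is not the approach from the linked page, and it assumes the string was already decoded correctly as UTF-8 and is well formed (no lone surrogates). The function name Get-Utf8Info and the sample string are made up for illustration. It also bears on the original questions: a .NET string's Length counts UTF-16 code units, so one character is Substring($i,1) unless it lies above U+FFFF, in which case it is Substring($i,2).

```powershell
# Hypothetical helper (not from the original post): list each Unicode character
# in a string with its code point and the number of bytes it takes in UTF-8.
function Get-Utf8Info {
    param([string]$Text)

    for ($i = 0; $i -lt $Text.Length; $i++) {
        # A [string] is a sequence of UTF-16 code units; characters above U+FFFF
        # occupy two of them (a surrogate pair), so take both at once.
        $width = if ([char]::IsSurrogatePair($Text, $i)) { 2 } else { 1 }
        $char  = $Text.Substring($i, $width)

        [pscustomobject]@{
            Char      = $char
            CodePoint = 'U+{0:X4}' -f [char]::ConvertToUtf32($Text, $i)
            Utf8Bytes = [System.Text.Encoding]::UTF8.GetByteCount($char)
        }
        $i += $width - 1    # skip the second half of a surrogate pair
    }
}

# Sample input chosen for illustration: characters that need 1, 2, 3 and 4 bytes in UTF-8.
$sample = 'a' + [char]0xE9 + [char]0x20AC + [char]::ConvertFromUtf32(0x1F600)
$info = Get-Utf8Info $sample
$info | Format-Table
```

For this sample, $sample.Length is 5 while the function reports four characters, because the last one is stored as a surrogate pair.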