Unpacking a binary file via octets->string->unpack fails: signed int `#(243 0)` is illegal UTF-8

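Problem description

I am parsing a binary file (nifti) that mixes chars, floats, ints, and shorts, using the PDL::IO::Nifti CPAN module as a reference.

I have had good luck parsing octet sequences into strings so that I can pass them to cl-pack:unpack. This is convoluted, but convenient for porting with the perl module as a reference.

This strategy fails when reading #(243 0) as binary: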

(setf my-problem (make-array 2
                             :element-type '(unsigned-byte 8)
                             :initial-contents #(243 0)))
(babel:octets-to-string my-problem)

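Illegal :UTF-8 character starting at position 0.

And, when trying to read the file as char*:

the octet sequence #(243 0 1 0) cannot be decoded.

I am hoping there is a simple encoding issue I have not figured out yet. Trying the reverse direction (packing 243 and getting octets) gives a vector of length 3 where I expected length 2.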

(babel:string-to-octets (cl-pack:pack "s" 243))
; yields #(195 179 0) expect #(243 0)

Full context

;; can read up to position 40. at which we expect 8 signed ints. 
;; 4th int is value "243" but octet cannot be parsed
(setq fid-bin (open "test.nii" :direction :input :element-type 'unsigned-byte))
(file-position fid-bin 40)
(setf seq (make-array (* 2 8) :element-type '(unsigned-byte 8)))
(read-sequence seq fid-bin) 
; seq: #(3 0 0 1 44 1 243 0 1 0 1 0 1 0 1 0)

(babel:octets-to-string seq) ; Illegal :UTF-8 character starting at position 6.
(sb-ext:octets-to-string seq) ; Illegal ....

;; first 3 are as expected
(cl-pack:unpack "s3" (babel:octets-to-string (subseq seq 0 6)))
; 3 256 300

(setf my-problem (subseq seq 6 8)) ; #(243 0)
(babel:octets-to-string my-problem)       ; Illegal :UTF-8 character starting at position 0.

;; checking the reverse direction
;; 243 gets represented as 3 bytes!?
(babel:string-to-octets (cl-pack:pack "s3" 3 256 300))     ; #(3 0 0 1 44 1)
(babel:string-to-octets (cl-pack:pack "s4" 3 256 300 243)) ; #(3 0 0 1 44 1 195 179 0)


(setq fid-str (open "test.nii" :direction :input))
(setf char-seq (make-array (* 2 8) :initial-element nil :element-type 'char*))
(file-position fid-str 40)
(read-sequence char-seq fid-str)
;; :UTF-8 stream decoding error on #<SB-SYS:FD-STREAM ....
;; the octet sequence #(243 0 1 0) cannot be decoded.

Perl equivalent

open my $f,"test.nii";
seek $f,46,0;
read $f,my $b,2;
print(unpack "s",$b); # 243

Solution

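The problem is that the functions you are using try to treat a sequence of octets as an encoding of a sequence of characters (or of some Unicode things: I think Unicode contains things other than characters). In particular, in your case, the functions you are using treat the octet sequence as the UTF-8 encoding of some string. Not all octet sequences are legal UTF-8, so these functions are, correctly, choking on an illegal octet sequence.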
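To make the UTF-8 point concrete, here is a minimal sketch in plain Common Lisp of why `#(243 0)` is rejected and why packing 243 came back as `#(195 179 0)`. The function names `utf8-lead-byte-length`, `utf8-continuation-p`, and `utf8-encode-2` are hypothetical helpers written for this illustration; they are not part of babel or cl-pack, and they cover only the cases needed here.

```lisp
;; Illustrative helpers only (names are assumptions, not library functions).

(defun utf8-lead-byte-length (octet)
  "Return how many octets a UTF-8 sequence starting with OCTET must have,
or NIL if OCTET cannot start a sequence."
  (cond ((< octet #x80) 1)          ; 0xxxxxxx: plain ASCII
        ((<= #xC2 octet #xDF) 2)    ; 110xxxxx: two-octet sequence
        ((<= #xE0 octet #xEF) 3)    ; 1110xxxx: three-octet sequence
        ((<= #xF0 octet #xF4) 4)    ; 11110xxx: four-octet sequence
        (t nil)))

(defun utf8-continuation-p (octet)
  "True if OCTET has the form 10xxxxxx, as every non-leading octet must."
  (= (logand octet #xC0) #x80))

(defun utf8-encode-2 (code)
  "Encode a code point in the range #x80..#x7FF as a list of two octets."
  (list (logior #xC0 (ash code -6))
        (logior #x80 (logand code #x3F))))

;; 243 announces a four-octet sequence, but the next octet, 0, is not a
;; continuation octet, so a UTF-8 decoder has to signal an error.
(utf8-lead-byte-length 243) ; => 4
(utf8-continuation-p 0)     ; => NIL

;; In the other direction, the character with code 243 needs two octets.
(utf8-encode-2 243)         ; => (195 179)
```

This accounts for the extra octet in the question: the two-character string produced by pack has char-codes 243 and 0, and UTF-8 encodes them as 195 179 and 0 respectively.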

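But that is because you are not doing the right thing: what you want is to take a sequence of octets and make a string whose char-codes are those octets. You do not want to deal with any of the business of encoding big characters in small integers, because you are never going to see any big characters. You want something like the following functions (both are somewhat misnamed, because they do not fuss about whether the values really are octets unless you do).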

(defun stringify-octets (octets &key 
                                (element-type 'character)
                                (into (make-string (length octets)
                                                   :element-type element-type)))
  ;; Smash a sequence of octets into a string.
  (map-into into #'code-char octets))

(defun octetify-string (string &key
                               (element-type `(integer 0 (,char-code-limit)))
                               (into (make-array (length string)
                                                 :element-type element-type)))
  ;; smash a string into an array of 'octets' (not actually octets)
  (map-into into #'char-code string))

Now you can check that everything works:

> (octetify-string (pack "s" 243))
#(243 0)

>  (unpack "s" (stringify-octets (octetify-string (pack "s" 243))))
243

And so on. Given your example sequence:

> (unpack "s8" (stringify-octets #(3 0 0 1 44 1 243 0 1 0 1 0 1 0 1 0)))
3
256
300
243
1
1
1
1

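A better approach would be for the packing and unpacking functions simply to work on sequences of octets. But I suspect that is a lost cause. An interim approach, which is horrible but less horrible than converting octet sequences to characters, is to read the file as text, but with an external format that does no translation at all. How to do that is implementation-dependent (but something based on latin-1 would be a good start).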
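As a rough illustration of working on octets directly, the sketch below decodes signed 16-bit integers straight from an octet vector, with no string in between. The names `s16le-ref` and `unpack-s16le` are hypothetical and not part of cl-pack. The sketch assumes little-endian data, which matches the sample sequence in the question; a NIfTI file written on a big-endian machine would need the two octets swapped.

```lisp
;; Hypothetical helpers (not from cl-pack); little-endian byte order assumed.

(defun s16le-ref (octets index)
  "Decode the signed 16-bit little-endian integer that starts at INDEX."
  (let ((u (logior (aref octets index)
                   (ash (aref octets (1+ index)) 8))))
    ;; A set top bit means the value is negative in two's complement.
    (if (logbitp 15 u)
        (- u #x10000)
        u)))

(defun unpack-s16le (octets &key (start 0)
                                 (count (floor (- (length octets) start) 2)))
  "Return a list of COUNT signed 16-bit integers read from OCTETS at START."
  (loop for i from 0 below count
        collect (s16le-ref octets (+ start (* 2 i)))))

(unpack-s16le #(3 0 0 1 44 1 243 0 1 0 1 0 1 0 1 0))
;; => (3 256 300 243 1 1 1 1)
```

Because no character encoding is involved, the octet 243 is never interpreted as the start of a UTF-8 sequence.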

It seems the problem is indeed encoding-related:

CL-USER> (cl-pack:pack "s" 243)
"ó\0"

This is the same result as:

(babel:octets-to-string my-problem :encoding :iso-8859-1)
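The same Latin-1 idea can also be applied when opening the file, so that the stream itself hands back one character per octet. The following is a sketch only: `read-latin-1-string` is a hypothetical helper, and the keyword `:latin-1` is the external-format name SBCL accepts; other implementations may spell it differently, so treat that as an assumption to check.

```lisp
;; Hypothetical helper; :latin-1 is assumed to be a valid external format
;; name (it is in SBCL), other implementations may use another keyword.

(defun read-latin-1-string (path start count)
  "Read up to COUNT characters from PATH beginning at offset START.
With a Latin-1 external format each character's char-code equals the
octet it was read from, so no octet value can fail to decode."
  (with-open-file (in path :direction :input
                           :external-format :latin-1)
    (file-position in start)
    (let* ((chars (make-string count))
           (end (read-sequence chars in)))
      ;; Trim the string if the file ended before COUNT characters.
      (subseq chars 0 end))))
```

With the file from the question, a call such as `(read-latin-1-string "test.nii" 40 16)` should return a 16-character string that can be handed to `cl-pack:unpack` in place of the result of `babel:octets-to-string`.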

binaryfiles common-lisp