Formatting a file changes its encoding on Red Hat

Problem description

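I have a bash script that extracts data from an Oracle database. I use spool to extract the data. After extraction, I format the file by deleting and replacing some characters. My problem is that the formatted file ends up ANSI-encoded instead of UTF-8.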

Extraction with spool: the file is UTF-8.
Formatting with the cat and tr commands, redirected to another file: this file is ANSI.

The same process works fine on an AIX system. I tried iconv but it does not work. Do you know why the encoding changes from UTF-8 to ANSI, and how I can fix it?

Solution

You should use either ISO-8859-1 or UTF-8. In the latter case, don't use tr, because it does not (yet?) support multibyte characters; use sed instead (e.g. sed 's/deletethis//g').

ISO-8859-1:

export LC_CTYPE=fr_FR.ISO-8859-1
export NLS_LANG=French_France.WE8ISO8859P1

# fetch data from Oracle, emulated by the following line
echo 'âêîôû' >test.latin1 # 5 bytes (+lineend)

# perform formatting, e.g.:
sed 's/ê/[e-circumflex]/g' test.latin1

# or the same with hex-codes:
sed $'s/\xea/[e-circumflex]/g' test.latin1

UTF-8:

export LC_CTYPE=fr_FR.UTF-8
export NLS_LANG=French_France.AL32UTF8

# fetch data from Oracle, emulated by the following line
echo 'âêîôû' >test.utf8 # 10 bytes (+lineend)

# perform formatting, e.g.:
sed 's/ê/[e-circumflex]/g' test.utf8

# or the same with hex-codes:
sed $'s/\xc3\xaa/[e-circumflex]/g' test.utf8
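The byte counts in the comments above can be checked without relying on the terminal's encoding. The following is a minimal sketch: it writes the same five characters with octal escapes through printf, once in each encoding, and counts the bytes. The temporary directory and the variable names are only illustrative.

```shell
tmpdir=$(mktemp -d)

# "âêîôû" in ISO-8859-1: one byte per character, plus the line end
printf '\342\352\356\364\373\n' > "$tmpdir/test.latin1"

# The same text in UTF-8: two bytes per character, plus the line end
printf '\303\242\303\252\303\256\303\264\303\273\n' > "$tmpdir/test.utf8"

# wc -c counts bytes, not characters; tr strips padding some wc versions add
latin1_bytes=$(wc -c < "$tmpdir/test.latin1" | tr -d ' ')
utf8_bytes=$(wc -c < "$tmpdir/test.utf8" | tr -d ' ')

echo "latin1: $latin1_bytes bytes"
echo "utf8:   $utf8_bytes bytes"

# Hex dumps show which encoding a file is really in
od -An -tx1 "$tmpdir/test.latin1"
od -An -tx1 "$tmpdir/test.utf8"

rm -rf "$tmpdir"
```

The counts include the trailing newline, so they come out one higher than the 5 and 10 bytes quoted in the comments.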

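Note: no conversion (iconv, recode, etc.) is needed; just make sure that NLS_LANG and LC_CTYPE are compatible. (Your terminal (emulator) should also be set accordingly; in PuTTY this is Configuration / Category / Window / Translation / Remote character set.)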
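To see whether a file produced by the script is still valid UTF-8, one option is to let iconv read it as UTF-8 and look at the exit status. This is only a sketch: is_utf8 is a hypothetical helper name, and it assumes an iconv (such as the glibc one) that fails on invalid input sequences. It checks validity only; it does not convert anything.

```shell
# Hypothetical helper: succeeds if the file decodes cleanly as UTF-8.
is_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" >/dev/null 2>&1
}

tmpdir=$(mktemp -d)
printf '\303\242\303\252\n' > "$tmpdir/test.utf8"    # "âê" in UTF-8
printf '\342\352\n'         > "$tmpdir/test.latin1"  # "âê" in ISO-8859-1

for f in "$tmpdir/test.utf8" "$tmpdir/test.latin1"; do
    if is_utf8 "$f"; then
        echo "${f##*/}: valid UTF-8"
    else
        echo "${f##*/}: not valid UTF-8"
    fi
done

rm -rf "$tmpdir"
```

Note that a pure ASCII file also passes this check, since ASCII is a subset of UTF-8.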

Original answer:

I don't know what goes wrong in the formatting you perform, but here is one way to corrupt UTF-8 encoded text:

$ echo 'ÁRVÍZTŰRŐ TÜKÖRFÚRÓGÉP' | iconv -f iso-8859-2 -t utf-8 | xxd
00000000: c381 5256 c38d 5a54 c5b0 52c5 9020 54c3  ..RV..ZT..R.. T.
00000010: 9c4b c396 5246 c39a 52c3 9347 c389 500a  .K..RF..R..G..P.

$ echo 'ÁRVÍZTŰRŐ TÜKÖRFÚRÓGÉP' | iconv -f iso-8859-2 -t utf-8 | tr -d $'\200-\237' | xxd
00000000: c352 56c3 5a54 c5b0 52c5 2054 c34b c352  .RV.ZT..R. T.K.R
00000010: 46c3 52c3 47c3 500a                      F.R.G.P.

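Here the tr -d $'\200-\237' part deletes every byte in the range 0x80-0x9F, which cuts some of the two-byte UTF-8 sequences in half (c381 becomes c3, c590 becomes c5) and leaves the text unusable.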
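The difference between deleting single bytes and deleting whole sequences can be sketched as follows. The sample text, the variable names and the use of LC_ALL=C (to force byte-wise handling regardless of which locales are installed) are assumptions made for illustration; they are not part of the original script.

```shell
sample=$(mktemp)

# "ÁRVÍZ" in UTF-8, written with octal escapes (Á = c3 81, Í = c3 8d)
printf '\303\201RV\303\215Z\n' > "$sample"

# tr works byte by byte: removing 0x80-0x9F strips the second byte of
# both two-byte sequences and leaves stray c3 bytes behind
broken=$(LC_ALL=C tr -d '\200-\237' < "$sample" | od -An -tx1 | tr -d ' \n')

# sed is given the complete two-byte sequence for "Á", so the character
# is removed as a whole and the remaining "Í" stays intact
intact=$(LC_ALL=C sed "s/$(printf '\303\201')//g" "$sample" | od -An -tx1 | tr -d ' \n')

echo "tr:  $broken"   # c35256c35a0a
echo "sed: $intact"   # 5256c38d5a0a

rm -f "$sample"
```

In the tr output each c3 is no longer followed by a continuation byte, so the result is not valid UTF-8; the sed output still contains the complete c3 8d sequence.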
