Encoding is not converted when trying to read a JSON file

Problem description

I have a JSON file, file.json, encoded in KOI8-R.

Boost's JSON parser only works with UTF-8, so I try to convert the file from KOI8-R to UTF-8 while reading it:

boost::property_tree::ptree tree;

std::locale loc = boost::locale::generator().generate("ru_RU.UTF-8");
std::ifstream ifs("file.json", std::ios::binary);
ifs.imbue(loc);

boost::property_tree::read_json(ifs, tree);

However, the file cannot be read. What am I doing wrong?

Update:

I wrote a JSON file, "test.txt":

{
    "соплодие": "лысеющий",
    "обсчитавший": "перегнавший",
    "кариозный": "отдёргивающийся",
    "суверенен": "носившийся",
    "рецидивизм": "поляризуются"
}

and saved it in KOI8-R.

I have this code:

#include <boost/property_tree/ptree.hpp>
#include <boost/property_tree/json_parser.hpp>

int main() {
    boost::property_tree::ptree pt;
    boost::property_tree::read_json("test.txt",pt);
}

I compiled and ran it, and got the following error:

terminate called after throwing an instance of 'boost::wrapexcept<boost::property_tree::json_parser::json_parser_error>'
  what():  test.txt(2): invalid code sequence
Aborted (core dumped)

Then I tried Boost.Locale:

#include <boost/property_tree/ptree.hpp>
#include <boost/property_tree/json_parser.hpp>

#include <boost/locale/generator.hpp>
#include <boost/locale/encoding.hpp>

#include <fstream>
#include <locale>

int main() {
    std::locale loc = boost::locale::generator().generate("ru_RU.utf8");
    std::ifstream ifs("test.txt",std::ios::binary);
    ifs.imbue(loc);
    
    boost::property_tree::ptree pt;
    boost::property_tree::read_json(ifs,pt);
}

I compiled it (g++ main.cpp -lboost_locale), ran it, and got the following error:

terminate called after throwing an instance of 'boost::wrapexcept<boost::property_tree::json_parser::json_parser_error>'
  what():  <unspecified file>(2): invalid code sequence
Aborted (core dumped)

Solution

The JSON specification requires UTF-8:

8.1. Character Encoding

 JSON text exchanged between systems that are not part of a closed
 ecosystem MUST be encoded using UTF-8 [RFC3629].

So it makes sense that a general-purpose library supports only UTF-8. For more context, see: JSON character encoding - is UTF-8 well-supported by browsers or should I use numeric escape sequences?

How to do it anyway

Convert the data to UTF-8 yourself before parsing, perhaps with libiconv or libicu. Boost.Locale supports the latter.

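Using Boost Locale/ICU

This requires that your Boost.Locale library was built with ICU support, and possibly(?) that you have the required locale available, which is likely already the case on your system.

It also assumes that the source code is UTF-8 encoded, which is also likely.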

Live On Compiler Explorer

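#include <boost/json.hpp>
#include <boost/json/src.hpp> // compiles Boost.JSON into this translation unit
#include <boost/locale.hpp>
#include <boost/locale/encoding.hpp> // boost::locale::conv::to_utf
#include <fstream>
#include <iostream>
#include <iterator>
#include <string>

namespace json = boost::json;

int main() {
    // Read the raw KOI8-R bytes. istreambuf_iterator preserves whitespace;
    // istream_iterator<char> would silently skip it.
    std::string koi8r = [] {
        std::ifstream ifs("input.txt", std::ios::binary);
        return std::string(std::istreambuf_iterator<char>(ifs), {});
    }();

    // Convert to UTF-8 first, then parse.
    json::value doc =
        json::parse(boost::locale::conv::to_utf<char>(koi8r, "KOI8-R"));

    std::cout << "Serialized back: " << doc << "\n";

    // The key literal is UTF-8 only if this source file is saved as UTF-8.
    std::cout << "Extracting a single key: " << doc.as_object()["соплодие"] << "\n";
}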

I made up a random JSON document:

{
    "соплодие": "лысеющий",
    "обсчитавший": "перегнавший",
    "кариозный": "отдёргивающийся",
    "суверенен": "носившийся",
    "рецидивизм": "поляризуются"
}

and saved it in KOI8-R as "input.txt":

00000000: 7b0a 2020 2020 22d3 cfd0 cccf c4c9 c522  {.    "........"
00000010: 3a20 22cc d9d3 c5c0 ddc9 ca22 2c0a 2020  : "........",.  
00000020: 2020 22cf c2d3 dec9 d4c1 d7db c9ca 223a    "...........":
00000030: 2022 d0c5 d2c5 c7ce c1d7 dbc9 ca22 2c0a   "...........",.
00000040: 2020 2020 22cb c1d2 c9cf dace d9ca 223a      ".........":
00000050: 2022 cfd4 c4a3 d2c7 c9d7 c1c0 ddc9 cad3   "..............
00000060: d122 2c0a 2020 2020 22d3 d5d7 c5d2 c5ce  .",.    ".......
00000070: c5ce 223a 2022 cecf d3c9 d7db c9ca d3d1  ..": "..........
00000080: 222c 0a20 2020 2022 d2c5 c3c9 c4c9 d7c9  ",.    "........
00000090: dacd 223a 2022 d0cf ccd1 d2c9 dad5 c0d4  ..": "..........
000000a0: d3d1 220a 7d0a                           ..".}.

Running the program now prints:

Serialized back: {"соплодие":"лысеющий","обсчитавший":"перегнавший","кариозный":"отдёргивающийся","суверенен":"носившийся","рецидивизм":"поляризуются"}
Extracting a single key: "лысеющий"