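C++ and Boost: encoding/decoding UTF-8

Question

I'm trying to do a very simple task: take a Unicode-aware wstring and convert it to a string encoded as UTF-8 bytes, and then go the other way: take a string containing UTF-8 bytes and convert it to a Unicode-aware wstring. The problem is that I need it to be cross-platform and to work with Boost, and I can't seem to find a way to make it work. I've been toying with http://www.edobashira.com/2010/03/using-boost-code-facet-for-reading-utf8.html and http://www.boost.org/doc/libs/1_46_0/libs/serialization/doc/codecvt.html, trying to convert the code to use stringstream/wstringstream instead of files, but nothing seems to work. For instance, in Python it would look like this: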

>>> u"שלום"
u'\u05e9\u05dc\u05d5\u05dd'
>>> u"שלום".encode("utf8")
'\xd7\xa9\xd7\x9c\xd7\x95\xd7\x9d'
>>> '\xd7\xa9\xd7\x9c\xd7\x95\xd7\x9d'.decode("utf8")
u'\u05e9\u05dc\u05d5\u05dd'

What I'm ultimately after is this:

wchar_t uchars[] = {0x5e9,0x5dc,0x5d5,0x5dd,0};
wstring ws(uchars);
string s = encode_utf8(ws); 
// s now holds "\xd7\xa9\xd7\x9c\xd7\x95\xd7\x9d"
wstring ws2 = decode_utf8(s);
// ws2 now holds {0x5e9,0x5dc,0x5d5,0x5dd}

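I really don't want to add a dependency on ICU or something in that spirit, but as far as I understand, this should be possible with Boost. Some sample code would be greatly appreciated! Thanks

Solution

Thanks everyone, but in the end I resorted to http://utfcpp.sourceforge.net/, a header-only library that is very lightweight and easy to use. I'm sharing some demo code here in case anyone finds it useful: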

inline void decode_utf8(const std::string& bytes,std::wstring& wstr)
{
    utf8::utf8to32(bytes.begin(),bytes.end(),std::back_inserter(wstr));
}
inline void encode_utf8(const std::wstring& wstr,std::string& bytes)
{
    utf8::utf32to8(wstr.begin(),wstr.end(),std::back_inserter(bytes));
}

Usage:

wstring ws(L"\u05e9\u05dc\u05d5\u05dd");
string s;
encode_utf8(ws,s);
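
For readers curious about what the encoding step amounts to, here is a minimal dependency-free sketch of turning code points into UTF-8 bytes. It is not utfcpp's implementation; the names append_utf8 and encode_code_points are hypothetical and exist only for this illustration. It takes a std::u32string so that each element is a full code point. The wrappers above make the same assumption about wchar_t, which holds where wchar_t is 32 bits wide (typical on Linux) but not for characters outside the BMP where wchar_t is 16 bits (Windows).

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper (not part of utfcpp): append the UTF-8 encoding of a
// single code point to `out`. Returns false and leaves `out` untouched for
// surrogates and values above U+10FFFF.
bool append_utf8(std::uint32_t cp, std::string& out)
{
    if ((cp >= 0xD800 && cp <= 0xDFFF) || cp > 0x10FFFF)
        return false;

    if (cp < 0x80) {                       // 1 byte:  0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {               // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {             // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                               // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return true;
}

// Hypothetical helper: encode a whole sequence of code points, silently
// skipping any that append_utf8 rejects (an assumption made for brevity).
std::string encode_code_points(const std::u32string& cps)
{
    std::string out;
    for (char32_t cp : cps)
        append_utf8(static_cast<std::uint32_t>(cp), out);
    return out;
}
```

For example, U+05E9 falls in the two-byte range: its upper bits give 0xD7 and its low six bits give 0xA9, matching the first two bytes of the Python output in the question.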

There is already a Boost link in the comments, but in the almost-standard C++0x there is std::wstring_convert from <locale>:

#include <iostream>
#include <string>
#include <locale>
#include <codecvt>
int main()
{
    wchar_t uchars[] = {0x5e9,0x5dc,0x5d5,0x5dd,0};
    std::wstring_convert<std::codecvt_utf8<wchar_t>> conv;
    std::string s = conv.to_bytes(uchars);
    std::wstring ws2 = conv.from_bytes(s);
    std::cout << std::boolalpha
              << (s == "\xd7\xa9\xd7\x9c\xd7\x95\xd7\x9d" ) << '\n'
              << (ws2 == uchars ) << '\n';
}

Output when compiled with MS Visual Studio 2010 EE SP1 or with CLang++ 2.9:

true 
true
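
The same converter can be wrapped into the two functions the question asks for. This is a minimal sketch, not the only way to do it: the names encode_utf8 and decode_utf8 are taken from the question, and the by-value return style is an assumption. Note that std::wstring_convert and <codecvt> were deprecated in C++17, so newer compilers may warn, and that from_bytes and to_bytes throw std::range_error when a conversion fails.

```cpp
#include <codecvt>
#include <locale>
#include <string>

// Sketch: wide string -> UTF-8 bytes, using the standard converter shown above.
std::string encode_utf8(const std::wstring& ws)
{
    std::wstring_convert<std::codecvt_utf8<wchar_t>> conv;
    return conv.to_bytes(ws);    // throws std::range_error on failure
}

// Sketch: UTF-8 bytes -> wide string.
std::wstring decode_utf8(const std::string& s)
{
    std::wstring_convert<std::codecvt_utf8<wchar_t>> conv;
    return conv.from_bytes(s);   // throws std::range_error on failure
}
```

With these in place, the snippet from the question works as written: encode_utf8(ws) yields the eight UTF-8 bytes, and decode_utf8 on those bytes gives back the original four code points.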

Boost.Locale was released in Boost 1.48 (November 15, 2011), making it easier to convert to and from UTF-8/16. Here are some convenient examples from the documentation:

string utf8_string = to_utf<char>(latin1_string,\"Latin1\");
wstring wide_string = to_utf<wchar_t>(latin1_string,\"Latin1\");
string latin1_string = from_utf(wide_string,\"Latin1\");
string utf8_string2 = utf_to_utf<char>(wide_string);

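Almost as easy as Python's encode/decode :) Note that Boost.Locale is not a header-only library.

For a replacement for std::string/std::wstring that handles UTF-8, see TINYUTF8. Combined with <codecvt>, you can convert from pretty much any encoding to UTF-8 and back, and then handle the result through that library.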
