TensorFlow cannot split Unicode UTF-8 characters

Problem description

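I use tf.strings.unicode_split to split text into characters. With English characters it works fine:

import tensorflow as tf

example_texts = ['hello world']
# Split each string into one UTF-8 byte string per character.
chars = tf.strings.unicode_split(example_texts, input_encoding='UTF-8')
print(chars)

<tf.RaggedTensor [[b'h', b'e', b'l', b'l', b'o', b' ', b'w', b'o', b'r', b'l', b'd']]>

But when I switch to non-English UTF-8 characters, it does not behave the way it does with English characters.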

Thanks.

Solution

From the comments

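Apparently, the characters are encoded as UTF-8. The same thing happens in the English example (the characters are byte strings, note the b prefix), and you don't seem to mind it there. To see the Persian character you tried, run this: b'\xd8\xb3'.decode('utf8') == 'س', just like b'h'.decode('utf8') == 'h' (quoted from lenz)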
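To make the comment concrete, here is a minimal pure-Python sketch of the same idea. It does not call TensorFlow: split_utf8_chars is a hypothetical helper that only mimics the shape of the output shown in the question (one UTF-8 byte string per character), and the Persian word "سلام" is an assumed example input, not taken from the question.

```python
def split_utf8_chars(text):
    """Hypothetical helper (not a TensorFlow API): return one UTF-8
    encoded byte string per character of `text`."""
    return [ch.encode("utf-8") for ch in text]


# ASCII characters occupy a single byte each, so the byte strings look readable.
english = split_utf8_chars("hello")
print(english)  # [b'h', b'e', b'l', b'l', b'o']

# Persian characters occupy two bytes each in UTF-8, so Python prints them
# as escape sequences. The split itself is still one element per character.
persian = split_utf8_chars("سلام")  # assumed example word
print(persian)  # [b'\xd8\xb3', b'\xd9\x84', b'\xd8\xa7', b'\xd9\x85']

# Decoding each byte string recovers the original characters.
decoded = [piece.decode("utf-8") for piece in persian]
print("".join(decoded) == "سلام")  # True
```

With real TensorFlow output the same decoding step should apply: in eager mode the byte strings obtained from the result (for example via .numpy()) can be passed through .decode('utf-8'). If integer code points are preferred over byte strings, tf.strings.unicode_decode is probably the closer fit.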

python python-unicode tensorflow unicode