How do I remove special characters that show up as `\uxxxx` in a Python 3 string object?

Problem description

The Python string object looks like this:

The site of the old observatory in Bern \u200bis the point of origin of the CH1903 coordinate system at 46°57′08.66″N 7°26′22.50″E\ufeff / \ufeff46.9524056°N 7.4395833°E\ufeff / 46.9524056; 7.4395833.

I want to remove the characters that show up as raw Unicode escapes, \u200b and \ufeff.

Solution

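Encode it to ASCII and ignore errors. Note that this drops every non-ASCII character, so the °, ′ and ″ symbols are removed as well, and the result is a bytes object: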

>>> s = 'The site of the old observatory in Bern \u200bis the point of origin of the CH1903 coordinate system at 46°57′08.66″N 7°26′22.50″E\ufeff / \ufeff46.9524056°N 7.4395833°E\ufeff / 46.9524056; 7.4395833'
>>> s.encode('ascii','ignore')
b'The site of the old observatory in Bern is the point of origin of the CH1903 coordinate system at 465708.66N 72622.50E / 46.9524056N 7.4395833E / 46.9524056; 7.4395833'
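If only the two invisible characters from the question should go and the degree, minute and second symbols should stay, `str.translate` with a deletion table is one option. This is a minimal sketch on a shortened sample string; the names `INVISIBLE` and `strip_invisible` are made up for illustration and are not part of the original answer.

```python
# Deletion table: mapping a code point to None tells str.translate to drop it.
# U+200B is ZERO WIDTH SPACE, U+FEFF is ZERO WIDTH NO-BREAK SPACE (BOM).
INVISIBLE = {0x200B: None, 0xFEFF: None}


def strip_invisible(text):
    """Remove only the characters listed in INVISIBLE; keep everything else."""
    return text.translate(INVISIBLE)


# Shortened sample based on the string in the question.
s = 'Bern \u200bis at 46°57′08.66″N 7°26′22.50″E\ufeff / \ufeff46.9524056°N'
cleaned = strip_invisible(s)
print(cleaned)
```

Unlike the ASCII round trip, the result is still a `str`, and legitimate non-ASCII characters such as ° survive.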

To keep the string the same length after the non-ASCII characters are removed, you can pad the result with spaces:

#length of original string

>>> s = 'The site of the old observatory in Bern \u200bis the point of origin of the CH1903 coordinate system at 46°57′08.66″N 7°26′22.50″E\ufeff / \ufeff46.9524056°N 7.4395833°E\ufeff / 46.9524056; 7.4395833'
>>> len(s)
179

#to maintain the same length

>>> new_s = s.encode('ascii',errors='ignore').decode('utf-8')
>>> final_s = new_s + ' ' * (len(s) - len(new_s))
>>> final_s
'The site of the old observatory in Bern is the point of origin of the CH1903 coordinate system at 465708.66N 72622.50E / 46.9524056N 7.4395833E / 46.9524056; 7.4395833            '
>>> len(final_s)
179

This appends extra spaces at the end of the string so that its length stays the same.
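If each removed character should instead be replaced by a space at its original position, so that the remaining text keeps its offsets, a per-character pass is one way to do it. This is a sketch under that assumption, again on a shortened sample; `blank_non_ascii` is a hypothetical helper name.

```python
def blank_non_ascii(text):
    """Replace every non-ASCII character with a single space, in place."""
    return ''.join(ch if ord(ch) < 128 else ' ' for ch in text)


# Shortened sample based on the string in the question.
s = 'Bern \u200bis at 46°57′N\ufeff'
padded = blank_non_ascii(s)

# Same length as the input, and every remaining character is ASCII.
print(len(s) == len(padded))
print(repr(padded))
```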

python-3.x python-unicode regex