Why does Python print incorrect Unicode characters?

Problem description

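I am trying to print the text inside the span tag of an Instagram comment, but the text is encoded as Unicode. The result should be "ubermensch112358". From what I have read online recently, I should be able to simply print the string, but Python seems to misinterpret the Unicode, because I get a bunch of emojis instead of text. I have also noticed that it sometimes prints the correct Unicode characters.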

from selenium import webdriver
from time import sleep
from bs4 import BeautifulSoup

# LINK TO POST
url = 'https://www.instagram.com/p/CDxso14nbF9JI1Rds7_gJ5ECzZat-AA5LiXUKM0/'

# Calling webdriver and putting the file path to where I have 
# chromedriver located
driver = webdriver.Chrome('/Users/brown/chromedriver')

driver.get(url)
sleep(2)

html = driver.page_source
soup2 = BeautifulSoup(html,'html.parser')

comnt_html = soup2.find(class_='XQXOT')
comntr_parent_html = comnt_html.findAll('ul',class_='Mr508')

# Added a counter to make troubleshooting easier.
counter = 1

for child in comntr_parent_html:
    comntr_child_html = child.find(class_='C4VMK')
    to_be_trashed1 = comntr_child_html.findAll(class_='_6lAjh')
    to_be_trashed2 = comntr_child_html.findAll(
        class_='Igw0E IwRSH eGOV_ _4EzTm pjcA_ aGBdT')

    # Text/username from each comment.
    for html_class2 in to_be_trashed2:
        html_class2.decompose()
        comntr_text = comntr_child_html.get_text()
        print(comntr_text)

        print('~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~',counter)
        counter += 1

Result:

ubermensch112358?? ????? ??? ?????? ???? ?????????????.

HTML source

<span class="">?? ????? ??? ?????? ???? ?????????????.</span>

Solution

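The code works as expected. The problem appears to be caused by the editor I was using, Visual Studio Code (1.48.0) with a bash shell. When I ran the code in Jupyter Notebook, it gave me the expected string made up of the correct Unicode characters. I have not found a proper fix yet, but entering chcp 65001 in the terminal seems to be a workaround that others have used. As I mentioned above, the odd part is that the Unicode characters are occasionally displayed correctly, so I do not know what causes VS Code to behave this way.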
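
One way to check whether the string or the terminal is at fault is to look at the code points directly and at the encoding Python uses for stdout. The following is only a minimal sketch, not a confirmed fix for the VS Code behaviour described above: the helper names describe_code_points and use_utf8_stdout are hypothetical, the sample string is made up, and reconfigure() assumes Python 3.7 or newer.

```python
import sys
import unicodedata


def describe_code_points(text):
    """Return one 'U+XXXX NAME' entry per character, using ASCII only,
    so the result is readable even when the terminal cannot render the text."""
    return [
        "U+{:04X} {}".format(ord(ch), unicodedata.name(ch, "<unnamed>"))
        for ch in text
    ]


def use_utf8_stdout(stream=None):
    """Switch a text stream (sys.stdout by default) to UTF-8 when the stream
    supports reconfigure() (Python 3.7+), and return the resulting encoding."""
    stream = sys.stdout if stream is None else stream
    if hasattr(stream, "reconfigure"):
        # errors="replace" avoids UnicodeEncodeError on unencodable characters.
        stream.reconfigure(encoding="utf-8", errors="replace")
    return getattr(stream, "encoding", None)


if __name__ == "__main__":
    print("stdout encoding before:", sys.stdout.encoding)
    print("stdout encoding after:", use_utf8_stdout())

    sample = "\u03b1\u03b2"  # hypothetical stand-in for comntr_text
    print(ascii(sample))     # escaped form, independent of the terminal
    for entry in describe_code_points(sample):
        print(entry)
```

If the code points are the expected ones, the string itself is fine and the garbling happens at display time. Setting the environment variable PYTHONIOENCODING=utf-8 before starting Python is another way to influence the stdout encoding; whether the terminal then renders the characters still depends on its own code page and font.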