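Problem description
I'm trying to print the text inside a span tag in an Instagram comment, but the text is decoded as Unicode. The result should be "ubermensch112358". From what I've read online recently, I should be able to simply print the string, but Python seems to misinterpret the Unicode, because I get a bunch of emoji instead of the text. I've also noticed that it sometimes prints the correct Unicode characters.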
from selenium import webdriver
from time import sleep
from bs4 import BeautifulSoup
# LINK TO POST
url = 'https://www.instagram.com/p/CDxso14nbF9JI1Rds7_gJ5ECzZat-AA5LiXUKM0/'
# Calling webdriver and putting the file path to where I have
# chromedriver located
driver = webdriver.Chrome('/Users/brown/chromedriver')
driver.get(url)
sleep(2)
html = driver.page_source
soup2 = BeautifulSoup(html,'html.parser')
comnt_html = soup2.find(class_='XQXOT')
comntr_parent_html = comnt_html.findAll('ul',class_='Mr508')
# Added a counter to make troubleshooting easier.
counter = 1
for child in comntr_parent_html:
    comntr_child_html = child.find(class_='C4VMK')
    to_be_trashed1 = comntr_child_html.findAll(class_='_6lAjh')
    to_be_trashed2 = comntr_child_html\
        .findAll(class_='Igw0E IwRSH eGOV_ _4EzTm pjcA_ aGBdT')
    # Text/username from each comment.
    for html_class2 in to_be_trashed2:
        html_class2.decompose()
    comntr_text = comntr_child_html.get_text()
    print(comntr_text)
    print('~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~',counter)
    counter += 1
Result:
ubermensch112358?? ????? ??? ?????? ???? ?????????????.
HTML source:
<span class="">?? ????? ??? ?????? ???? ?????????????.</span>
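Solution
The code works as intended. The problem appears to be caused by the environment I was running it in: Visual Studio Code (1.48.0) with a bash shell. When I ran the code in Jupyter Notebook, it gave me the expected string made up of the correct Unicode characters. I haven't found a fix for VS Code yet, but entering chcp 65001 in the terminal (which switches the Windows console code page to UTF-8) seems to have worked for other people. As I mentioned above, the odd thing is that the Unicode characters were occasionally displayed correctly, so I don't know what causes VS Code to behave this way.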
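To illustrate why the terminal's encoding matters, here is a rough sketch showing how the same string turns into question marks when it is written with an encoding that cannot represent it. The `encode_for_terminal` helper and the `sample` string are hypothetical, and the `reconfigure` call assumes Python 3.7 or later; whether it resolves the VS Code case described above is untested.

```python
import sys

def encode_for_terminal(text, encoding):
    """Return the bytes a stream using this encoding would emit for text.

    Characters the encoding cannot represent are replaced with '?'.
    """
    return text.encode(encoding, errors='replace')

# Hypothetical sample: an accented letter and a mathematical bold capital A.
sample = 'caf\u00e9 \U0001d400'

# The encoding Python currently uses when writing to stdout.
print('stdout encoding:', sys.stdout.encoding)

# An ASCII-only stream loses both non-ASCII characters; UTF-8 keeps them.
print(encode_for_terminal(sample, 'ascii'))
print(encode_for_terminal(sample, 'utf-8'))

# Python 3.7+: ask stdout to emit UTF-8 from here on. The terminal still
# has to interpret the bytes as UTF-8 and have a font with the glyphs.
if hasattr(sys.stdout, 'reconfigure'):
    sys.stdout.reconfigure(encoding='utf-8', errors='replace')
```

Setting the `PYTHONIOENCODING=utf-8` environment variable before starting Python has a similar effect without changing the script.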