Problem parsing special characters with bs4 on a UTF-8 encoded page

Problem description

I am trying to parse a page, but I run into problems with special characters such as é and à.

According to Firefox's Page Info tool, the page is encoded in UTF-8.

My code is as follows:

import bs4
import requests


url = 'https://www.registreentreprises.gouv.qc.ca/RQEntrepriseGRExt/GR/GR99/GR99A2_05A_PIU_AfficherMessages_PC/ActiEcon.html'

page = requests.get(url)

cae_obj_soup = bs4.BeautifulSoup(page.text,'lxml',from_encoding='utf-8')
list_all_domain = cae_obj_soup.find_all('th')

for element in list_all_domain:
    print(element.get_text())

The output is:

PÃªche et piÃ©geage
Exploitation forestiÃ¨re

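I tried changing the encoding to iso-8859-1 (a French encoding) and a few others, without success. I have read several posts about parsing special characters, and they basically all say that it comes down to choosing the correct encoding. Is it possible that the special characters on certain web pages simply cannot be decoded correctly, or am I doing something wrong?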

Solution
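
A likely explanation, offered as a sketch rather than a confirmed diagnosis: the output in the question has the shape of UTF-8 bytes decoded as ISO-8859-1. `page.text` is already a decoded string, and requests falls back to ISO-8859-1 for text responses whose Content-Type header names no charset. BeautifulSoup also ignores `from_encoding` when it is handed a string instead of bytes. The example below uses a hypothetical in-memory snippet (`raw`) in place of the real page, and `html.parser` in place of `lxml`, so that it runs without network access; both are assumptions made for illustration.

```python
import bs4

# Hypothetical stand-in for the page: UTF-8 bytes, as page.content would hold them.
raw = (
    '<table><tr><th>Pêche et piégeage</th>'
    '<th>Exploitation forestière</th></tr></table>'
).encode('utf-8')


def th_texts(soup):
    """Return the text of every <th> element in the parsed document."""
    return [th.get_text() for th in soup.find_all('th')]


# Reproduces the symptom: the bytes are decoded as ISO-8859-1 before parsing,
# which is what page.text would contain if requests guessed that encoding.
garbled = th_texts(bs4.BeautifulSoup(raw.decode('iso-8859-1'), 'html.parser'))

# Handing the raw bytes to BeautifulSoup lets from_encoding take effect.
fixed = th_texts(bs4.BeautifulSoup(raw, 'html.parser', from_encoding='utf-8'))

print(garbled)
print(fixed)
```

If this is what is happening with the page in the question, passing `page.content` (the raw bytes) instead of `page.text` to `bs4.BeautifulSoup`, or setting `page.encoding = 'utf-8'` before reading `page.text`, should produce the accented characters correctly.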


beautifulsoup character-encoding python