Problem description
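Newbie here! I'm using Python 3.8.3 and trying to strip the tags from the attached text file listfile.txt.
I want to extract three lists (the article titles, the publication dates, and the body text) with the tags removed. With the code below I have been able to strip the tags from the titles and the publication dates, but I can't correctly strip all the tags from the body text. In the file, the body text starts with the tag <div class="story-element story-element-text">
and ends at the next
Any help with extracting this part of the text would be much appreciated! The article body is in a non-English script, but all the HTML tags are in English.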
import re

# Open the text file containing newspaper article HTML scraped from the website with BeautifulSoup
with open('listfile.txt', 'r', encoding='utf8') as my_file:
    text = my_file.read()
print(text)
# Strip tags and build the list of article titles
titles = re.findall('<h1.*?>(.*?)</h1>', text)
print(titles)
# Strip tags and build the list of publication dates
dates = re.findall('<div class=\"storyPageMetaData-m__publish-time__19bdV\"><span>(.*?)</span>', text)
print(dates)
# Strip tags and build the list of article body texts. This is where the code is incorrect
bodytext = re.findall('<div class=\"story-element story-element-text\">(.*?)</div>', text)
print(bodytext)
Solution
I think you're using the wrong tool. I suggest switching to bs4 (BeautifulSoup); I promise you'll like it.
from bs4 import BeautifulSoup
raw_html = "YOUR RAW HTML"
soup = BeautifulSoup(raw_html,"html.parser")
titles = [h1_tag.text for h1_tag in soup.select('h1')]
dates = [span_tag.text for span_tag in soup.select('div.storyPageMetaData-m__publish-time__19bdV > span')]
bodytext = [div_tag.text for div_tag in soup.select('div.story-element.story-element-text')]
Enjoy!
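To see the selectors above working end to end, here is a minimal sketch run against a small inline sample. The `sample_html` string, its English placeholder text, and the nested inner `<div>` are assumptions made for illustration; the real contents of listfile.txt are not shown in the question. It uses `get_text(" ", strip=True)` rather than `.text` so that whitespace between tags is trimmed.

```python
from bs4 import BeautifulSoup

# Hypothetical sample standing in for the contents of listfile.txt.
sample_html = """
<h1 class="headline">Sample title</h1>
<div class="storyPageMetaData-m__publish-time__19bdV"><span>01 Jan 2020</span></div>
<div class="story-element story-element-text">
  <div>
    <p>First paragraph.</p>
    <p>Second paragraph.</p>
  </div>
</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# get_text(strip=True) drops leading/trailing whitespace from each text node.
titles = [h1.get_text(strip=True) for h1 in soup.select("h1")]
dates = [
    span.get_text(strip=True)
    for span in soup.select("div.storyPageMetaData-m__publish-time__19bdV > span")
]
# Passing " " as the separator joins the text of nested tags with single spaces.
bodytext = [
    div.get_text(" ", strip=True)
    for div in soup.select("div.story-element.story-element-text")
]

print(titles)
print(dates)
print(bodytext)
```

Because the parser tracks nesting, the inner `<div>` does not cut the body text short, which is the case where a non-greedy regex ending at `</div>` tends to go wrong.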
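If you still want to use RegEx, use this pattern to capture the h1 tags in your text file: <h1(.*?)</h1>
I'm not familiar with how to set up regular expressions in Python, but this works in JavaScript.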
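For reference, a rough Python equivalent might look like the sketch below. It is not part of the original answer: the `sample_html` string is a made-up stand-in for the file contents, and the pattern is tightened to `<h1[^>]*>(.*?)</h1>` so the capture group holds only the heading text rather than the tag's attributes. The key detail is `re.DOTALL`, without which `.` does not match newlines and multi-line elements are silently skipped.

```python
import re

# Hypothetical sample; the real file contents are not shown in the question.
sample_html = (
    '<h1 class="headline">Sample\ntitle</h1>\n'
    '<div class="story-element story-element-text">\n'
    '<p>First paragraph.</p>\n'
    '<p>Second paragraph.</p>\n'
    '</div>'
)

h1_pattern = r'<h1[^>]*>(.*?)</h1>'

# Without re.DOTALL, "." stops at the newline inside the heading: no match.
titles_without_dotall = re.findall(h1_pattern, sample_html)

# With re.DOTALL, "." also matches newlines, so the heading is captured.
titles = re.findall(h1_pattern, sample_html, flags=re.DOTALL)

# Same idea for the body: capture up to the first closing </div> ...
body_pattern = r'<div class="story-element story-element-text">(.*?)</div>'
body_blocks = re.findall(body_pattern, sample_html, flags=re.DOTALL)

# ... then replace any remaining tags with spaces and collapse whitespace.
bodytext = [
    " ".join(re.sub(r'<[^>]+>', ' ', block).split())
    for block in body_blocks
]

print(titles_without_dotall)
print(titles)
print(bodytext)
```

Note that the non-greedy body pattern stops at the first `</div>` it meets, so if the body contains nested `<div>` elements the capture is truncated. That limitation is the main reason an HTML parser is usually the more robust choice here.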