Scraping text after a blockquote with bs4

Question

I have something like this in my HTML:

<p align="left"><strong><tt>
        some text:</tt></strong><tt> (8/4)</tt><a href="some link"><tt>some other text</tt></a><tt>,(9/4)</tt><a href="some other link"><tt><br/>
        some text:</tt></strong><tt>,(19/6)</tt><!--a href="some link in comment"--><tt>text after comment</tt></p></blockquote></blockquote><tt>,</tt><a href="link i want"><tt>text i want</tt></a><strong><tt><br/>
...
</p>

My Python code:

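import requests
from bs4 import BeautifulSoup

# site holds the URL of the page (not shown in the question)
page = requests.get(site)
soup = BeautifulSoup(page.content, 'html.parser')
rounds = soup.find('p', align="left")
matches_links = rounds.find_all('a')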

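I get all the links and text up to "some comment". After </blockquote></blockquote> I get nothing. These two blockquote tags are not visible in the page source; I can only see them in soup when debugging the Python code. soup contains all of the HTML, but in rounds the markup ends at <tt>text after comment</tt></p>.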

Is there any way to get "link i want" and "text i want"?

Answer

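If you look at the HTML, there is a </p> right before </blockquote></blockquote>, so the <p> element is closed at that point. This means your variable rounds does not contain the link you want; that link is a sibling that follows the <p>, not a descendant of it. Search for the next <a> tag after this <p>: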

from bs4 import BeautifulSoup


txt = '''
<p align="left"><strong><tt>
        some text:</tt></strong><tt> (8/4)</tt><a href="some link"><tt>some other text</tt></a><tt>,(9/4)</tt><a href="some other link"><tt><br/>
        some text:</tt></strong><tt>,(19/6)</tt><!--a href="some link in comment"--><tt>text after comment</tt></p></blockquote></blockquote><tt>,</tt><a href="link i want"><tt>text i want</tt></a><strong><tt><br/>
...
</p>
'''

soup = BeautifulSoup(txt,'html.parser')

matched_link = soup.select_one('p[align="left"] ~ a')
print(matched_link)

Prints:

<a href="link i want"><tt>text i want</tt></a>
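
Since the question asks for both the link and its text, here is a minimal sketch of pulling them out of the matched tag. It uses a shortened version of the markup above (the name html and the trimmed snippet are my own, for illustration only) and assumes html.parser ignores the stray </blockquote> closing tags, as it does for the snippet in the answer. find_next_sibling is shown as an alternative to the CSS selector when you already hold the rounds tag.

```python
from bs4 import BeautifulSoup

# Shortened version of the markup from the answer (illustrative only).
html = '''
<p align="left"><a href="some link"><tt>some other text</tt></a><tt>text after comment</tt></p></blockquote></blockquote><tt>,</tt><a href="link i want"><tt>text i want</tt></a><strong><tt><br/>
...
</p>
'''

soup = BeautifulSoup(html, 'html.parser')
rounds = soup.find('p', align='left')

# find_next_sibling only looks at siblings that follow the <p>,
# so the <a> tags inside the <p> are skipped.
link = rounds.find_next_sibling('a')

href = link['href']               # value of the href attribute
text = link.get_text(strip=True)  # text of the nested <tt>
print(href)
print(text)

# select() with the same selector as in the answer returns every
# <a> sibling that follows the <p>, not just the first one.
all_after = [(a['href'], a.get_text(strip=True))
             for a in soup.select('p[align="left"] ~ a')]
print(all_after)
```

If the real page has several links after the closing </p>, the select() variant collects all of them, while select_one and find_next_sibling stop at the first.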

beautifulsoup blockquote python web-scraping