How to add "https:" to src only when it is missing, in Python bs4

Problem description

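When I scrape content from a website, some links in the src attribute have no scheme (they start with // instead of https://), so I added this code: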

from bs4 import BeautifulSoup


html = """
<div class="answer-given-body ugc-base">
  <p><img alt="" src="//d2vlcm61l7u1fs.cloudfront.net/media%2F61d%2F61d6042d-e4dd-41d9-9a5c-0ceb481ddbc9%2FPHPKFGb9B.png"/><img alt="" src="//d2vlcm61l7u1fs.cloudfront.net/media%2Fd72%2Fd72dfa6c-8e50-475a-86cf-678a04ae4606%2FPHPQZYPYo.png"/><img alt="" src="//d2vlcm61l7u1fs.cloudfront.net/media%2F4c7%2F4c775a01-8590-4b93-bc20-03d282586f95%2FPHPE7XFWI.png"/></p>
  </div>
"""

soup = BeautifulSoup(html,"html.parser")

# Select all the `img` tags
for tag in soup.select(".answer-given-body.ugc-base img"):
    tag["src"] = "https:" + tag["src"]

print(soup.prettify())

But some links already have https: in src, and this code then adds https: to those links a second time. See:

from bs4 import BeautifulSoup


html = """
<div class="answer-given-body ugc-base">
  <p><img alt="" src="//d2vlcm61l7u1fs.cloudfront.net/media%2F61d%2F61d6042d-e4dd-41d9-9a5c-0ceb481ddbc9%2FPHPKFGb9B.png"/><img alt="" src="https://d2vlcm61l7u1fs.cloudfront.net/media%2Fd72%2Fd72dfa6c-8e50-475a-86cf-678a04ae4606%2FPHPQZYPYo.png"/><img alt="" src="//d2vlcm61l7u1fs.cloudfront.net/media%2F4c7%2F4c775a01-8590-4b93-bc20-03d282586f95%2FPHPE7XFWI.png"/></p>
  </div>
"""

soup = BeautifulSoup(html,"html.parser")

# Select all the `img` tags
for tag in soup.select(".answer-given-body.ugc-base img"):
    tag["src"] = "https:" + tag["src"]

print(soup.prettify())

So I need to put an if condition there, but I don't know how to add it. Please help, thanks.

Solution

I'm not aware of anything built into BeautifulSoup for this, but you can simply add a condition based on the URL you find.

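Another case to watch for is locally referenced images, whose src either starts with / or is just a file name such as "image.png". In those cases you need to prepend the URL of the page the HTML came from.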

from bs4 import BeautifulSoup


html = """
<div class="answer-given-body ugc-base">
  <p><img alt="" src="//d2vlcm61l7u1fs.cloudfront.net/media%2F61d%2F61d6042d-e4dd-41d9-9a5c-0ceb481ddbc9%2FphpKFGb9B.png"/><img alt="" src="https://d2vlcm61l7u1fs.cloudfront.net/media%2Fd72%2Fd72dfa6c-8e50-475a-86cf-678a04ae4606%2FphpQZYPYo.png"/><img alt="" src="//d2vlcm61l7u1fs.cloudfront.net/media%2F4c7%2F4c775a01-8590-4b93-bc20-03d282586f95%2FphpE7XFWI.png"/></p>
  </div>
"""

soup = BeautifulSoup(html,"html.parser")

# Select all the `img` tags
for tag in soup.select(".answer-given-body.ugc-base img"):
    tag["src"] = tag["src"] if tag["src"].startswith("http") else "https:" + tag["src"]

print(soup.prettify())
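
For the locally referenced images mentioned above, one possible approach is urllib.parse.urljoin from the standard library, which resolves a src against the page URL and leaves absolute URLs unchanged. This is only a sketch: PAGE_URL, the example.com addresses, and the helper name absolutize are assumptions made for illustration, not part of the original question.

```python
from urllib.parse import urljoin

# Hypothetical page URL (assumption): use the URL the HTML was fetched from.
PAGE_URL = "https://www.example.com/questions/123"


def absolutize(src, page_url=PAGE_URL):
    """Resolve src against page_url; already-absolute URLs pass through unchanged."""
    return urljoin(page_url, src)


# Illustrative src values covering the four cases discussed above.
samples = [
    "//cdn.example.com/a.png",        # scheme-relative: takes the page's scheme
    "https://cdn.example.com/b.png",  # already absolute: unchanged
    "/static/c.png",                  # root-relative: resolved against the host
    "d.png",                          # bare file name: resolved against the page path
]

for src in samples:
    print(src, "->", absolutize(src))
```

In the loop above, the assignment would then become tag["src"] = absolutize(tag["src"]). Note that a scheme-relative src inherits the scheme of the page URL, so it becomes https only if the page itself was fetched over https.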

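One suggested way to handle the src value is the urllib.parse module (from urllib.parse import urlparse). It splits a URL into its components, which you can then inspect and put back together before making the request: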

>>> from urllib.parse import urlparse
>>> urlparse('//www.cwi.nl:80/%7Eguido/Python.html')
ParseResult(scheme='', netloc='www.cwi.nl:80', path='/%7Eguido/Python.html', params='', query='', fragment='')
>>> urlparse('www.cwi.nl/%7Eguido/Python.html')
ParseResult(scheme='', netloc='', path='www.cwi.nl/%7Eguido/Python.html', params='', query='', fragment='')
>>> urlparse('help/Python.html')
ParseResult(scheme='', netloc='', path='help/Python.html', params='', query='', fragment='')
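
Building on the parsed components, a minimal sketch of the condition could check for a host without a scheme, which is the // case from the question. The helper name ensure_https and the example.com URLs are assumptions for illustration. _replace is the standard namedtuple method available on ParseResult, and geturl() reassembles the URL.

```python
from urllib.parse import urlparse


def ensure_https(src):
    """Add the https scheme only when src has a host but no scheme."""
    parts = urlparse(src)
    if parts.netloc and not parts.scheme:
        # Scheme-relative URL such as //host/path: fill in the scheme.
        return parts._replace(scheme="https").geturl()
    # Already absolute, or a relative path with no host: leave unchanged.
    return src


print(ensure_https("//cdn.example.com/media%2Fa.png"))
print(ensure_https("https://cdn.example.com/media%2Fb.png"))
print(ensure_https("help/Python.html"))
```

Compared with the startswith("http") check, this leaves relative paths untouched instead of prefixing them with https:, so they can be resolved against the page URL separately.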