Extracting the src URL from an img tag with BeautifulSoup

Question

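I am trying to get the URL from an img tag's src attribute. The URL I want to extract is: https://images-na.ssl-images-amazon.com/images/I/41YEd80s6SL._SX384_BO1,200_.jpg

What comes back instead is the following, which I believe is the image itself in encoded form: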

data:image/jpeg;base64,/9j/4AAQSkZJRgABAQAAAQABAAD/2wCEABYWGBQYFBwaFhwYHBocIiceGBwgLjg0JzAlNiwsIjYsJTAlIzIsMDouNjA+TkBJPjpnUERYLkRHelJ8ZoZaUnYBDhoYGiAiGh4eIiIeICciRTAgHlIyNDgiSRQ4Hic2Jyk4HCcuMhwpPClJFj4eFFQ6RzIjRScgHiM2JxowNFY2Ov/AABEIARwA3AMBIgACEQEDEQH...

I have not included all of it, because it runs to more than 600 lines.

Here is my code:

from bs4 import BeautifulSoup
import requests

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; 64) AppleWebKit/537.36 (KHTML,like Gecko) Chrome/58.0.3029.110 Safari/537.3'}

url = "https://www.amazon.co.uk/Django-Professionals-Production-websites-Python/dp/1081582162/ref=sr_1_1?dchild=1&keywords=django+for+professionals&qid=1597167266&sr=8-1"
resp = requests.get(url,headers=headers)
soup = BeautifulSoup(resp.content,features="lxml")
product_title = soup.select("#productTitle")[0].get_text().strip()
author = soup.select(".contributorNameID")[0].get_text().strip()

images = soup.findAll('img')
for image in images:
    print (image['src'])

Edit: the other img src attributes do seem to return URLs, just not the one I am specifically after.

Answer

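I believe you can decode it like this. Only the part after the comma is base64 data, and the result is the raw image bytes rather than a URL:

import base64
encoded_image = base64.b64decode(image['src'].split(',', 1)[1])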
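
As a slightly fuller sketch of the same idea, the src value can be split into its media type and its decoded bytes. The function name decode_data_uri and the sample value are hypothetical and not from the original answer; the sketch assumes the src is a base64 data URI of the form shown in the question.

```python
import base64

def decode_data_uri(src):
    """Split a base64 data URI into (mime_type, raw_bytes)."""
    if not src.startswith("data:"):
        raise ValueError("not a data URI")
    # Everything before the first comma is the header, the rest is the payload.
    header, _, payload = src[len("data:"):].partition(",")
    if not header.endswith(";base64"):
        raise ValueError("only base64-encoded data URIs are handled here")
    mime_type = header[: -len(";base64")]
    return mime_type, base64.b64decode(payload)

# Hypothetical sample: the three bytes that open a JPEG file, base64-encoded.
sample_src = "data:image/jpeg;base64," + base64.b64encode(b"\xff\xd8\xff").decode("ascii")
mime_type, raw = decode_data_uri(sample_src)
print(mime_type, raw[:3] == b"\xff\xd8\xff")
```

Writing the returned bytes to a file opened in binary mode would give the image on disk, but it still does not recover the hosted URL the question asks for.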

To extract the https://images-na.ssl-images-amazon.com/images/I/41YEd80s6SL._SX384_BO1,204,203,200_.jpg image, you can parse the data-a-dynamic-image attribute:

import json
import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; 64) AppleWebKit/537.36 (KHTML,like Gecko) Chrome/58.0.3029.110 Safari/537.3'}

url = "https://www.amazon.co.uk/Django-Professionals-Production-websites-Python/dp/1081582162/ref=sr_1_1?dchild=1&keywords=django+for+professionals&qid=1597167266&sr=8-1"
resp = requests.get(url,headers=headers)
soup = BeautifulSoup(resp.content,features="lxml")
product_title = soup.select("#productTitle")[0].get_text().strip()
author = soup.select(".contributorNameID")[0].get_text().strip()

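# Select inline (data: URI) images; the guard skips img tags that have no src.
images = soup.find_all('img', src=lambda s: s is not None and 'data:' in s)
for image in images:
    # The attribute holds a JSON object whose keys are the image URLs.
    for img in json.loads(image.get('data-a-dynamic-image', '{}')):
        print(img)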

Prints:

https://images-na.ssl-images-amazon.com/images/I/41YEd80s6SL._SX384_BO1,200_.jpg
https://images-na.ssl-images-amazon.com/images/I/41YEd80s6SL._SX258_BO1,200_.jpg
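
If only one of these URLs is wanted, one option is to pick the largest variant. The sketch below assumes that each key in the attribute maps to a [width, height] pair; the answer above only shows that the keys are URLs, so that shape is an assumption. The function name largest_image_url and the example.com sample data are hypothetical.

```python
import json

def largest_image_url(dynamic_image_attr):
    """Return the URL whose [width, height] entry covers the most pixels."""
    # Assumed shape of the attribute value: {url: [width, height], ...}
    sizes = json.loads(dynamic_image_attr)
    if not sizes:
        return None
    return max(sizes, key=lambda url: sizes[url][0] * sizes[url][1])

# Hypothetical attribute value; the URLs and sizes are made up for illustration.
sample_attr = json.dumps({
    "https://example.com/cover_small.jpg": [258, 387],
    "https://example.com/cover_large.jpg": [384, 576],
})
print(largest_image_url(sample_attr))
```

In the loop above, this would be called as largest_image_url(image['data-a-dynamic-image']) for each matching img tag.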
