Accessing image data URLs that are only obtained when the page is rendered

Problem description

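After a page has rendered, I would like to automatically retrieve the images that the browser holds as data, using their corresponding data URLs.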

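For example:

Go to the following page: https://en.wikipedia.org/wiki/Truck
Using the Web Inspector in Firefox, select the first thumbnail on the right.
In the "Inspector" tab, right-click the img tag, go to "Copy", and choose "Image Data-URL".
Open a new tab, paste, and press Enter to view the image from the data URL.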

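Note that the data URL is not available in the page source. On the website I want to scrape, the images are rendered after passing through a PHP script. If I try to access an image directly through the src attribute of the tag, the server returns a 404 response.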

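I believe it should be possible to list the data URLs of the images rendered by the website and download them, but I could not find a way to do it.

I usually scrape with Selenium WebDriver driving Firefox from Python, but any solution is welcome.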

Solutions

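BeautifulSoup is a good fit for this kind of problem. When the data you want is present in the HTML the server returns, you can reach for BeautifulSoup first, because it is faster than Selenium. In my timing, BeautifulSoup took about 10 seconds to complete this task, while Selenium took roughly 15-20 seconds for the same task. Here is how to do it with BeautifulSoup: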

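from bs4 import BeautifulSoup
import requests
import time

st = time.time()

# Fetch the raw HTML; no JavaScript runs, so only the static markup is seen
src = requests.get('https://en.wikipedia.org/wiki/Truck').text
soup = BeautifulSoup(src, 'html.parser')

# Each thumbnail on the page sits inside a <div class="thumbinner">
divs = soup.find_all('div', class_="thumbinner")

count = 1
for x in divs:
    # srcset looks like "<url> 1.5x, <url> 2x"; keep the 2x candidate
    url = x.a.img['srcset']
    url = url.split('1.5x,')[-1]
    url = url.split('2x')[0]
    # The URL is protocol-relative ("//..."), so add a scheme and strip spaces
    url = "https:" + url.replace(" ", "")

    response = requests.get(url)

    # Files get a .png name regardless of the actual image format
    path = f"D:\\Truck_Img_{count}.png"
    with open(path, "wb") as file:
        file.write(response.content)

    count += 1

print(f"Execution Time = {time.time()-st} seconds")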

Output:

Execution Time = 9.65831208229065 seconds

This saved 29 images.

Hope this helps!
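The string splitting on `srcset` in the answer above depends on the exact "1.5x, ... 2x" layout. As a rough sketch of a less layout-dependent alternative, the helper below picks the candidate with the highest pixel-density descriptor. The function name `pick_densest_candidate` and the example.org URLs are hypothetical, and the sketch assumes candidates are separated by a comma followed by whitespace, as in the Wikipedia markup.

```python
import re


def pick_densest_candidate(srcset: str) -> str:
    """Return the srcset URL with the highest "Nx" density, or "" if none."""
    best_url, best_density = "", 0.0
    # Assumption: candidates are separated by a comma followed by whitespace
    for candidate in re.split(r",\s+", srcset.strip()):
        parts = candidate.split()
        if not parts:
            continue
        url = parts[0]
        # A candidate without a descriptor counts as 1x
        descriptor = parts[1] if len(parts) > 1 else "1x"
        if not descriptor.endswith("x"):
            continue  # width descriptors such as "480w" are skipped in this sketch
        try:
            density = float(descriptor[:-1])
        except ValueError:
            continue
        if density > best_density:
            best_url, best_density = url, density
    # Protocol-relative URLs ("//host/path") need a scheme before fetching
    if best_url.startswith("//"):
        best_url = "https:" + best_url
    return best_url


if __name__ == "__main__":
    # Hypothetical srcset value, shaped like the ones on the Truck page
    example = "//example.org/330px-Truck.jpg 1.5x, //example.org/440px-Truck.jpg 2x"
    print(pick_densest_candidate(example))
```

In the loop above, the three `split` lines and the `"https:"` prefix could then be replaced by a single call on `x.a.img['srcset']`.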

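I managed to develop a solution using the Chrome webdriver with CORS checks disabled; I used Chrome because I could not find a command-line argument to disable them in Firefox.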

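The solution executes some JavaScript to redraw the image on a new canvas element, then uses the toDataURL method to obtain the data URL. To save the image, I convert the base64 data to binary and write it out as a PNG.

This evidently solved the problem for my use case.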

Code to get the first truck image

from binascii import a2b_base64
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

# Relax same-origin checks so cross-origin images do not taint the canvas
chrome_options = Options()
chrome_options.add_argument("--disable-web-security")
chrome_options.add_argument("--disable-site-isolation-trials")

driver = webdriver.Chrome(options=chrome_options)
driver.get("https://en.wikipedia.org/wiki/Truck")

img = driver.find_element(By.XPATH, "/html/body/div[3]/div[3]"
                                    "/div[5]/div[1]/div[4]/div"
                                    "/a/img")
img_base64 = driver.execute_script(
    """
    const img = arguments[0];

    // Redraw the rendered image on a fresh canvas of the same size
    const canvas = document.createElement('canvas');
    const ctx = canvas.getContext('2d');
    canvas.width = img.width;
    canvas.height = img.height;
    ctx.drawImage(img, 0, 0);

    // Serialize the canvas contents as a base64 PNG data URL
    return canvas.toDataURL('image/png');
    """, img)

# Drop the "data:image/png;base64," prefix and decode the payload to bytes
binary_data = a2b_base64(img_base64.split(',')[1])
with open('image.png', 'wb') as save_img:
    save_img.write(binary_data)

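In addition, I found that the data URL obtained through the process described in the question is generated on demand by the Firefox Web Inspector. So, contrary to what I originally thought, there is no list of data URLs to retrieve (they are not in the page source).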
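For reference, the data-URL-to-bytes step can be sketched as a small standalone helper. This is a minimal illustration rather than part of the original solution: the name `decode_data_url` is hypothetical, and it assumes the usual `data:<media type>[;base64],<payload>` layout that `toDataURL` produces.

```python
import base64
from typing import Tuple
from urllib.parse import unquote_to_bytes


def decode_data_url(data_url: str) -> Tuple[str, bytes]:
    """Split a data URL into (media type, raw bytes)."""
    if not data_url.startswith("data:"):
        raise ValueError("not a data URL")
    # Everything before the first comma is the header, the rest is the payload
    header, sep, payload = data_url[len("data:"):].partition(",")
    if not sep:
        raise ValueError("data URL has no comma between header and payload")
    is_base64 = header.endswith(";base64")
    media_type = header[:-len(";base64")] if is_base64 else header
    # An empty media type falls back to the default for data URLs
    media_type = media_type or "text/plain;charset=US-ASCII"
    if is_base64:
        raw = base64.b64decode(payload)
    else:
        # Non-base64 payloads are percent-encoded
        raw = unquote_to_bytes(payload)
    return media_type, raw


if __name__ == "__main__":
    # Build a tiny data URL from the 8-byte PNG signature, then decode it again
    signature = b"\x89PNG\r\n\x1a\n"
    url = "data:image/png;base64," + base64.b64encode(signature).decode("ascii")
    media_type, raw = decode_data_url(url)
    print(media_type, raw == signature)
```

Checking the returned media type before writing the file avoids saving, say, JPEG bytes under a .png name.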

phantomjs selenium selenium-webdriver web-scraping