Python – 需要帮助:将 <img> 的 src 以 CSV 格式存储,并从 CSV 列表下载图像

问题描述

我需要帮助。

当前,此代码从所需页面上的所有 <img> 标签获取所有 src 属性,将这些 URL 存储到一个 CSV 文件中(很杂乱,见 https://i.imgur.com/w1slgf6.png),然后只从第一个 URL 下载了第一张图像。

这很好,但是我想下载所有照片,而不仅仅是第一张。(并希望代码能在运行后清理该 CSV 文件。)

旁注:我知道我不需要创建 CSV 即可下载图像。我的目标是先将所有 img URL 存储到 CSV 中,然后再从 CSV 中的 URL 下载图像。

感谢任何帮助!

from bs4 import BeautifulSoup
from time import sleep
import urllib.request
import pandas as pd
import requests
import urllib
import base64
import csv
import time





# Get site
headers = {
    'Access-Control-Allow-Origin': '*',
    'Access-Control-Allow-Methods': 'GET',
    'Access-Control-Allow-Headers': 'Content-Type',
    'Access-Control-Max-Age': '3600',
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0'
}
# NOTE(review): `driver` is a Selenium WebDriver created elsewhere in the
# full script — this snippet assumes it is already on the target page.
page = driver.page_source
# Explicit parser keeps bs4 from emitting a "no parser specified" warning
# and makes parsing consistent across environments.
soup = BeautifulSoup(page, 'html.parser')
# Gets srcs from all <img> from site
srcs = [img['src'] for img in soup.findAll('img')]


print('Downloading URLs to file')
sleep(1)
# BUG FIX: the original did writer.writerow(srcs), which put EVERY url on a
# single CSV row. The read-back loop below therefore saw one line and only
# one image was downloaded. Write one "index,url" row per image instead.
with open('output.csv', 'w', newline='', encoding='utf-8') as csvfile:
    writer = csv.writer(csvfile)
    for i, src in enumerate(srcs):
        writer.writerow([i, src])


print('Downloading images to folder')
sleep(1)

filename = "output"

with open("{0}.csv".format(filename), 'r', newline='', encoding='utf-8') as csvfile:
    # csv.reader handles quoting/newlines correctly, unlike a raw
    # line.split(',') (which also kept a trailing "\n" on every url).
    for row in csv.reader(csvfile):
        # row == [index, url]; skip blank or malformed rows.
        if len(row) < 2 or not row[1].strip():
            print("No result for {0}".format(row[0] if row else ''))
            continue
        urllib.request.urlretrieve(row[1].strip(), "img_" + row[0] + ".png")
        print("Image saved for {0}".format(row[0]))

解决方法

这是一个不使用 CSV 的解决方案:

import os
import requests
import urllib.request
from bs4 import BeautifulSoup

page = requests.get('https://igromania.ru').text
# Explicit parser avoids bs4's "no parser specified" warning and makes the
# chosen parser deterministic across environments.
soup = BeautifulSoup(page, 'html.parser')
tags = soup.findAll('img')

for tag in tags:
    url = tag['src']
    # FIX: protocol-relative srcs ("//cdn...") made urlretrieve raise
    # ValueError (visible in the sample output below) — give them a scheme.
    if url.startswith('//'):
        url = 'https:' + url
    try:
        # Save under the last path component of the URL.
        urllib.request.urlretrieve(url, os.path.basename(url))
        print(f'Image downloaded: {url}')
    except (ValueError, OSError):
        # ValueError: unusable URL; OSError (incl. URLError): network/HTTP
        # failure — the original's bare ValueError let those crash the loop.
        print(f'Error downloading: {url}')

样本输出:

Error downloading: //cdn.igromania.ru/-Engine-/SiteTemplates/igromania/images/logo_mania.png
Image downloaded: https://cdn.igromania.ru/mnt/mainpage_promo/b/8/b/2904/preview/3d0a4043f5dfd3e9443ce0b27d2a8329_400x225.jpg
Image downloaded: https://cdn.igromania.ru/mnt/mainpage_promo/7/c/7/3124/preview/8df8f4505157e4928187b5450c03e82b_400x225.jpg
Image downloaded: https://cdn.igromania.ru/mnt/mainpage_promo/c/6/8/2912/preview/4a70f416181b77f6b543053ea8e5d300_400x225.jpg
Image downloaded: https://cdn.igromania.ru/mnt/mainpage_promo/2/e/0/3123/preview/0eb2f280f1b9e089d5a12bc0df1120bc_400x225.jpg
Image downloaded: https://cdn.igromania.ru/mnt/mainpage_promo/c/9/2/3130/preview/29e962c5444f67fa95b3714c7ae7683f_400x225.jpg
,

这是保留CSV的另一种解决方案。

from bs4 import BeautifulSoup
from time import sleep
import urllib.request
import pandas as pd
import requests
import urllib
import base64
import csv
import time


# Request headers (the User-Agent makes the request look like a browser).
headers = {
    'Access-Control-Allow-Origin': '*',
    'Access-Control-Allow-Methods': 'GET',
    'Access-Control-Allow-Headers': 'Content-Type',
    'Access-Control-Max-Age': '3600',
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0'
}

#page = driver.page_source
page = "https://unsplash.com/"
r = requests.get(page)
soup = BeautifulSoup(r.text, "html.parser")
# Collect the src attribute of every <img> tag on the page.
srcs = [img['src'] for img in soup.findAll('img')]

# Write one "index,url" line per image URL.

print('Downloading URLs to file')
sleep(1)
with open('output.csv', 'w', newline='\n', encoding='utf-8') as outfile:
    for idx, src in enumerate(srcs):  # each image number and URL
        outfile.write(f'{idx},{src}\n')

# Read the CSV back and download every image listed in it.

print('Downloading images to folder')
sleep(1)

filename = "output"

with open("{0}.csv".format(filename), 'r') as csvfile:
    saved = 0  # counts successfully saved images; used in the filename
    for row in csvfile:
        fields = row.split(',')
        # Only download when the URL field is non-empty.
        if fields[1] not in ('', "\n"):
            urllib.request.urlretrieve(fields[1], "img_" + str(saved) + ".png")
            print("Image saved for {0}".format(fields[0]))
            saved += 1
        else:
            print("No result for {0}".format(fields[0]))

输出(output.csv)

0,https://sb.scorecardresearch.com/p?c1=2&c2=32343279&cv=2.0&cj=1
1,https://images.unsplash.com/photo-1597523565663-916cf059f524?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&auto=format%2Ccompress&fit=crop&w=1000&h=1000
2,https://images.unsplash.com/profile-1574526450714-e5d331168827image?auto=format&fit=crop&w=32&h=32&q=60&crop=faces&bg=fff
3,https://images.unsplash.com/photo-1599687350404-88b32c067289?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&w=1000&q=80
4,https://images.unsplash.com/profile-1583427783052-3da8ceab5579image?auto=format&fit=crop&w=32&h=32&q=60&crop=faces&bg=fff
5,https://images.unsplash.com/photo-1600181957705-92f267a2740e?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&w=1000&q=80
6,https://images.unsplash.com/profile-1545567671893-842f479b15e2?auto=format&fit=crop&w=32&h=32&q=60&crop=faces&bg=fff
7,https://images.unsplash.com/photo-1600187723541-04457a98cc47?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&w=1000&q=80
8,https://images.unsplash.com/photo-1599687350404-88b32c067289?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&w=1000&q=80
9,https://images.unsplash.com/photo-1600181957705-92f267a2740e?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&w=1000&q=80
10,https://images.unsplash.com/photo-1600187723541-04457a98cc47?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&w=1000&q=80