Problem description
I'm new to web scraping, so I'm not entirely sure what to do here, but I'm trying to extract images from the site at this URL:
Here is the loop that came closest to working:
For loop with a parsing function
import os
import requests
from tqdm import tqdm
from bs4 import BeautifulSoup as bs
from urllib.parse import urljoin, urlparse

url = "https://www.legacysurvey.org/viewer/data-for-radec/?ra=55.0502&dec=-18.5790&layer=ls-dr8&ralo=55.0337&rahi=55.0655&declo=-18.5892&dechi=-18.5714"

def is_valid(url):
    """
    Checks whether `url` is a valid URL.
    """
    parsed = urlparse(url)
    return bool(parsed.netloc) and bool(parsed.scheme)

def get_all_images(url):
    """
    Returns all image URLs found on a single `url`
    """
    soup = bs(requests.get(url).content, "html.parser")
    urls = []
    for img in tqdm(soup.find_all("img"), "Extracting images"):
        img_url = img.attrs.get("src")
        if not img_url:
            # if img does not contain a src attribute, just skip it
            continue
        # resolve relative links against the page URL
        img_url = urljoin(url, img_url)
        if is_valid(img_url):
            urls.append(img_url)
    return urls
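Collecting the URLs is only half the job; each one still has to be written to disk. A minimal download helper could look like the sketch below (the `filename_for`/`download` names and the `images` folder are my own choices, not from the original post):

```python
import os
import requests
from urllib.parse import urlparse

def filename_for(url, folder="images"):
    """Derive a local file path from the URL's last path component."""
    name = os.path.basename(urlparse(url).path) or "image"
    return os.path.join(folder, name)

def download(url, folder="images"):
    """Stream one image URL to disk and return the saved path."""
    os.makedirs(folder, exist_ok=True)
    path = filename_for(url, folder)
    r = requests.get(url, stream=True, timeout=30)
    r.raise_for_status()  # fail loudly on 4xx/5xx instead of saving an error page
    with open(path, "wb") as f:
        for chunk in r.iter_content(8192):
            f.write(chunk)
    return path
```

Streaming with `iter_content` avoids loading a whole image into memory, and `raise_for_status()` makes silent failures (a common reason for "no output") visible.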
While loop - image scraping
import requests
from bs4 import BeautifulSoup

# link to the first page - without `page=`
url = 'https://www.legacysurvey.org/viewer/data-for-radec/?ra=55.0502&dec=-18.5799&layer=ls-dr8&ralo=55.0337&rahi=55.0655&declo=-18.5892&dechi=-18.5714'

# only for information, not used in the url
page = 0

while True:
    print('---', page, '---')

    r = requests.get(url)
    soup = BeautifulSoup(r.content, "html.parser")

    # print every image tag's source (<img> uses `src`, not `href`)
    for link in soup.find_all("img"):
        print("<img src='%s'>" % link.get("src"))

    # fetch and print general data from the `title` class
    general_data = soup.find_all('div', {'class': 'title'})
    for item in general_data:
        print(item.contents[0].text)
        print(item.contents[1].text.replace('.', ''))
        print(item.contents[2].text)

    # link to the next page
    next_page = soup.find('a', {'class': 'next'})
    if next_page:
        url = next_page.get('href')
        page += 1
    else:
        break  # exit `while True`
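One pitfall in the pagination step above: `next_page.get('href')` may be a relative link, which `requests.get` cannot fetch on its own. A small helper sketch that resolves it against the current page (the `next_page_url` name is mine, not from the post):

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

def next_page_url(current_url, soup):
    """Return the absolute URL of the `a.next` link, or None when there is no next page."""
    a = soup.find('a', {'class': 'next'})
    if not a or not a.get('href'):
        return None
    # urljoin handles absolute hrefs unchanged and resolves relative ones
    return urljoin(current_url, a['href'])
```

In the while loop, `url = next_page.get('href')` would become `url = next_page_url(url, soup)`, with the `None` result triggering the `break`.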
I tried using both approaches to download the image links they output, but I couldn't get any output from anything I tried. Any help is greatly appreciated!
Solution
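One thing worth checking before scraping `<img>` tags at all: the Legacy Survey viewer appears to expose a direct image-cutout endpoint (`/viewer/cutout.jpg`) that takes `ra`, `dec`, `layer`, and `pixscale` query parameters, which would let you fetch the image for these coordinates without parsing any HTML. This is an assumption worth verifying against the current site; a sketch of building such a request URL (the `build_cutout_url` name is mine):

```python
from urllib.parse import urlencode

# assumed endpoint - verify against legacysurvey.org before relying on it
CUTOUT_BASE = "https://www.legacysurvey.org/viewer/cutout.jpg"

def build_cutout_url(ra, dec, layer="ls-dr8", pixscale=0.262):
    """Compose a cutout request URL for the given sky coordinates."""
    params = {"ra": ra, "dec": dec, "layer": layer, "pixscale": pixscale}
    return CUTOUT_BASE + "?" + urlencode(params)

# the resulting URL can be passed straight to requests.get(..., stream=True)
cutout = build_cutout_url(55.0502, -18.5790)
```

If the endpoint exists as assumed, this sidesteps the empty-output problem entirely: the `data-for-radec` page may not contain plain `<img src=...>` tags for `find_all("img")` to pick up.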