How do I scrape data from the URLs in a list of URLs scraped with Python?

Problem description

I am trying to use BeautifulSoup4 in Orange to scrape data from a list of URLs that were themselves scraped from the same website.

When I set the URL manually, I managed to scrape the data from a single page:

from urllib.request import urlopen
from bs4 import BeautifulSoup
import requests
import csv
import re

url = "https://data.ushja.org/awards-standings/zone-points.aspx?year=2021&zone=1&section=1901"
req = requests.get(url)
soup = BeautifulSoup(req.text, "html.parser")

rank = soup.find("table",class_="table-standings-body")
for child in rank.children:
    print(url,child)

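As a side note, if what is needed from that single page is the cell text rather than the raw child nodes, a small variation of the same snippet (assuming the standings table uses ordinary tr/td rows, which is my reading of the page rather than something stated above) collects each row as a list of strings:

import requests
from bs4 import BeautifulSoup

url = "https://data.ushja.org/awards-standings/zone-points.aspx?year=2021&zone=1&section=1901"
soup = BeautifulSoup(requests.get(url).text, "html.parser")

rank = soup.find("table", class_="table-standings-body")
rows = []
for tr in rank.find_all("tr"):
    # keep only rows that actually contain data cells
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    if cells:
        rows.append(cells)
print(rows)
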
And I have been able to scrape the list of URLs I need:

from urllib.request import urlopen
from bs4 import BeautifulSoup
import requests
import csv
import re

url = "https://data.ushja.org/awards-standings/zones.aspx?year=2021&zone=1"
req = requests.get(url)
soup = BeautifulSoup(req.text, "html.parser")

link = soup.find('div', class_='contentSection')

for url_list in link.find_all('a'):
    print(url_list.get('href'))

But so far I have not been able to combine the two so that I can scrape data from the URLs in that list. Is nesting the for loops the only way to do this, and if so, how? Or how else should I go about it?

Sorry if this is a silly question, but I only started experimenting with Python and web scraping yesterday, and I have not been able to work this out from similar topics.

Solution

Try:

import requests
import pandas as pd
from bs4 import BeautifulSoup

url = "https://data.ushja.org/awards-standings/zones.aspx?year=2021&zone=1"
req = requests.get(url)
soup = BeautifulSoup(req.text, "html.parser")

# get all links
url_list = []
for a in soup.find("div", class_="contentSection").find_all("a"):
    # the HTML parser decodes the "&sect" in "&section=" to "§"; undo that so the URL works
    url_list.append(a["href"].replace("§", "&sect"))

# get all data from URLs
all_data = []
for url in url_list:
    print(url)

    req = requests.get(url)
    soup = BeautifulSoup(req.text, "html.parser")

    h2 = soup.h2
    sub = h2.find_next("p")

    for tr in soup.select("tr:has(td)"):
        all_data.append(
            [
                h2.get_text(strip=True),
                sub.get_text(strip=True),
                *[td.get_text(strip=True) for td in tr.select("td")],
            ]
        )

# save data to CSV
df = pd.DataFrame(
    all_data,
    columns=[
        "title",
        "sub_title",
        "Rank",
        "Horse / Owner",
        "Points",
        "Total Comps",
    ],
)
print(df)
df.to_csv("data.csv", index=None)

This iterates over all the URLs and saves all the data to data.csv.
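
As a side note, if the script is run often it can be made a bit gentler on the server. This is only a sketch of an optional tweak, not part of the answer above: it assumes url_list is the list built in the code above, and the one-second pause is an arbitrary choice. The download loop could reuse a single requests.Session and wait briefly between requests:

import time
import requests
from bs4 import BeautifulSoup

session = requests.Session()              # reuse one connection for all requests
for url in url_list:                      # url_list as built in the answer above
    req = session.get(url)
    req.raise_for_status()                # stop early on a bad HTTP status
    soup = BeautifulSoup(req.text, "html.parser")
    # ... same parsing of h2 / sub / table rows as in the answer above ...
    time.sleep(1)                         # small pause between page loads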
