BeautifulSoup and list indexing

Problem description

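I'm new to web scraping and am currently practicing some basics, like this one. I scraped the categories from the "th" tags and the rows from the "tr" tags and appended them to a couple of empty lists. get_text() returns the categories fine, but when I print the players, each entry has a numeric rank right before the first letter of the first name and the player's team abbreviation right after the last name. The 3 things I'm trying to do:

1) Output only each player's first and last name by slicing the list entries, but I can't think of a simple way to do that. There may be a quicker way inside the tags, such as targeting a class or calling soup.findAll again on the HTML, or something else I'm not aware of, but I don't currently know how to do it or what I'm missing.

2) Take the numeric rank in front of the name and append it to an empty list.

3) Take the trailing team abbreviation (the last 3 letters) and append it to an empty list.

Any suggestions would be much appreciated!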

from bs4 import BeautifulSoup as bs4
import requests
import pandas as pd
from time import sleep

players = []
categories = []

url ='https://www.espn.com/nba/stats/player/_/table/offensive/sort/avgPoints/dir/desc'
source = requests.get(url)
soup = bs4(source.text,'lxml')

for i in soup.findAll('th'):
    c = i.get_text()
    categories.append(c)

for i in soup.findAll('tr'):
    player = i.get_text()
    players.append(player)

players = players[1:51]

print(categories)
print(players)

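Solutions

In my opinion, an API is always the best option.

However, this can also be done with pandas .read_html(), which by default parses tables with lxml and falls back to BeautifulSoup with html5lib.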

import pandas as pd

url = 'https://www.espn.com/nba/stats/player/_/table/offensive/sort/avgPoints/dir/desc'

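# read_html returns one DataFrame per <table> found on the page
dfs = pd.read_html(url)
# Split the combined "Name" column into the name and the trailing capitals (team)
dfs[0][['Name','Team']] = dfs[0]['Name'].str.extract(r'^(.*?)([A-Z]+)$',expand=True)
# Join the player table and the stats table on their row index
df = dfs[0].join(dfs[1])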

Output:

print (df[['RK','Name','Team','POS']])

    RK                     Name  Team POS
0    1             James Harden   HOU  SG
1    2            Stephen Curry    GS  PG
2    3             Bradley Beal   WSH  SG
3    4               Trae Young   ATL  PG
4    5             Kevin Durant   BKN  SF
5    6              CJ McCollum   POR  SG
6    7             Kyrie Irving   BKN  PG
7    8             Jaylen Brown   BOS  SG
8    9    Giannis Antetokounmpo   MIL  PF
9   10             Jayson Tatum   BOS  PF
10  11           Damian Lillard   POR  PG
11  12              Luka Doncic   DAL  PG
12  13            Collin Sexton   CLE  PG
13  14              Paul George   LAC  SG
14  15           Brandon Ingram    NO  SF
15  16             Nikola Jokic   DEN   C
16  17             LeBron James   LAL  SF
17  18              Zach LaVine   CHI  SG
18  19           Christian Wood   HOU  PF
19  20            Kawhi Leonard   LAC  SF
20  21              Joel Embiid   PHI   C
21  22             Jerami Grant   DET  PF
22  23            Anthony Davis   LAL  PF
23  24             Jamal Murray   DEN  PG
24  25            Julius Randle    NY  PF
25  26          Malcolm Brogdon   IND  PG
26  27            Fred VanVleet   TOR  SG
27  28           Nikola Vucevic   ORL   C
28  28         Donovan Mitchell  UTAH  SG
29  30             Terry Rozier   CHA  PG
30  31             Devin Booker   PHX  SG
31  32          Khris Middleton   MIL  SF
32  33            Terrence Ross   ORL  SG
33  33           Victor Oladipo   IND  SG
34  35        Russell Westbrook   WSH  PG
35  36         Domantas Sabonis   IND  PF
36  36             De'Aaron Fox   SAC  PG
37  38          Zion Williamson    NO  SF
38  39            Tobias Harris   PHI  SF
39  40              Bam Adebayo   MIA   C
40  41            DeMar DeRozan    SA  SG
41  41         D'Angelo Russell   MIN  SG
42  43           Gordon Hayward   CHA  SF
43  44               Kyle Lowry   TOR  PG
44  44  Shai Gilgeous-Alexander   OKC  SG
45  46              Mike Conley  UTAH  PG
46  47            Malik Beasley   MIN  SG
47  48               RJ Barrett    NY  SG
48  49            Thomas Bryant   WSH   C
49  50            Pascal Siakam   TOR  PF
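
To see what the regular expression in the snippet above is doing, here is a small offline sketch that uses the standard `re` module instead of pandas. The sample strings are assumptions modelled on the output above, and the second pattern (with a leading `(\d+)` group for the rank) is a hypothetical extension for the raw row text described in the question, not part of the original answer.

```python
import re

# Same pattern as in the answer: lazy name, then trailing capitals as the team.
NAME_TEAM = re.compile(r'^(.*?)([A-Z]+)$')

# Hypothetical extension: leading digits as the rank, for raw row text.
RANK_NAME_TEAM = re.compile(r'^(\d+)(.*?)([A-Z]+)$')


def split_name_team(text):
    """Return (name, team), or None if the text does not match."""
    match = NAME_TEAM.match(text)
    return match.groups() if match else None


def split_row(text):
    """Return (rank, name, team), or None if the text does not match."""
    match = RANK_NAME_TEAM.match(text)
    if not match:
        return None
    rank, name, team = match.groups()
    return int(rank), name, team


print(split_name_team('James HardenHOU'))
print(split_row('1James HardenHOU'))
print(split_row('6CJ McCollumPOR'))
```

One caveat: the pattern treats every trailing capital letter as part of the team, so a name that itself ends in capitals (a "III" suffix, say) would bleed into the team group. Reading the name and team from separate tags avoids that.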

Is this what you're looking for?

import requests
from bs4 import BeautifulSoup
from tabulate import tabulate

url = "https://www.espn.com/nba/stats/player/_/table/offensive/sort/avgPoints/dir/desc"
soup = BeautifulSoup(requests.get(url).text,"html.parser")

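# r // 2 turns the 1-based cell position into a running row number
# (this assumes the cells alternate: rank cell, then name cell)
table_data = [
    [r // 2,i.find("a").getText(),i.find("span").getText()] for r,i in
    enumerate(soup.find_all("td",class_="Table__TD"),start=1)
    if i.find("a") and i.find("span")
]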

print(tabulate(table_data,headers=["Rank","Name","Team"],tablefmt="pretty"))

Output:

+------+-------------------------+------+
| Rank |          Name           | Team |
+------+-------------------------+------+
|  1   |      James Harden       | HOU  |
|  2   |      Stephen Curry      |  GS  |
|  3   |      Bradley Beal       | WSH  |
|  4   |       Trae Young        | ATL  |
|  5   |      Kevin Durant       | BKN  |
|  6   |       CJ McCollum       | POR  |
|  7   |      Kyrie Irving       | BKN  |
|  8   |      Jaylen Brown       | BOS  |
|  9   |  Giannis Antetokounmpo  | MIL  |
|  10  |      Jayson Tatum       | BOS  |
|  11  |     Damian Lillard      | POR  |
|  12  |       Luka Doncic       | DAL  |
|  13  |      Collin Sexton      | CLE  |
|  14  |       Paul George       | LAC  |
|  15  |     Brandon Ingram      |  NO  |
|  16  |      Nikola Jokic       | DEN  |
|  17  |      LeBron James       | LAL  |
|  18  |       Zach LaVine       | CHI  |
|  19  |     Christian Wood      | HOU  |
|  20  |      Kawhi Leonard      | LAC  |
|  21  |       Joel Embiid       | PHI  |
|  22  |      Jerami Grant       | DET  |
|  23  |      Anthony Davis      | LAL  |
|  24  |      Jamal Murray       | DEN  |
|  25  |      Julius Randle      |  NY  |
|  26  |     Malcolm Brogdon     | IND  |
|  27  |      Fred VanVleet      | TOR  |
|  28  |     Nikola Vucevic      | ORL  |
|  29  |    Donovan Mitchell     | UTAH |
|  30  |      Terry Rozier       | CHA  |
|  31  |      Devin Booker       | PHX  |
|  32  |     Khris Middleton     | MIL  |
|  33  |      Terrence Ross      | ORL  |
|  34  |     Victor Oladipo      | IND  |
|  35  |    Russell Westbrook    | WSH  |
|  36  |    Domantas Sabonis     | IND  |
|  37  |      De'Aaron Fox       | SAC  |
|  38  |     Zion Williamson     |  NO  |
|  39  |      Tobias Harris      | PHI  |
|  40  |       Bam Adebayo       | MIA  |
|  41  |      DeMar DeRozan      |  SA  |
|  42  |    D'Angelo Russell     | MIN  |
|  43  |     Gordon Hayward      | CHA  |
|  44  |       Kyle Lowry        | TOR  |
|  45  | Shai Gilgeous-Alexander | OKC  |
|  46  |       Mike Conley       | UTAH |
|  47  |      Malik Beasley      | MIN  |
|  48  |       RJ Barrett        |  NY  |
|  49  |      Thomas Bryant      | WSH  |
|  50  |      Pascal Siakam      | TOR  |
+------+-------------------------+------+
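
The `r // 2` in the comprehension may look odd. Here is a sketch of the idea, assuming (as inferred from the code, not checked against the live page) that the cells alternate between a rank cell and a name cell. Counting cells from 1, the name cells land on positions 2, 4, 6 and so on, so integer division by 2 yields 1, 2, 3. The `cells` list below is a hypothetical stand-in for the `td` elements.

```python
# Hypothetical stand-in for the td cells: rank cell, then name cell, per row.
# A name cell is modelled as a (name, team) tuple; a rank cell as a plain string.
cells = [
    '1', ('James Harden', 'HOU'),
    '2', ('Stephen Curry', 'GS'),
    '3', ('Bradley Beal', 'WSH'),
]

table_data = [
    [position // 2, cell[0], cell[1]]
    for position, cell in enumerate(cells, start=1)
    if isinstance(cell, tuple)  # keep only the name cells
]

print(table_data)
```

Note that this produces a running count rather than the rank printed on the page, so tied players (both shown as 28 in the pandas output above, for example) get consecutive numbers here.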

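Always ask yourself: is there an easier way?

Yes, there is, and you should go for it :)

Before you scrape, first check whether you really need to pull the content from the website, or whether there is an API that provides the information in a well-structured form.

Example of requesting the API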

import requests
import pandas as pd

url = "https://site.web.api.espn.com/apis/common/v3/sports/basketball/nba/statistics/byathlete?region=us&lang=en&contentorigin=espn&isqualified=true&page=1&limit=50&sort=offensive.avgPoints%3Adesc&season=2021&seasontype=2"
headers = {"user-agent": "Mozilla/5.0"}

response = requests.get(url,headers=headers)
response.raise_for_status()

ranking=[]

for i,player in enumerate(response.json()['athletes'],start=1):
    rank = i
    name = player['athlete']['displayName']
    team = player['athlete']['teamShortName']
    category = player['athlete']['position']['abbreviation']
    ranking.append({'rank':rank,'name':name,'team':team,'category':category})

df = pd.DataFrame(ranking)
df

Output DataFrame

rank                     name  team category
    1             James Harden   HOU       SG
    2            Stephen Curry    GS       PG
    3             Bradley Beal   WSH       SG
    4               Trae Young   ATL       PG
    5             Kevin Durant   BKN       SF
    6              CJ McCollum   POR       SG
    7             Kyrie Irving   BKN       PG
    8             Jaylen Brown   BOS       SG
    9    Giannis Antetokounmpo   MIL       PF
   10             Jayson Tatum   BOS       PF
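
Since the endpoint above may change or be unavailable, here is an offline sketch of the same extraction step. The `sample` payload is hypothetical: it only mirrors the keys that the code above reads (`athletes`, `athlete`, `displayName`, `teamShortName`, `position`, `abbreviation`), and the real response presumably contains many more fields. Using `.get()` for the nested position is a defensive choice of this sketch, not something the answer does.

```python
import json

# Hypothetical payload, shaped after the keys used in the answer's loop.
sample = json.loads('''
{
  "athletes": [
    {"athlete": {"displayName": "James Harden", "teamShortName": "HOU",
                 "position": {"abbreviation": "SG"}}},
    {"athlete": {"displayName": "Stephen Curry", "teamShortName": "GS",
                 "position": {"abbreviation": "PG"}}},
    {"athlete": {"displayName": "Example Player", "teamShortName": "XYZ"}}
  ]
}
''')


def build_ranking(payload):
    """Turn the payload into a list of row dicts, ranked by list order."""
    ranking = []
    for rank, entry in enumerate(payload['athletes'], start=1):
        athlete = entry['athlete']
        ranking.append({
            'rank': rank,
            'name': athlete['displayName'],
            'team': athlete['teamShortName'],
            # Fall back to None when no position is present.
            'category': athlete.get('position', {}).get('abbreviation'),
        })
    return ranking


ranking = build_ranking(sample)
for row in ranking:
    print(row)
```

The resulting list of dicts can be passed to `pd.DataFrame(ranking)` in the same way as in the answer.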

But to answer your question

You can also do this with BeautifulSoup, but I think it is more error-prone:

from bs4 import BeautifulSoup
import requests
import pandas as pd

data = []

url ='https://www.espn.com/nba/stats/player/_/table/offensive/sort/avgPoints/dir/desc'
source = requests.get(url)
soup = BeautifulSoup(source.text,'lxml')

for row in soup.select('tr'):
    rank_cell = row.select_one('td')
    name_link = row.select_one('div > a')
    team_span = row.select_one('div > span')
    # Skip header rows and any row without a player link and team span,
    # so values from a previous row are never reused.
    if not (rank_cell and name_link and team_span):
        continue
    data.append({
        'rank': rank_cell.get_text(),
        'player': name_link.get_text(),
        'team': team_span.get_text(),
    })

pd.DataFrame(data)

If you don't want to use CSS selectors, you can also do it this way

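for row in soup.find_all('tr'):
    rank_cell = row.find('td')
    name_link = row.find('a')
    team_span = row.find('span')
    # Skip rows that lack any of the three parts (headers, stats-only rows)
    if not (rank_cell and name_link and team_span):
        continue
    # Append to the same data list defined above
    data.append({'rank': rank_cell.get_text(),
                 'player': name_link.get_text(),
                 'team': team_span.get_text()})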