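Problem description
I'm new to web scraping and am practicing some basics with this example. I scrape the categories from the "th" tags and the players from the "tr" tags and append them to a couple of empty lists. get_text() returns the categories just fine, but when I print the players, each entry has the numeric rank in front of the first letter of the first name and the team abbreviation right after the last name. Three things I'm trying to do:
1) Output only each player's first and last name. I could slice the strings in the list, but I can't think of a simple way to do that. There may be a quicker way inside the tags, such as selecting by class or calling soup.findAll again on the HTML, but I don't know what I'm missing.
2) Take the numeric rank in front of the name and append it to an empty list.
3) Take the last 3 abbreviation letters and append them to an empty list.
Any suggestions would be appreciated!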
from bs4 import BeautifulSoup as bs4
import requests
import pandas as pd
from time import sleep

players = []
categories = []

url = 'https://www.espn.com/nba/stats/player/_/table/offensive/sort/avgPoints/dir/desc'
source = requests.get(url)
soup = bs4(source.text, 'lxml')

for i in soup.findAll('th'):
    c = i.get_text()
    categories.append(c)

for i in soup.findAll('tr'):
    player = i.get_text()
    players.append(player)

players = players[1:51]

print(categories)
print(players)
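Solution
In my opinion, an API is always the best option.
However, this can also be done with pandas .read_html() (under the hood it parses the tables with lxml or, as a fallback, BeautifulSoup).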
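import pandas as pd

url = 'https://www.espn.com/nba/stats/player/_/table/offensive/sort/avgPoints/dir/desc'
# read_html returns one DataFrame per table found; this code uses the first two
dfs = pd.read_html(url)
# Split the fused Name cell into the name and the trailing run of capitals (team)
dfs[0][['Name', 'Team']] = dfs[0]['Name'].str.extract(r'^(.*?)([A-Z]+)$', expand=True)
# Put the player columns and the stats columns side by side on the row index
df = dfs[0].join(dfs[1])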
Output:
print (df[['RK','Name','Team','POS']])
RK Name Team POS
0 1 James Harden HOU SG
1 2 Stephen Curry GS PG
2 3 Bradley Beal WSH SG
3 4 Trae Young ATL PG
4 5 Kevin Durant BKN SF
5 6 CJ McCollum POR SG
6 7 Kyrie Irving BKN PG
7 8 Jaylen Brown BOS SG
8 9 Giannis Antetokounmpo MIL PF
9 10 Jayson Tatum BOS PF
10 11 Damian Lillard POR PG
11 12 Luka Doncic DAL PG
12 13 Collin Sexton CLE PG
13 14 Paul George LAC SG
14 15 Brandon Ingram NO SF
15 16 Nikola Jokic DEN C
16 17 LeBron James LAL SF
17 18 Zach LaVine CHI SG
18 19 Christian Wood HOU PF
19 20 Kawhi Leonard LAC SF
20 21 Joel Embiid PHI C
21 22 Jerami Grant DET PF
22 23 Anthony Davis LAL PF
23 24 Jamal Murray DEN PG
24 25 Julius Randle NY PF
25 26 Malcolm Brogdon IND PG
26 27 Fred VanVleet TOR SG
27 28 Nikola Vucevic ORL C
28 28 Donovan Mitchell UTAH SG
29 30 Terry Rozier CHA PG
30 31 Devin Booker PHX SG
31 32 Khris Middleton MIL SF
32 33 Terrence Ross ORL SG
33 33 Victor Oladipo IND SG
34 35 Russell Westbrook WSH PG
35 36 Domantas Sabonis IND PF
36 36 De'Aaron Fox SAC PG
37 38 Zion Williamson NO SF
38 39 Tobias Harris PHI SF
39 40 Bam Adebayo MIA C
40 41 DeMar DeRozan SA SG
41 41 D'Angelo Russell MIN SG
42 43 Gordon Hayward CHA SF
43 44 Kyle Lowry TOR PG
44 44 Shai Gilgeous-Alexander OKC SG
45 46 Mike Conley UTAH PG
46 47 Malik Beasley MIN SG
47 48 RJ Barrett NY SG
48 49 Thomas Bryant WSH C
49 50 Pascal Siakam TOR PF
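The pattern passed to str.extract can be tried without touching the network. The sketch below uses only the standard re module on a few hand-written strings. It assumes the Name cell reads like "James HardenHOU" and the raw row text like "1James HardenHOU", as described in the question; the helper names split_name_team and split_row_text are made up for illustration.

```python
import re

# Same pattern as in the str.extract call: a lazy name group followed by a
# trailing run of capital letters (the team abbreviation).
NAME_TEAM = re.compile(r"^(.*?)([A-Z]+)$")

# Assumed variant for the raw row text from the question, which also
# starts with the numeric rank.
RANK_NAME_TEAM = re.compile(r"^(\d+)(.*?)([A-Z]+)$")


def split_name_team(cell):
    """Return (name, team) for a fused 'NameTEAM' string, or None if no match."""
    match = NAME_TEAM.match(cell)
    return match.groups() if match else None


def split_row_text(text):
    """Return (rank, name, team) for a fused 'RankNameTEAM' string, or None."""
    match = RANK_NAME_TEAM.match(text)
    return match.groups() if match else None


for cell in ["James HardenHOU", "Stephen CurryGS", "Donovan MitchellUTAH"]:
    print(split_name_team(cell))

print(split_row_text("1James HardenHOU"))
print(split_row_text("10Jayson TatumBOS"))

# Hypothetical edge case: capitals at the end of the name are swallowed
# by the team group.
print(split_name_team("Robert Williams IIIBOS"))
```

The last call returns ('Robert Williams ', 'IIIBOS'), which shows the limit of this approach: a name that itself ends in capital letters cannot be told apart from the team abbreviation by the pattern alone.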
Is this what you're after?
import requests
from bs4 import BeautifulSoup
from tabulate import tabulate

url = "https://www.espn.com/nba/stats/player/_/table/offensive/sort/avgPoints/dir/desc"
soup = BeautifulSoup(requests.get(url).text, "html.parser")

# Keep only the cells that hold both a player link and a team span.
# r // 2 turns the running cell count into a rank, which relies on rank
# cells and name cells alternating one-to-one.
table_data = [
    [r // 2, i.find("a").getText(), i.find("span").getText()]
    for r, i in enumerate(soup.find_all("td", class_="Table__TD"), start=1)
    if i.find("a") and i.find("span")
]
print(tabulate(table_data, headers=["Rank", "Name", "Team"], tablefmt="pretty"))
Output:
| Rank | Name | Team |
+------+-------------------------+------+
| 1 | James Harden | HOU |
| 2 | Stephen Curry | GS |
| 3 | Bradley Beal | WSH |
| 4 | Trae Young | ATL |
| 5 | Kevin Durant | BKN |
| 6 | CJ McCollum | POR |
| 7 | Kyrie Irving | BKN |
| 8 | Jaylen Brown | BOS |
| 9 | Giannis Antetokounmpo | MIL |
| 10 | Jayson Tatum | BOS |
| 11 | Damian Lillard | POR |
| 12 | Luka Doncic | DAL |
| 13 | Collin Sexton | CLE |
| 14 | Paul George | LAC |
| 15 | Brandon Ingram | NO |
| 16 | Nikola Jokic | DEN |
| 17 | LeBron James | LAL |
| 18 | Zach LaVine | CHI |
| 19 | Christian Wood | HOU |
| 20 | Kawhi Leonard | LAC |
| 21 | Joel Embiid | PHI |
| 22 | Jerami Grant | DET |
| 23 | Anthony Davis | LAL |
| 24 | Jamal Murray | DEN |
| 25 | Julius Randle | NY |
| 26 | Malcolm Brogdon | IND |
| 27 | Fred VanVleet | TOR |
| 28 | Nikola Vucevic | ORL |
| 29 | Donovan Mitchell | UTAH |
| 30 | Terry Rozier | CHA |
| 31 | Devin Booker | PHX |
| 32 | Khris Middleton | MIL |
| 33 | Terrence Ross | ORL |
| 34 | Victor Oladipo | IND |
| 35 | Russell Westbrook | WSH |
| 36 | Domantas Sabonis | IND |
| 37 | De'Aaron Fox | SAC |
| 38 | Zion Williamson | NO |
| 39 | Tobias Harris | PHI |
| 40 | Bam Adebayo | MIA |
| 41 | DeMar DeRozan | SA |
| 42 | D'Angelo Russell | MIN |
| 43 | Gordon Hayward | CHA |
| 44 | Kyle Lowry | TOR |
| 45 | Shai Gilgeous-Alexander | OKC |
| 46 | Mike Conley | UTAH |
| 47 | Malik Beasley | MIN |
| 48 | RJ Barrett | NY |
| 49 | Thomas Bryant | WSH |
| 50 | Pascal Siakam | TOR |
+------+-------------------------+------+
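Always ask yourself: is there an easier way?
Yes, and you should take it :)
Before scraping, first check whether you really need to pull the content out of the web page, or whether there is an API that provides the information in a well-structured form.
Example of requesting the API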
import requests
import pandas as pd

url = "https://site.web.api.espn.com/apis/common/v3/sports/basketball/nba/statistics/byathlete?region=us&lang=en&contentorigin=espn&isqualified=true&page=1&limit=50&sort=offensive.avgPoints%3Adesc&season=2021&seasontype=2"
headers = {"user-agent": "Mozilla/5.0"}
response = requests.get(url, headers=headers)
response.raise_for_status()

ranking = []
for i, player in enumerate(response.json()['athletes'], start=1):
    rank = i
    name = player['athlete']['displayName']
    team = player['athlete']['teamShortName']
    category = player['athlete']['position']['abbreviation']
    ranking.append({'rank': rank, 'name': name, 'team': team, 'category': category})

df = pd.DataFrame(ranking)
df
Output DataFrame
rank name team category
1 James Harden HOU SG
2 Stephen Curry GS PG
3 Bradley Beal WSH SG
4 Trae Young ATL PG
5 Kevin Durant BKN SF
6 CJ McCollum POR SG
7 Kyrie Irving BKN PG
8 Jaylen Brown BOS SG
9 Giannis Antetokounmpo MIL PF
10 Jayson Tatum BOS PF
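The loop in the API example indexes straight into the decoded JSON, so a missing key would raise a KeyError. As a rough sketch, the same extraction can be written a little more defensively and tried offline against a hand-written payload. The key names are the ones used in the code above; the function name extract_ranking and the sample values are assumptions for illustration, and a real response may carry many more fields.

```python
def extract_ranking(payload):
    """Build one dict per athlete from an already-decoded JSON payload."""
    ranking = []
    for rank, entry in enumerate(payload.get("athletes", []), start=1):
        athlete = entry.get("athlete", {})
        ranking.append({
            "rank": rank,
            "name": athlete.get("displayName"),
            "team": athlete.get("teamShortName"),
            # position may be absent, so fall back to an empty dict
            "category": athlete.get("position", {}).get("abbreviation"),
        })
    return ranking


# Hand-written stand-in for response.json(); not real API output.
sample_payload = {
    "athletes": [
        {"athlete": {"displayName": "James Harden", "teamShortName": "HOU",
                     "position": {"abbreviation": "SG"}}},
        {"athlete": {"displayName": "Stephen Curry", "teamShortName": "GS"}},
    ]
}

for row in extract_ranking(sample_payload):
    print(row)
```

With this variant a missing field shows up as None in the resulting row instead of stopping the whole loop. Note that rank here is just the position in the list, as in the code above.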
But to answer your question:
You can also do this with BeautifulSoup, but I think it is more error-prone:
from bs4 import BeautifulSoup
import requests
import pandas as pd

data = []

url = 'https://www.espn.com/nba/stats/player/_/table/offensive/sort/avgPoints/dir/desc'
source = requests.get(url)
soup = BeautifulSoup(source.text, 'lxml')

for i in soup.select('tr')[1:]:
    if i.select_one('td'):
        rank = i.select_one('td').get_text()
    if i.select_one('div > a'):
        player = i.select_one('div > a').get_text()
    if i.select_one('div > span'):
        team = i.select_one('div > span').get_text()
        data.append({'rank': rank, 'player': player, 'team': team})

pd.DataFrame(data)
If you don't want to use CSS selectors, you can also do it like this:
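for i in soup.find_all('tr')[1:]:
    if i.find('td'):
        rank = i.find('td').get_text()
    if i.find('a'):
        player = i.find('a').get_text()
    if i.find('span'):
        team = i.find('span').get_text()
        # collect the row in the data list, as in the CSS-selector loop
        data.append({'rank': rank, 'player': player, 'team': team})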
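One reason the BeautifulSoup loops can be error-prone is that rank, player and team keep their values from the previous row whenever one of the lookups finds nothing. A possible way around that, sketched below, is to look up all three tags first and skip the row unless every one is present. The HTML is a hypothetical snippet that only imitates the shape the selectors expect (the real ESPN markup may differ), and parse_rows is a made-up helper name.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for offline testing; not copied from the ESPN page.
SAMPLE_HTML = """
<table>
  <tr><th>RK</th><th>Name</th></tr>
  <tr><td>1</td><td><div><a href="#">James Harden</a><span>HOU</span></div></td></tr>
  <tr><td>2</td><td><div><a href="#">Stephen Curry</a><span>GS</span></div></td></tr>
  <tr><td>30.1</td><td>25.4</td></tr>
</table>
"""


def parse_rows(html):
    """Return one dict per row that has a rank cell, a player link and a team span."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for tr in soup.select("tr")[1:]:  # skip the header row
        rank_cell = tr.select_one("td")
        link = tr.select_one("div > a")
        team_tag = tr.select_one("div > span")
        # Skip incomplete rows so values from an earlier row are never reused.
        if rank_cell is None or link is None or team_tag is None:
            continue
        rows.append({
            "rank": rank_cell.get_text(strip=True),
            "player": link.get_text(strip=True),
            "team": team_tag.get_text(strip=True),
        })
    return rows


for row in parse_rows(SAMPLE_HTML):
    print(row)
```

The third data row in the sample has no link or span, so it is dropped rather than being paired with the name and team of the row before it.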