问题描述
我试图在PGA统计信息网站上下载多个页面的HTML已有多年,所以我可以从中删除这些统计信息。但是,它不会将文件下载到目录中。我已经呆了几个小时了。谁能告诉我这是怎么回事?
import requests
from bs4 import BeautifulSoup
import os
import urllib.request
import gevent
url_stub = "http://www.pgatour.com/stats/stat.%s.%s.html" #stat id,year
category_url_stub = 'http://www.pgatour.com/stats/categories.%s.html'
category_labels = ['RPTS_INQ','ROTT_INQ','RAPP_INQ','RARG_INQ','RPUT_INQ','RSCR_INQ','RSTR_INQ','RMNY_INQ']
pga_tour_base_url = "http://www.pgatour.com"
def gather_pages(url,filename):
print(filename)
urllib.request.urlretrieve(url,filename)
def gather_html():
stat_ids = []
for category in category_labels:
category_url = category_url_stub % (category)
page = requests.get(category_url)
html = BeautifulSoup(page.text.replace('\n',''),'html.parser')
for table in html.find_all("div",class_="table-content"):
for link in table.find_all("a"):
stat_ids.append(link['href'].split('.')[1])
starting_year = 2015 #page in order to see which years we have info for
for stat_id in stat_ids:
url = url_stub % (stat_id,starting_year)
page = requests.get(url)
html = BeautifulSoup(page.text.replace('\n','html.parser')
stat = html.find("div",class_="parsys mainParsys").find('h3').text
print(stat)
directory = "stats_html/%s" % stat.replace('/',' ')
#need to replace to avoid
if not os.path.exists(directory):
os.makedirs(directory)
years = []
for option in html.find("select",class_="statistics-details-select").find_all("option"):
year = option['value']
if year not in years:
years.append(year)
url_filenames = []
for year in years:
url = url_stub % (stat_id,year)
filename = "%s/%s.html" % (directory,year)
if not os.path.isfile(filename): #this check saves time if you've already downloaded the page
url_filenames.append((url,filename))
jobs = [gevent.spawn(gather_pages,pair[0],pair[1]) for pair in url_filenames]
gevent.joinall(jobs)
解决方法
暂无找到可以解决该程序问题的有效方法,小编努力寻找整理中!
如果你已经找到好的解决方法,欢迎将解决方案带上本链接一起发送给小编。
小编邮箱:dio#foxmail.com (将#修改为@)