Problem description
I'm trying to scrape https://gmatclub.com/forum/decision-tracker.html. I can get most of what I want, but sometimes the script stops with this error:
ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
How can I fix this?
My code:
import requests

link = 'https://gmatclub.com/api/schools/v1/forum/app-tracker-latest-updates'
params = {
    'limit': 500, 'offset': 0, 'year': 'all'
}

with requests.Session() as con:
    con.headers["User-Agent"] = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.86 YaBrowser/21.3.0.740 Yowser/2.5 Safari/537.36"
    con.get("https://gmatclub.com/forum/decision-tracker.html")
    while True:
        endpoint = con.get(link, params=params).json()
        if not endpoint["statistics"]:
            break
        for item in endpoint["statistics"]:
            print(item['school_title'])
        params['offset'] += 499
Solution
One strategy is to repeat the request until you get a valid response from the server, for example:
import requests
from time import sleep

link = "https://gmatclub.com/api/schools/v1/forum/app-tracker-latest-updates"
params = {"limit": 500, "offset": 0, "year": "all"}

with requests.Session() as con:
    con.headers[
        "User-Agent"
    ] = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.86 YaBrowser/21.3.0.740 Yowser/2.5 Safari/537.36"
    con.get("https://gmatclub.com/forum/decision-tracker.html")
    while True:
        # repeat until we get a correct response from the server:
        while True:
            try:
                endpoint = con.get(link, params=params).json()
                break
            except requests.exceptions.ConnectionError:
                sleep(3)  # wait a little bit and try again
                continue
        if not endpoint["statistics"]:
            break
        for item in endpoint["statistics"]:
            print(item["school_title"])
        params["offset"] += 499
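
The inner retry loop above never gives up, so it would spin forever if the server stayed unreachable. One possible variation is to cap the number of attempts and lengthen the wait after each failure. The sketch below is only an illustration: the helper `call_with_retries` and its parameters (`retry_on`, `max_attempts`, `base_delay`, `sleep`) are hypothetical names, not part of requests, and the default values are arbitrary.

```python
import time


def call_with_retries(func, retry_on, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call func() until it succeeds or max_attempts is reached.

    retry_on is a tuple of exception classes that trigger a retry.
    The wait doubles after each failure: base_delay, 2 * base_delay, ...
    (hypothetical helper; names and defaults are assumptions).
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: re-raise the last error
            sleep(base_delay * 2 ** (attempt - 1))


if __name__ == "__main__":
    # Stand-in for a flaky request: fails twice, then returns a payload.
    state = {"calls": 0}

    def flaky():
        state["calls"] += 1
        if state["calls"] < 3:
            raise ConnectionError("Remote end closed connection without response")
        return {"statistics": []}

    # sleep is replaced with a no-op so the demo finishes immediately.
    print(call_with_retries(flaky, retry_on=(ConnectionError,), sleep=lambda s: None))
```

With requests, the inner `while True` loop could then be replaced by something like `endpoint = call_with_retries(lambda: con.get(link, params=params).json(), retry_on=(requests.exceptions.ConnectionError,))`. Note that `requests.exceptions.ConnectionError` is not a subclass of Python's built-in `ConnectionError`, so it has to be passed explicitly.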