Problem description
Sorry, I'm a beginner at Python and web scraping.
I'm scraping wugniu.com to extract the readings of characters I enter. I made a list of 10,273 characters, format each one into a URL to get the page showing its readings, then use the Requests module to fetch the page source and Beautiful Soup to collect all the audio IDs (their strings contain the input character — I can't use the text that appears in the table because it's rendered as SVGs). I then try to write each character and its readings to out.txt.
# -*- coding: utf-8 -*-
import requests, time
from bs4 import BeautifulSoup
from requests.packages.urllib3.exceptions import InsecureRequestWarning

requests.packages.urllib3.disable_warnings(InsecureRequestWarning)

characters = [
    # characters go here
]

output = open("out.txt", "a", encoding="utf-8")

tic = time.perf_counter()
for char in characters:
    # Characters from the list are formatted into the url
    url = "https://wugniu.com/search?char=%s&table=wenzhou" % char
    page = requests.get(url, verify=False)
    soup = BeautifulSoup(page.text, 'html.parser')
    for audio_tag in soup.find_all('audio'):
        audio_id = audio_tag.get('id').replace("0-", "")
        output.write(char)
        output.write(" ")
        output.write(audio_id)
        output.write("\n")
    print(char)
    time.sleep(60)
output.close()
toc = time.perf_counter()
duration = int(toc - tic)
print("Took %d seconds" % duration)
out.txt is the file I'm trying to write the results to. I timed the run to gauge performance.
However, after roughly 50 iterations I get this in cmd:
Traceback (most recent call last):
  File "C:\Users\[user]\Documents\wenzhou-ime\env\lib\site-packages\urllib3\connection.py", line 169, in _new_conn
    conn = connection.create_connection(
  File "C:\Users\[user]\Documents\wenzhou-ime\env\lib\site-packages\urllib3\util\connection.py", line 96, in create_connection
    raise err
  File "C:\Users\[user]\Documents\wenzhou-ime\env\lib\site-packages\urllib3\util\connection.py", line 86, in create_connection
    sock.connect(sa)
TimeoutError: [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\[user]\Documents\wenzhou-ime\env\lib\site-packages\urllib3\connectionpool.py", line 699, in urlopen
    httplib_response = self._make_request(
  File "C:\Users\[user]\Documents\wenzhou-ime\env\lib\site-packages\urllib3\connectionpool.py", line 382, in _make_request
    self._validate_conn(conn)
  File "C:\Users\[user]\Documents\wenzhou-ime\env\lib\site-packages\urllib3\connectionpool.py", line 1010, in _validate_conn
    conn.connect()
  File "C:\Users\[user]\Documents\wenzhou-ime\env\lib\site-packages\urllib3\connection.py", line 353, in connect
    conn = self._new_conn()
  File "C:\Users\[user]\Documents\wenzhou-ime\env\lib\site-packages\urllib3\connection.py", line 181, in _new_conn
    raise NewConnectionError(
urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPSConnection object at 0x000002035D5F9040>: Failed to establish a new connection: [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\[user]\Documents\wenzhou-ime\env\lib\site-packages\requests\adapters.py", line 439, in send
    resp = conn.urlopen(
  File "C:\Users\[user]\Documents\wenzhou-ime\env\lib\site-packages\urllib3\connectionpool.py", line 755, in urlopen
    retries = retries.increment(
  File "C:\Users\[user]\Documents\wenzhou-ime\env\lib\site-packages\urllib3\util\retry.py", line 573, in increment
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='wugniu.com', port=443): Max retries exceeded with url: /search?char=%E8%87%B4&table=wenzhou (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x000002035D5F9040>: Failed to establish a new connection: [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\[user]\Documents\wenzhou-ime\test.py", line 3282, in <module>
    page = requests.get(url, verify=False)
  File "C:\Users\[user]\Documents\wenzhou-ime\env\lib\site-packages\requests\api.py", line 76, in get
    return request('get', url, params=params, **kwargs)
  File "C:\Users\[user]\Documents\wenzhou-ime\env\lib\site-packages\requests\api.py", line 61, in request
    return session.request(method=method, url=url, **kwargs)
  File "C:\Users\[user]\Documents\wenzhou-ime\env\lib\site-packages\requests\sessions.py", line 542, in request
    resp = self.send(prep, **send_kwargs)
  File "C:\Users\[user]\Documents\wenzhou-ime\env\lib\site-packages\requests\sessions.py", line 655, in send
    r = adapter.send(request, **kwargs)
  File "C:\Users\[user]\Documents\wenzhou-ime\env\lib\site-packages\requests\adapters.py", line 516, in send
    raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPSConnectionPool(host='wugniu.com', port=443): Max retries exceeded with url: /search?char=%E8%87%B4&table=wenzhou (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x000002035D5F9040>: Failed to establish a new connection: [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond'))
I tried to work around this by adding time.sleep(60), but the error still occurs. When I wrote this script yesterday I was able to run it with a list of up to 1,500 characters without errors. Can someone help me fix this? Thanks.
Solution
This is completely normal and expected behavior — it's a chicken-and-egg situation. Imagine opening Firefox, loading google.com, closing the browser, and repeating that cycle over and over. That pattern looks like a DDoS attack, and any modern server will block your requests and flag your IP, because it genuinely hurts their bandwidth!
The logical and correct approach is to reuse the same session instead of creating one connection after another, because a single keep-alive session won't show up under the server's TCP SYN-flood heuristics. Look up the standard TCP flags if you're curious.
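As a minimal sketch of that idea: the helper below builds one reusable Session so every request shares a pooled keep-alive connection. The name make_session and the retry/backoff settings are my own illustration — the retry mounting goes beyond what this answer describes, but it directly addresses the MaxRetryError in the traceback.

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def make_session():
    """Build one reusable Session: connections are pooled (keep-alive)
    and transient connection failures are retried with backoff."""
    s = requests.Session()
    s.verify = False  # matches the original script; only disable verification if you trust the host
    retry = Retry(total=3, backoff_factor=2,
                  status_forcelist=[429, 500, 502, 503])
    s.mount('https://', HTTPAdapter(max_retries=retry))
    return s
```

You would then call s.get(...) on this one session inside the loop, instead of requests.get(...), which opens a fresh connection every time.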
On another note, you really should use a context manager instead of having to remember to close your file yourself.
Example:
output = open("out.txt", "a", encoding="utf-8")
output.close()
can be handled with with, like this:
with open('out.txt', 'w', newline='', encoding='utf-8') as output:
    # here you can do your operation.
As soon as you exit the with block, your file is closed automatically!
Also consider using the newer format string style instead of the old one:
url = "https://wugniu.com/search?char=%s&table=wenzhou" % char
can be:
"https://wugniu.com/search?char={}&table=wenzhou".format(char)
I'm not going to write production-grade code here; I've kept it simple so the concept is easy to understand.
Note how I pick the element I need and how I write it to the file. The speed difference between lxml and html.parser can be found here:
import requests
from bs4 import BeautifulSoup
import urllib3

urllib3.disable_warnings()


def main(url, chars):
    with open('result.txt', 'w', encoding='utf-8') as f, requests.Session() as req:
        req.verify = False
        for char in chars:
            print(f"Extracting {char}")
            r = req.get(url.format(char))
            soup = BeautifulSoup(r.text, 'lxml')
            target = [x['id'][2:] for x in soup.select('audio[id^="0-"]')]
            print(target)
            f.write(f'{char}\n{target}\n')


if __name__ == "__main__":
    chars = ['核']
    main('https://wugniu.com/search?char={}&table=wenzhou', chars)
Likewise, following Python's DRY principle, you can set req.verify = False once on the session instead of passing verify=False to every single request.
Next step: look into threading or asynchronous programming to cut down your run time. In real-world projects we don't fetch URLs one at a time in a plain for loop (which counts as very slow); instead you send a batch of requests and wait for the responses together.
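That advice can be sketched with the standard library's ThreadPoolExecutor. The names fetch_all, fetch, and workers are my own illustration; in the real script, fetch would be a function that performs the session's get call and the Beautiful Soup parsing for one character.

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_all(chars, fetch, workers=8):
    """Run fetch(char) for every char concurrently; results come back
    in the same order as the input list."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fetch, chars))
```

Keep the worker count modest (and consider a small delay inside fetch) so the concurrency doesn't recreate the very rate-limiting problem the question ran into.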