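Beautiful Soup web scraper on the Binance announcements page lags by 5 minutes

Problem description

I built a web scraper with bs4 to get notified whenever a new announcement is published. For now I am testing with only the keyword "list" rather than the full set of announcement keywords. For some reason, when I compare the time at which the scraper detects a new announcement with the time the announcement was actually posted on the website, the scraper is 5 minutes behind.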

from bs4 import BeautifulSoup
from requests import get
import time
import sys

while True:
    time.sleep(30)
    # Data for the scraping
    url = "https://www.binance.com/en/support/announcement"
    response = get(url)
    html_page = response.content
    soup = BeautifulSoup(html_page, 'html.parser')
    news_list = soup.find_all(class_='css-qinc3w')

    # Create a bag of key words for getting matches
    # (not used yet: only "list" is tested below)
    key_words = ['list','token sale','open Trading','opens Trading','perpetual','defi','uniswap','airdrop','adds','updates','enabled','Trade','support']

    # Matching announcements found in this pass
    updated_list = []

    for news in news_list:
        article_text = news.text

        if "list" in article_text.lower():
            updated_list.append([article_text])

        # Stop as soon as a fifth matching announcement shows up
        if len(updated_list) > 4:
            print(time.asctime())
            print(article_text)
            sys.exit()

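When the length of the list grew by 1 to 5, the script printed the following time and new announcement: Fri May 28 04:17:39 2021, Binance Will List Livepeer (LPT)

I am not sure why this happens. At first I thought I was being throttled, but after checking robots.txt I see no reason why I would be, and the 30-second sleep I added should be enough to scrape the page without any problems. Any help or alternative solution would be greatly appreciated.

My question is:

Why is it 5 minutes late? Why does the program not notify me as soon as the announcement is posted, instead needing an extra 5 minutes to recognize that there is a new post?

Solution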

from xrzz import http  # give it a try using my simple from-scratch module
import json

# JSON endpoint that lists the announcements, queried directly instead of parsing the HTML page
url = "https://www.binance.com/bapi/composite/v1/public/cms/article/list/query?type=1&pageNo=1&pageSize=30"

req = http("GET",url=url,tls=True).body().decode()

key_words = ['list','token sale','open trading','opens trading','perpetual','defi','uniswap','airdrop','adds','updates','enabled','trade','support']

for i in json.loads(req)['data']['catalogs']:
    for o in i['articles']:
        # Compare in lower case so that titles such as "Binance Will List ..." match
        if key_words[0] in o['title'].lower():
            print(o['title'])
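
To turn this into the notifier the question asks for, the same endpoint can be polled in a loop while remembering which titles have already been seen. The following is only a rough sketch, not a tested implementation: it swaps the xrzz module for the standard library's urllib.request, it assumes the response keeps the data -> catalogs -> articles -> title layout used above, and the function names (extract_titles, find_new_matches, fetch_payload, watch) are made up for illustration.

```python
import json
import time
import urllib.request

# URL taken from the answer above; the response layout is assumed to match it.
API_URL = ("https://www.binance.com/bapi/composite/v1/public/cms/article/"
           "list/query?type=1&pageNo=1&pageSize=30")


def extract_titles(payload):
    """Collect every article title from data -> catalogs -> articles."""
    return [article["title"]
            for catalog in payload["data"]["catalogs"]
            for article in catalog["articles"]]


def find_new_matches(titles, seen, keywords):
    """Return unseen titles containing any keyword, then mark all titles as seen."""
    fresh = [t for t in titles
             if t not in seen and any(k in t.lower() for k in keywords)]
    seen.update(titles)
    return fresh


def fetch_payload(url=API_URL):
    """Download and decode the JSON document (hypothetical helper)."""
    with urllib.request.urlopen(url, timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))


def watch(keywords=("list",), interval=30):
    """Poll forever, printing each newly seen matching title with a timestamp."""
    seen = set(extract_titles(fetch_payload()))  # baseline: nothing is new yet
    while True:
        time.sleep(interval)
        titles = extract_titles(fetch_payload())
        for title in find_new_matches(titles, seen, keywords):
            print(time.asctime(), title)
```

Calling watch() starts the loop with the 30-second interval from the question. Tracking seen titles in a set avoids relying on the number of matches exceeding a fixed count, which is what the question's script does.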

