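Error: web scraper does not reconnect to the page, but works after a restart

Problem description

I am scraping a website, and sometimes the crawl stalls with the log output below and never reconnects to the target page: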

2020-08-18 22:37:30 [rotating_proxies.middlewares] INFO: Proxies(good: 1,dead: 0,unchecked: 0,reanimated: 0,mean backoff time: 0s)
2020-08-18 22:38:00 [scrapy.extensions.logstats] INFO: Crawled 116421 pages (at 35 pages/min),scraped 116421 items (at 35 items/min)
2020-08-18 22:38:00 [rotating_proxies.middlewares] INFO: Proxies(good: 1,mean backoff time: 0s)
2020-08-18 22:38:30 [rotating_proxies.middlewares] INFO: Proxies(good: 1,mean backoff time: 0s)
2020-08-18 22:39:00 [scrapy.extensions.logstats] INFO: Crawled 116421 pages (at 0 pages/min),scraped 116421 items (at 0 items/min)
2020-08-18 22:39:00 [rotating_proxies.middlewares] INFO: Proxies(good: 1,mean backoff time: 0s)
2020-08-18 22:39:30 [rotating_proxies.middlewares] INFO: Proxies(good: 1,mean backoff time: 0s)
2020-08-18 22:40:00 [scrapy.extensions.logstats] INFO: Crawled 116421 pages (at 0 pages/min),scraped 116421 items (at 0 items/min)
2020-08-18 22:40:00 [rotating_proxies.middlewares] INFO: Proxies(good: 1,mean backoff time: 0s)
2020-08-18 22:40:30 [rotating_proxies.middlewares] INFO: Proxies(good: 1,mean backoff time: 0s)
2020-08-18 22:41:00 [scrapy.extensions.logstats] INFO: Crawled 116421 pages (at 0 pages/min),scraped 116421 items (at 0 items/min)
2020-08-18 22:41:00 [rotating_proxies.middlewares] INFO: Proxies(good: 1,mean backoff time: 0s)
2020-08-18 22:41:30 [rotating_proxies.middlewares] INFO: Proxies(good: 1,mean backoff time: 0s)
2020-08-18 22:42:00 [scrapy.extensions.logstats] INFO: Crawled 116421 pages (at 0 pages/min),scraped 116421 items (at 0 items/min)
2020-08-18 22:42:00 [rotating_proxies.middlewares] INFO: Proxies(good: 1,mean backoff time: 0s)
2020-08-18 22:42:30 [rotating_proxies.middlewares] INFO: Proxies(good: 1,mean backoff time: 0s)
2020-08-18 22:43:00 [scrapy.extensions.logstats] INFO: Crawled 116421 pages (at 0 pages/min),scraped 116421 items (at 0 items/min)
2020-08-18 22:43:00 [rotating_proxies.middlewares] INFO: Proxies(good: 1,mean backoff time: 0s)

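I use rotating proxies, which are refreshed every hour. I tried the same proxy with another spider, and it works fine on the same page. What could the problem be, and how can I save the data that has already been scraped?

Code: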

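import scrapy


class Pool(scrapy.Spider):
    name = 'pool'

    # Raw string so the backslash in the Windows path is taken literally;
    # the with-block closes the file once the URLs are read.
    with open(r"D:\links.txt") as f:
        start_urls = [line.strip() for line in f if line.strip()]

    def parse(self, response):
        # NOTE: "/html/[6]" is the XPath as posted; it is not valid XPath
        # (no node name before the predicate) and looks truncated.
        pool1 = response.xpath("/html/[6]").get('').strip()
        yield {'Pool1': pool1, 'Url ': response.url}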

Settings:

BOT_NAME = 'Pool'

SPIDER_MODULES = ['Pool.spiders']
NEWSPIDER_MODULE = 'Pool.spiders'

ROBOTSTXT_OBEY = False
FEED_EXPORTERS = {
    'xlsx': 'scrapy_xlsx.XlsxItemExporter',
}
DOWNLOAD_TIMEOUT = 3600
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'scrapy_user_agents.middlewares.RandomUserAgentMiddleware': 400,
    'rotating_proxies.middlewares.RotatingProxyMiddleware': 610,
    'rotating_proxies.middlewares.BanDetectionMiddleware': 620,
}
COOKIES_ENABLED = False
ROTATING_PROXY_LIST = [
    'IPproxyhttp',
]

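Solution

I suspect the page, or all of the proxies, dropped the connection at the same time, and the spider is now waiting for DOWNLOAD_TIMEOUT to expire. With DOWNLOAD_TIMEOUT = 3600, a stalled request can hang for up to an hour before Scrapy gives up on it (the Scrapy default is 180 seconds).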
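
If that is the cause, one thing to try is a much shorter timeout combined with retries, so that a hung request fails quickly and is rescheduled instead of blocking the crawl. The following is only a sketch of additions to settings.py: the values 60 and 3 and the crawls/pool-1 directory are assumptions chosen for illustration, not values tuned for this site.

```python
# Sketch of settings.py additions (illustrative values, not tuned).

# Give up on a request after 60 seconds instead of 3600, so a dead
# connection is noticed within a minute. 60 is an assumed value.
DOWNLOAD_TIMEOUT = 60

# Let Scrapy's built-in retry middleware reschedule failed requests,
# including ones that timed out. 3 is an assumed value.
RETRY_ENABLED = True
RETRY_TIMES = 3

# Persist the scheduler queue and the seen-request filter on disk so
# an interrupted crawl can be resumed later. The directory name is
# hypothetical; use a separate directory for each crawl job.
JOBDIR = 'crawls/pool-1'
```

To keep the rows that were already scraped, it may also help to export to a line-oriented format, for example `scrapy crawl pool -o items.jl`, since JSON Lines output is written item by item. I have not checked how the xlsx exporter behaves when a run is interrupted. Pressing Ctrl+C once asks Scrapy to shut down gracefully, and with JOBDIR set, running the same command again resumes the crawl where it stopped.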
