Scrapy infinite loop with CrawlerProcess

Problem description

I am currently running Scrapy v2.5 and I want to run an infinite loop. My code:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings


class main():

    def bucle(self, array_spider, process):
        mongo = mongodb(setting)  # mongodb is my own helper class
        for spider_name in array_spider:
            process.crawl(spider_name, params={"mongo": mongo, "spider_name": spider_name})
        process.start()
        process.stop()
        mongo.close_mongo()


if __name__ == "__main__":
    setting = get_project_settings()
    while True:
        process = CrawlerProcess(setting)
        array_spider = process.spider_loader.list()
        class_main = main()
        class_main.bucle(array_spider, process)

But this results in the following error:

Traceback (most recent call last):
  File "run_scrapy.py", line 92, in <module>
    process.start()
  File "/usr/local/lib/python3.8/dist-packages/scrapy/crawler.py", line 327, in start
    reactor.run(installSignalHandlers=False)  # blocking call
  File "/usr/local/lib/python3.8/dist-packages/twisted/internet/base.py", line 1422, in run
    self.startRunning(installSignalHandlers=installSignalHandlers)
  File "/usr/local/lib/python3.8/dist-packages/twisted/internet/base.py", line 1404, in startRunning
    ReactorBase.startRunning(cast(ReactorBase, self))
  File "/usr/local/lib/python3.8/dist-packages/twisted/internet/base.py", line 843, in startRunning
    raise error.ReactorNotRestartable()
twisted.internet.error.ReactorNotRestartable

Can anyone help me?

Solution

AFAIK there is no easy way to restart a spider, but there is an alternative: make the spider never close. For that you can use the spider_idle signal.

From the documentation:

Sent when a spider has gone idle, which means the spider has no further:
* requests waiting to be downloaded
* requests scheduled
* items being processed in the item pipeline

You can also find an example of how to use signals in the official documentation.
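
For illustration, here is a minimal sketch of that pattern (not from the original answer): a spider that, when the spider_idle signal fires, re-schedules its start URL and raises DontCloseSpider so it is never shut down. The spider name and URL are placeholders, and engine.crawl(request, spider) matches the Scrapy 2.5 API mentioned in the question (newer Scrapy versions take only the request).

import scrapy
from scrapy import signals
from scrapy.exceptions import DontCloseSpider


class ForeverSpider(scrapy.Spider):
    # Hypothetical spider: name and start URL are placeholders.
    name = "forever"
    start_urls = ["https://example.com"]

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super().from_crawler(crawler, *args, **kwargs)
        # Call handle_idle every time the spider runs out of work.
        crawler.signals.connect(spider.handle_idle, signal=signals.spider_idle)
        return spider

    def handle_idle(self, spider):
        # Feed the scheduler fresh requests so there is always work queued;
        # dont_filter=True bypasses the duplicate filter for repeated URLs.
        for url in self.start_urls:
            self.crawler.engine.crawl(scrapy.Request(url, dont_filter=True), spider)
        # Tell Scrapy not to close the spider.
        raise DontCloseSpider

    def parse(self, response):
        self.logger.info("Crawled %s", response.url)

Raising DontCloseSpider from the handler is what keeps the crawl alive; without it, Scrapy closes the spider as soon as the idle signal fires with nothing left in the queue, and you are back to the reactor-restart problem from the question.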