python – Running multiple spiders in a for loop

I am trying to instantiate multiple spiders. The first one works fine, but the second one gives me the error ReactorNotRestartable.

Feeds = {
    'nasa': {
        'name': 'nasa',
        'url': 'https://www.nasa.gov/rss/dyn/breaking_news.rss',
        'start_urls': ['https://www.nasa.gov/rss/dyn/breaking_news.rss']
    },
    'xkcd': {
        'name': 'xkcd',
        'url': 'http://xkcd.com/rss.xml',
        'start_urls': ['http://xkcd.com/rss.xml']
    }
}

With the dict above, I try to run both spiders in a loop, like this:

from scrapy.crawler import CrawlerProcess
from scrapy.spiders import XMLFeedSpider

class MySpider(XMLFeedSpider):

    name = None

    def __init__(self, **kwargs):
        this_feed = Feeds[self.name]
        self.start_urls = this_feed.get('start_urls')
        self.iterator = 'iternodes'
        self.itertag = 'items'
        super(MySpider, self).__init__(**kwargs)

    def parse_node(self, response, node):
        pass


def start_crawler():
    process = CrawlerProcess({
        'USER_AGENT': config['USER_AGENT'],
        'DOWNLOAD_HANDLERS': {'s3': None}  # boto issues
    })

    for feed_name in Feeds.keys():
        MySpider.name = feed_name
        process.crawl(MySpider)
        process.start()

The exception on the second loop iteration looks like this; the spider opens, but then:

...
2015-11-22 00:00:00 [scrapy] INFO: Spider opened
2015-11-22 00:00:00 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-11-22 00:00:00 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-11-21 23:54:05 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
Traceback (most recent call last):
  File "env/bin/start_crawler", line 9, in <module>
    load_entry_point('Feed-crawler==0.0.1', 'console_scripts', 'start_crawler')()
  File "/Users/bling/py-Feeds-crawler/Feed_crawler/crawl.py", line 51, in start_crawler
    process.start() # the script will block here until the crawling is finished
  File "/Users/bling/py-Feeds-crawler/env/lib/python2.7/site-packages/scrapy/crawler.py", line 251, in start
    reactor.run(installSignalHandlers=False)  # blocking call
  File "/usr/local/lib/python2.7/site-packages/twisted/internet/base.py", line 1193, in run
    self.startRunning(installSignalHandlers=installSignalHandlers)
  File "/usr/local/lib/python2.7/site-packages/twisted/internet/base.py", line 1173, in startRunning
    ReactorBase.startRunning(self)
  File "/usr/local/lib/python2.7/site-packages/twisted/internet/base.py", line 684, in startRunning
    raise error.ReactorNotRestartable()
twisted.internet.error.ReactorNotRestartable

Do I have to invalidate the first MySpider somehow, or am I doing something wrong and need to change how this works? Thanks in advance.

Solution

Looks like you have to instantiate one process per spider; try:

def start_crawler():

    for feed_name in Feeds.keys():
        process = CrawlerProcess({
            'USER_AGENT': config['USER_AGENT'],
            'DOWNLOAD_HANDLERS': {'s3': None}  # boto issues
        })
        MySpider.name = feed_name
        process.crawl(MySpider)
        process.start()
