How to use Scrapy FormRequest in a loop

Question

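I'm trying to write a spider that takes words from a list one by one, submits each to the search input on the referenced site, and then parses text from the results page.

It works for a single word, but I can't get it to work for the whole list. I guess I should (somehow) put the loop inside the spider?

My code is below. It is a compilation of several other Stack Overflow suggestions. The problem is that the crawler only picks up the last word in words and ignores the rest of the list. I can't put crawler.start() inside the loop because of the "ReactorNotRestartable" error.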

import scrapy
from scrapy.crawler import CrawlerProcess

class FirstSpider(scrapy.Spider):
    name = 'ruscorpora'

    def start_requests(self):
        yield scrapy.Request('https://ruscorpora.ru/new/search-main.html', callback=self.form_input)

    def form_input(self, response):
        return scrapy.FormRequest.from_response(response, formdata={'req': the_word}, callback=self.parse_freq)
    
    def parse_freq(self,response):
        xpath = "/html/body/div[4]/p[3]/span[3]/text()"
        message = response.xpath(xpath).extract_first()
        
        if message is None:            #in case there isn't a word like that
            result.append(0)
        else:
            result.append(message)

words = ['parrot','patriot','partjbndonfc']
result = []

for the_word in words:
    crawler = CrawlerProcess()
    crawler.crawl(FirstSpider,the_word)

crawler.start()

Answer

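You can do it like this: move the loop into the spider, so that the search page is requested once and the form_input callback yields one FormRequest per word.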

import scrapy
from scrapy.crawler import CrawlerProcess

class FirstSpider(scrapy.Spider):
    name = 'ruscorpora'

    def start_requests(self):
        # Load the search page once; its form is reused for every word.
        yield scrapy.Request('https://ruscorpora.ru/new/search-main.html', callback=self.form_input)

    def form_input(self, response):
        words = ['parrot', 'patriot', 'partjbndonfc']
        # Yield one form submission per word from the same search page.
        for word in words:
            yield scrapy.FormRequest.from_response(response, formdata={'req': word}, callback=self.parse_freq)

    def parse_freq(self, response):
        xpath = "/html/body/div[4]/p[3]/span[3]/text()"
        message = response.xpath(xpath).extract_first()

        if message is None:            # in case there isn't a word like that
            result.append(0)
        else:
            result.append(message)

result = []

crawler = CrawlerProcess()
crawler.crawl(FirstSpider)

# start() runs the reactor and blocks until the crawl finishes; call it only once.
crawler.start()