Problem description
I am trying to create a spider that takes the words from a list one by one, submits each one to the search input of the quoted site, and then parses the text on the results page.
It works for a single word, but I can't get it to work for the whole list. I'm guessing I should (somehow) put the loop inside the spider?
My code is below; it was assembled from several other Stack Overflow suggestions. The problem is that the crawler only runs for the last word in words and ignores the rest of the list, and because of the "ReactorNotRestartable" error I can't put crawler.start() inside the loop.
import scrapy
from scrapy.crawler import CrawlerProcess

class FirstSpider(scrapy.Spider):
    name = 'ruscorpora'

    def start_requests(self):
        yield scrapy.Request('https://ruscorpora.ru/new/search-main.html', callback=self.form_input)

    def form_input(self, response):
        return scrapy.FormRequest.from_response(response, formdata={'req': the_word}, callback=self.parse_freq)

    def parse_freq(self, response):
        xpath = "/html/body/div[4]/p[3]/span[3]/text()"
        message = response.xpath(xpath).extract_first()
        if message is None:  # in case there isn't a word like that
            result.append(0)
        else:
            result.append(message)

words = ['parrot', 'patriot', 'partjbndonfc']
result = []
for the_word in words:
    crawler = CrawlerProcess()
    crawler.crawl(FirstSpider, the_word)
    crawler.start()
Solution
You can move the loop over the words inside the spider, so that a single crawl yields one FormRequest per word, like this:
import scrapy
from scrapy.crawler import CrawlerProcess

class FirstSpider(scrapy.Spider):
    name = 'ruscorpora'

    def start_requests(self):
        yield scrapy.Request('https://ruscorpora.ru/new/search-main.html', callback=self.form_input)

    def form_input(self, response):
        words = ['parrot', 'patriot', 'partjbndonfc']
        for word in words:
            yield scrapy.FormRequest.from_response(response, formdata={'req': word}, callback=self.parse_freq)

    def parse_freq(self, response):
        xpath = "/html/body/div[4]/p[3]/span[3]/text()"
        message = response.xpath(xpath).extract_first()
        if message is None:  # in case there isn't a word like that
            result.append(0)
        else:
            result.append(message)

result = []
crawler = CrawlerProcess()
crawler.crawl(FirstSpider)
crawler.start()
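The essential change is that the loop now lives inside the spider's callback, so one crawl (and one reactor start) yields a request per word, instead of trying to restart the CrawlerProcess once per word. The same pattern can be sketched in plain Python without any network access; the dicts below are hypothetical stand-ins for the FormRequest objects:

```python
# A minimal sketch of the fix: the loop sits inside the generator
# (like form_input above), so a single run produces one "request"
# per word instead of one run per word.
words = ['parrot', 'patriot', 'partjbndonfc']

def form_input(words):
    # Analogue of the spider callback: loop *inside*, yield each request.
    for word in words:
        yield {'req': word}  # stand-in for scrapy.FormRequest

requests = list(form_input(words))
print(len(requests))  # one request per word
```

Because Scrapy consumes the whole generator within one reactor run, every word in the list gets its own request, which is why this avoids the "ReactorNotRestartable" error entirely.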