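Problem description
I'm having trouble structuring my Scrapy data the way I want. My spider scrapes some data from one page, then follows each link in a list of links on that page to get a link from the next page.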
def parse_page(self, response):
    links = response.css(LINK_SELECTOR).extract()
    data = {
        'name': response.css(NAME_SELECTOR).extract_first(),
        'date': response.css(DATE_SELECTOR).extract(),
    }
    for link in links:
        next_link = response.urljoin(link)
        # Every request carries the same dict, and each callback yields its own item.
        yield scrapy.Request(next_link, callback=self.parse_url, meta={'data': data})

def parse_url(self, response):
    data = response.meta['data']
    data['url'] = response.css('a::attr(href)').get()
    yield data
The data I want to end up with has the following structure:
{'name': name,'date': date,'url': [url1,url2,url3,url4]}
instead of
{'name': name,'url': url1}
{'name': name,'url': url2}
{'name': name,'url': url3}
{'name': name,'url': url4}
I tried using items, but I don't know how to pass data from parse_url back to the parse_page function. What should I do?
Thanks.
Solution
You can do this easily with Scrapy's coroutine support.
The code would look something like this:
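async def parse_page(self, response):
    ...
    for link in links:
        request = response.follow(link)
        # Use a separate name so the listing page's response is not overwritten.
        # engine.download() is an internal API; newer Scrapy releases deprecate the second (spider) argument.
        link_response = await self.crawler.engine.download(request, self)
        urls.append(link_response.css('a::attr(href)').get())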
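To see why awaiting inside the callback gives a single item, here is a minimal sketch that uses only asyncio and no Scrapy at all. FAKE_PAGES and fake_download are hypothetical stand-ins for the real site and for the awaited download call; they are not Scrapy APIs.

```python
import asyncio

# Hypothetical in-memory "site": URL -> links on the page and the href found there.
FAKE_PAGES = {
    "/list": {"links": ["/a", "/b", "/c"], "href": None},
    "/a": {"links": [], "href": "url1"},
    "/b": {"links": [], "href": "url2"},
    "/c": {"links": [], "href": "url3"},
}


async def fake_download(url):
    """Stand-in for an awaited download; not a Scrapy API."""
    await asyncio.sleep(0)  # hand control back to the event loop, as real I/O would
    return FAKE_PAGES[url]


async def parse_page(url):
    page = await fake_download(url)
    item = {"name": "example", "url": []}
    for link in page["links"]:
        # The callback stays in control, so every result lands in the same dict.
        sub_page = await fake_download(link)
        item["url"].append(sub_page["href"])
    return item


result = asyncio.run(parse_page("/list"))
print(result)
```

Because the loop waits for each sub-page before moving on, the sub-requests run one after another. That is the trade-off for being able to build the whole item in one place.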
Here is another way to achieve this. The inline_requests library (the scrapy-inline-requests package) can help you get the expected output.
import scrapy
from scrapy.crawler import CrawlerProcess
from inline_requests import inline_requests


class YellowpagesSpider(scrapy.Spider):
    name = "yellowpages"
    start_urls = ["https://www.yellowpages.com/san-francisco-ca/mip/honey-honey-cafe-crepery-4752771"]

    @inline_requests
    def parse(self, response):
        data = {
            'name': response.css(".sales-info > h1::text").get(),
            'phone': response.css(".contact > p.phone::text").get(),
            'target_link': [],
        }
        for item_link in response.css(".review-info > a.author[href]::attr(href)").getall():
            # The decorator resumes this generator with the downloaded response.
            resp = yield scrapy.Request(response.urljoin(item_link), meta={'handle_httpstatus_all': True})
            target_link = resp.css("a.review-business-name::attr(href)").get()
            data['target_link'].append(target_link)
        print(data)


if __name__ == "__main__":
    c = CrawlerProcess({
        'USER_AGENT': 'Mozilla/5.0',
        'LOG_LEVEL': 'ERROR',
    })
    c.crawl(YellowpagesSpider)
    c.start()
The output it produces:
{'name': 'Honey Honey Cafe & Crepery','phone': '(415) 351-2423','target_link': ['/san-francisco-ca/mip/honey-honey-cafe-crepery-4752771','/walnut-ca/mip/akasaka-japanese-cuisine-455476824','/san-francisco-ca/mip/honey-honey-cafe-crepery-4752771']}
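The `resp = yield scrapy.Request(...)` line in the spider above relies on a generator being resumed with a value. The following is a simplified model of that idea in plain Python, not the library's actual implementation. FakeRequest, FakeResponse, fake_download and drive are all hypothetical names invented for this sketch.

```python
# Hypothetical stand-ins; these are not Scrapy or inline_requests classes.
class FakeRequest:
    def __init__(self, url):
        self.url = url


class FakeResponse:
    def __init__(self, url, href):
        self.url = url
        self.href = href


# Pretend site: review page URL -> business link found on that page.
FAKE_SITE = {"/review/1": "/biz/a", "/review/2": "/biz/b"}


def fake_download(request):
    return FakeResponse(request.url, FAKE_SITE[request.url])


def drive(callback_gen):
    """Run a generator callback, answering each yielded request with its response."""
    outputs = []
    try:
        yielded = next(callback_gen)
        while True:
            if isinstance(yielded, FakeRequest):
                # Resume the generator; the response becomes the value of `yield`.
                yielded = callback_gen.send(fake_download(yielded))
            else:
                # Anything that is not a request is treated as a finished item.
                outputs.append(yielded)
                yielded = next(callback_gen)
    except StopIteration:
        pass
    return outputs


def parse(links):
    data = {"name": "example", "target_link": []}
    for link in links:
        resp = yield FakeRequest(link)
        data["target_link"].append(resp.href)
    yield data


items = drive(parse(["/review/1", "/review/2"]))
print(items)
```

The callback reads like straight-line code, yet only one item comes out at the end, with every sub-page result collected in its list.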