Problem Description
I want to post crawled items to a REST API in batches, once BATCH_SIZE of them have accumulated.
After the images are downloaded, where should I obtain the images' absolute paths so that I can post the crawled items to the REST API?
I deploy the project with scrapyd.
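For context, the ImagesPipeline has to be enabled and given a storage directory in settings.py; a typical configuration might look like the following sketch (the module path and all values are assumptions, not from the original question):

```python
# settings.py (illustrative values)
ITEM_PIPELINES = {
    "myproject.pipelines.MyImagesPipeline": 1,  # assumed module path
}
IMAGES_STORE = "/var/data/images"  # ImagesPipeline saves files under this directory
BATCH_SIZE = 100                   # custom setting used for batch posting
```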
items.py

from scrapy import Field, Item

class MyItem(Item):
    name = Field()
    images = Field()
    image_urls = Field()
    image_paths = Field()
pipelines.py

import scrapy
from itemadapter import ItemAdapter
from scrapy.exceptions import DropItem
from scrapy.pipelines.images import ImagesPipeline

class MyImagesPipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        for image_url in item['image_urls']:
            yield scrapy.Request(image_url)

    def item_completed(self, results, item, info):
        # `results` is a list of (success, file_info) tuples;
        # file_info['path'] is relative to IMAGES_STORE.
        image_paths = [x['path'] for ok, x in results if ok]
        if not image_paths:
            raise DropItem("Item contains no images")
        adapter = ItemAdapter(item)
        adapter['image_paths'] = image_paths
        return item
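On the absolute-path part of the question: ImagesPipeline stores files under the IMAGES_STORE directory and reports paths relative to it, so the absolute path is simply the join of the two. A minimal sketch (the IMAGES_STORE value and the helper name are assumptions for illustration):

```python
import os

# Assumed value of the IMAGES_STORE setting
IMAGES_STORE = "/var/data/images"

def to_absolute_paths(image_paths, store_dir=IMAGES_STORE):
    """Turn the relative paths produced by ImagesPipeline
    (e.g. 'full/<sha1>.jpg') into absolute filesystem paths."""
    return [os.path.join(store_dir, p) for p in image_paths]

print(to_absolute_paths(["full/0a1b2c.jpg"]))
```

Inside the pipeline itself, when a filesystem store is used, the same base directory should also be available as `self.store.basedir`.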
middlewares.py

from scrapy import Request, signals

BATCH_SIZE = 100  # assumed constant; could also come from settings

class FooSpiderMiddleware(object):
    def __init__(self):
        # instance attribute, not a bare statement in the class body
        self.bulk_items = []

    @classmethod
    def from_crawler(cls, crawler):
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_spider_input(self, response, spider):
        return None

    def process_spider_output(self, result, spider):
        # `result` mixes items and follow-up Requests; buffer the items here
        result_list = list(result)
        if result_list and isinstance(result_list[-1], Request):
            self.bulk_items.extend(result_list[:-1])
        else:
            self.bulk_items.extend(result_list)
        if len(self.bulk_items) >= BATCH_SIZE:
            # post here
            self.bulk_items = []
        for i in result_list:
            yield i

    def process_spider_exception(self, exception, spider):
        pass

    def process_start_requests(self, start_requests, spider):
        for r in start_requests:
            yield r

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)
Solution
No effective solution for this problem has been posted yet.
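One relevant observation: `image_paths` is only populated after `item_completed` runs in the images pipeline, so a later item pipeline (ordered after the images pipeline) is a more natural place to batch and post than a spider middleware. A minimal sketch of just the buffering logic, decoupled from the HTTP call (`BatchPoster` and `post_fn` are hypothetical names; `post_fn` could wrap `requests.post`):

```python
class BatchPoster:
    """Buffer items and hand each full batch to `post_fn`."""

    def __init__(self, batch_size, post_fn):
        self.batch_size = batch_size
        self.post_fn = post_fn  # called with a list of items
        self.buffer = []

    def add(self, item):
        self.buffer.append(item)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        # Also call this from close_spider() so the last partial batch is sent
        if self.buffer:
            self.post_fn(self.buffer)
            self.buffer = []
```

In a pipeline, `process_item` would call `poster.add(dict(item))` and `close_spider` would call `poster.flush()`.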