Problem description
I have a spider that walks through all the sitemaps, appends &com=1 to the end of each URL, and requests it to get the title and the comments. But for some reason the requests don't go through, or the XPath doesn't find anything. I know that if we already have a Request we need to use replace, but does that still apply when we are iterating inside the loop? And if so, how do we do the replacement when we no longer have that method available?
Test URL: https://www.delfi.lt/news/daily/hot/apsinuoginusios-moterys-sutrikde-madu-sou.d?id=112&com=1
The XPath works in the site's developer-console.
Code:
import scrapy
from urllib.parse import urlencode

class MySpider(scrapy.Spider):
    name = "delfi"
    root = 'http://www.delfi.lt'
    start_urls = ['https://www.delfi.lt/sitemap.xml']
    custom_settings = {
        'USER_AGENT': "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:12.0) Gecko/20100101 Firefox/12.0",
        'ITEM_PIPELINES': {'__main__.ArticlesPipeline': 300},
        'HTTPCACHE_ENABLED': True,
        'HTTPCACHE_EXPIRATION_SECS': 0,
        'DOWNLOAD_DELAY': 0.1,
        'LOG_LEVEL': 'INFO',
        'LOG_FILE': 'delfi_scraping_logs.log',
        'COOKIES_ENABLED': False,
    }

    def try_to_do(self, func, arg):
        try:
            return func(arg)
        except Exception:
            self.logger.exception('trying to do: ')

    def parse(self, response):
        response.selector.remove_namespaces()
        sitemaps = response.xpath('//loc/text()').extract()
        for sitemap in sitemaps:
            yield scrapy.Request(sitemap, callback=self.parse_sitemap)

    def parse_sitemap(self, response):
        response.selector.remove_namespaces()
        articles = response.xpath('//loc/text()').extract()
        self.logger.info(f'starting articles from sitemap {response.url}')
        for article in articles:
            if "delfi.lt/video" in article or "delfi.lt/apps" in article or \
                    "delfi.lt/temos/" in article or 'delfi.lt/images/' in article:  # or article in self.existing_urls
                self.logger.info(f'skipping {article}')
                continue
            ### This does not work
            # new_article = article + '&com=1'  # direct change did not help
            payload = {'com': 1}
            ### This does not work
            yield scrapy.Request(article + "&" + urlencode(payload), callback=self.parse_article)

    def manage_sequence_of_strings(self, seq):
        return ' '.join([s.strip() for s in seq]).replace('\xa0', ' ').replace('\n', ' ').replace('  ', ' ').strip()

    def get_fields(self, response):
        kwargs = {k: self.try_to_do(v, response) for k, v in [('author', self.xauthor), ('title', self.xtitle)]}
        return kwargs

    def xauthor(self, resp):
        author = resp.xpath('//div[@class="delfi-source-name"]/text()').extract()
        if len(author) > 0:
            return author[0]

    def xtitle(self, resp):
        return self.manage_sequence_of_strings(
            resp.xpath('//*[@class="article-title"]//text()').extract())

    def parse_article(self, response):
        url = response.url
        print('URL = ' + url)
        kwargs = self.get_fields(response)
        yield {'url': url, 'website': 'delfi', 'category': None, 'success': all(kwargs.values()), **kwargs}
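A likely cause of the failing requests: the article URLs taken from the sitemap, unlike the test URL above, usually carry no query string yet, so appending "&com=1" produces something like .../article.d&com=1, which the server treats as a different path rather than as a parameter. A minimal, framework-independent sketch of appending the parameter with the correct separator (add_query_param is a hypothetical helper name, not part of Scrapy):

```python
from urllib.parse import parse_qsl, urlencode, urlparse, urlunparse

def add_query_param(url, key, value):
    """Append (or overwrite) one query parameter, using '?' when the
    URL has no query string yet and '&' when it already has one."""
    parts = urlparse(url)
    query = dict(parse_qsl(parts.query))  # existing parameters, if any
    query[key] = value
    return urlunparse(parts._replace(query=urlencode(query)))

# A bare article URL gains '?com=1'; one that already has parameters
# gains '&com=1' instead.
print(add_query_param('https://www.delfi.lt/news/a.d', 'com', 1))
print(add_query_param('https://www.delfi.lt/news/a.d?id=112', 'com', 1))
```

In the spider, the `yield scrapy.Request(article + "&" + urlencode(payload), ...)` line could then become `yield scrapy.Request(add_query_param(article, 'com', 1), callback=self.parse_article)`. Scrapy's own dependency w3lib also ships an equivalent helper, `w3lib.url.add_or_replace_parameter(url, name, value)`.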