Spider web scraper does not visit new URLs

Problem description

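I have a spider that walks every sitemap, appends &com=1 to the end of each URL, and requests that URL to get the title and comments. For some reason, though, either the request does not go through or the XPath finds nothing. I know that when we have a request we need to use replace, but does that still apply when we are inside a for-each loop? If so, how do we do the replacement when that method is no longer available?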

Test URL: https://www.delfi.lt/news/daily/hot/apsinuoginusios-moterys-sutrikde-madu-sou.d?id=112&com=1
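Appending `&com=1` by plain string concatenation only produces a valid URL when the address already has a query string. As a minimal sketch using only the standard library, the parameter can be merged into the existing query instead. The helper name `add_query_param` is hypothetical and not part of Scrapy.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def add_query_param(url, key, value):
    """Return `url` with `key=value` set in its query string.

    Hypothetical helper for illustration: it replaces an existing value
    for `key`, so calling it twice does not duplicate the parameter.
    """
    parts = urlsplit(url)
    # Keep every existing parameter except the one being set.
    query = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
             if k != key]
    query.append((key, str(value)))
    return urlunsplit(parts._replace(query=urlencode(query)))


if __name__ == "__main__":
    base = ("https://www.delfi.lt/news/daily/hot/"
            "apsinuoginusios-moterys-sutrikde-madu-sou.d?id=112")
    print(add_query_param(base, "com", 1))
```

Inside the spider, the result could be passed straight to `scrapy.Request` in place of the concatenated string.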

The XPath works in the developer console on the site.
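An XPath that matches in the browser console is evaluated against the DOM after JavaScript has run, whereas Scrapy only sees the raw HTML returned by the server. Whether that is the cause here is an assumption, not something confirmed for this site. The sketch below is one rough way to check whether a given class name appears in a raw HTML string at all, using only the standard library. `ClassTextFinder` and `find_class_text` are hypothetical names, and the sample HTML is made up.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not change nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class ClassTextFinder(HTMLParser):
    """Collect text found inside elements that carry a given CSS class."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.depth = 0    # greater than 0 while inside a matching element
        self.texts = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.depth or self.class_name in classes:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth and data.strip():
            self.texts.append(data.strip())


def find_class_text(html, class_name):
    """Return the text fragments inside elements with `class_name`."""
    finder = ClassTextFinder(class_name)
    finder.feed(html)
    finder.close()
    return finder.texts


if __name__ == "__main__":
    # Made-up sample standing in for a saved response body.
    sample = ('<html><body><h1 class="article-title">Hello <b>world</b></h1>'
              '<div id="comments"></div></body></html>')
    print(find_class_text(sample, "article-title"))
    print(find_class_text(sample, "delfi-source-name"))
```

If the class is missing from the body Scrapy downloaded (for example, a file saved from `response.text`), the XPath cannot match no matter how the request URL is built.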

Code:

import scrapy
from urllib.parse import urlencode


class MySpider(scrapy.Spider):

    name = "delfi"
    root = 'http://www.delfi.lt'
    start_urls = ['https://www.delfi.lt/sitemap.xml']
    custom_settings = {
        'USER_AGENT': "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:12.0) Gecko/20100101 Firefox/12.0",
        # The key was misspelled as 'ITEM_PIPELInes', so the pipeline was never enabled.
        'ITEM_PIPELINES': {'__main__.ArticlesPipeline': 300},
        'HTTPCACHE_ENABLED': True,
        'HTTPCACHE_EXPIRATION_SECS': 0,  # 0 means cached responses never expire
        'DOWNLOAD_DELAY': 0.1,
        'LOG_LEVEL': 'INFO',
        'LOG_FILE': 'delfi_scraping_logs.log',
        'COOKIES_ENABLED': False,
    }
    

    
    def try_to_do(self,func,arg):
        try:
            return func(arg)
        except Exception:
            self.logger.exception('trying to do: ')

    def parse(self,response):
        response.selector.remove_namespaces()
        sitemaps = response.xpath('//loc/text()').extract()

        for sitemap in sitemaps:
            yield scrapy.Request(sitemap,callback=self.parse_sitemap)


    def parse_sitemap(self,response):
        response.selector.remove_namespaces()
        articles = response.xpath('//loc/text()').extract()
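        # Use the spider's own logger; a bare `logger` name is not defined here.
        self.logger.info(f'starting articles from sitemap {response.url}')
        skip_markers = ("delfi.lt/video", "delfi.lt/apps",
                        "delfi.lt/temos/", "delfi.lt/images/")
        for article in articles:
            # Skip sections that are not regular articles.
            # (disabled check: or article in self.existing_urls)
            if any(marker in article for marker in skip_markers):
                self.logger.info(f'skipping {article}')
                continue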
            
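            ### This does not work

            # new_article = article + '&com=1' #direct change did not help
            payload = {'com': 1}
            # Use '?' when the URL has no query string yet, otherwise '&'.
            separator = '&' if '?' in article else '?'

            ### This does not work

            yield scrapy.Request(article + separator + urlencode(payload), callback=self.parse_article)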
            

    def manage_sequence_of_strings(self,seq):
        # split() with no arguments splits on any whitespace run (including
        # '\xa0' and '\n'), so joining with ' ' collapses every run to one space.
        return ' '.join(' '.join(seq).split())

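    def get_fields(self,response):
        # Run each extractor through try_to_do so one failure does not abort the item.
        extractors = [('author', self.xauthor), ('title', self.xtitle)]
        return {name: self.try_to_do(func, response) for name, func in extractors}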

    

    def xauthor(self,resp):
        author = resp.xpath('//div[@class="delfi-source-name"]/text()').extract()
        if len(author) > 0:
            return author[0]


    def xtitle(self,resp):
        return self.manage_sequence_of_strings(        
            resp.xpath('//*[@class="article-title"]//text()').extract())




    def parse_article(self,response):
        url = response.url
        print('URl = ' + url)
        
        kwargs = self.get_fields(response)
        
        yield {'url': url,'website':'delfi','category': None,'success': all(kwargs.values()),**kwargs}
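To debug the sitemap step outside Scrapy, the same `<loc>` extraction and skip filter can be reproduced with the standard library. This is a sketch under assumptions: `extract_locs` and `should_skip` are hypothetical names, the sample XML is made up, and the namespace is the standard sitemaps.org one, which the real sitemap is assumed to use.

```python
import xml.etree.ElementTree as ET

# Standard sitemap namespace (assumed to match the site's sitemap files).
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Same markers the spider uses to skip non-article sections.
SKIP_MARKERS = ("delfi.lt/video", "delfi.lt/apps",
                "delfi.lt/temos/", "delfi.lt/images/")


def extract_locs(xml_text):
    """Return every <loc> value from a sitemap or sitemap index document."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip()
            for loc in root.iterfind(".//sm:loc", SITEMAP_NS)
            if loc.text]


def should_skip(url):
    """Return True for URLs the spider would skip."""
    return any(marker in url for marker in SKIP_MARKERS)


if __name__ == "__main__":
    # Made-up sample; note that '&' is escaped as '&amp;' inside XML.
    sample = (
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
        '<url><loc>https://www.delfi.lt/news/a.d?id=1&amp;x=2</loc></url>'
        '<url><loc> https://www.delfi.lt/video/b.d?id=2 </loc></url>'
        '</urlset>'
    )
    for url in extract_locs(sample):
        print(url, "skip" if should_skip(url) else "keep")
```

Printing the extracted URLs this way shows exactly what string `&com=1` would be appended to, including whether a query string is already present.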


python scrapy sitemap
