Scrapy web crawler keywords

Problem description

I have a Scrapy web crawler that is meant to follow links containing one of a set of keywords. I believe this is happening, but my output CSV file is missing many of the links and I can't work out why. The spider is below.

import os
import re

from scrapy import Item
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


def find_all_substrings(string, sub):
    # Return the start index of every occurrence of `sub` in `string`.
    starts = [match.start() for match in re.finditer(re.escape(sub), string)]
    return starts

class GraphSpider(CrawlSpider):
    name = 'agavecovidbot'
    custom_settings = {
        'DEPTH_LIMIT': 2,
    }
    # Domain entries must be strings, and start URLs need an explicit scheme.
    allowed_domains = ["domain1.com"]
    start_urls = ["https://domain1.com/specifiedpage1",
                  "https://domain1.com/specifiedpage2"]
    rules = [Rule(LinkExtractor(), follow=True, callback="parse_item")]

    def parse_item(self, response):
        print("***********************************************************stop")
        # Record keyword hits, outgoing links, and the raw page for every crawled response.
        yield self.check_buzzwords(response)
        yield self.fetch_links(response)
        yield self.response_downloaded(response)
        
    def check_buzzwords(self, response):
        wordlist = ["key1", "key2"]

        url = response.url
        # headers.get() returns bytes, so the default must be bytes as well.
        contenttype = response.headers.get("content-type", b"").decode("utf-8").lower()
        # response.text decodes the body with the detected encoding instead of assuming UTF-8.
        data = response.text

        # Append one CSV row per keyword occurrence found in the page body.
        for word in wordlist:
            substrings = find_all_substrings(data, word)
            for pos in substrings:
                append_write = 'a' if os.path.exists('1url-to-key.csv') else 'w'
                with open('1url-to-key.csv', append_write) as url_f:
                    url_f.write(word + "&,&" + url + "\n")
        return Item()
    
    def fetch_links(self, response):
        # Record every outgoing link as a "source&,&target" row.
        links = LinkExtractor(canonicalize=True, unique=True).extract_links(response)
        for link in links:
            append_write = 'a' if os.path.exists('1url-to-url.csv') else 'w'
            with open('1url-to-url.csv', append_write) as url_f:
                url_f.write(response.url + "&,&" + link.url + "\n")
        return Item()

    def _requests_to_follow(self, response):
        # Only follow links from responses that expose a text encoding;
        # skip binary responses such as images or PDFs.
        if getattr(response, "encoding", None) is not None:
            return CrawlSpider._requests_to_follow(self, response)
        else:
            return []
        
    def response_downloaded(self, response):
        # Save the raw page, then hand the response back to the rule that matched it.
        filename = response.url.split("/")[-1] + '.html'
        with open(filename, 'wb') as f:
            f.write(response.body)
        # The request metadata lives on response.meta (lowercase), not response.Meta.
        rule = self._rules[response.meta['rule']]
        return self._parse_response(response, rule.callback, rule.cb_kwargs, rule.follow)

Any suggestions on how to fix the url-to-key CSV file so that it contains all the combinations I expect to see? Thanks!

Solution

No confirmed fix for this problem has been collected yet.
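
One direction that might be worth trying (a sketch only, not a verified fix for this spider): yield plain dictionaries from the callback and let Scrapy's built-in feed exports write the CSV via the FEEDS setting (Scrapy 2.1+), instead of opening files by hand inside each callback. The spider name, domain, keywords, and callback name below are taken from the question; everything else is an assumption.

from scrapy.http import TextResponse
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class GraphSpider(CrawlSpider):
    name = 'agavecovidbot'
    allowed_domains = ["domain1.com"]
    start_urls = ["https://domain1.com/specifiedpage1"]
    rules = [Rule(LinkExtractor(), follow=True, callback="parse_item")]
    custom_settings = {
        'DEPTH_LIMIT': 2,
        # Scrapy writes every yielded dict to this CSV file.
        'FEEDS': {'url-to-key.csv': {'format': 'csv'}},
    }

    def parse_item(self, response):
        # Only text responses expose .text; skip binary downloads.
        if not isinstance(response, TextResponse):
            return
        for word in ["key1", "key2"]:
            if word in response.text:
                # One item per keyword that appears on the page;
                # the feed exporter handles all file I/O.
                yield {"keyword": word, "url": response.url}

Running scrapy crawl agavecovidbot with a setup like this would produce url-to-key.csv with one keyword/url row per page hit, so any rows still missing would point at the crawl itself rather than at the file handling.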
