Problem description
I have a Scrapy web crawler that is meant to follow links containing one of a set of keywords. I believe the crawl itself is working, but my output CSV file is missing many links and I can't tell why. The spider is below.
import os
import re

from scrapy.item import Item
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


def find_all_substrings(string, sub):
    # Positions of every (non-overlapping) occurrence of sub in string
    return [match.start() for match in re.finditer(re.escape(sub), string)]


class GraphSpider(CrawlSpider):
    name = 'agavecovidbot'
    custom_settings = {
        'DEPTH_LIMIT': '2',
    }
    allowed_domains = ["domain1.com"]  # must be quoted strings
    start_urls = [
        # start URLs need a scheme, or Scrapy raises ValueError
        "https://domain1.com/specifiedpage1",
        "https://domain1.com/specifiedpage2",
    ]
    rules = [Rule(LinkExtractor(), follow=True, callback="parse_item")]

    def parse_item(self, response):
        print("*********************************************************stop")
        yield self.check_buzzwords(response)
        yield self.fetch_links(response)
        yield self.response_downloaded(response)

    def check_buzzwords(self, response):
        wordlist = ["key1", "key2"]
        url = response.url
        # headers.get returns bytes, so the default must be bytes too
        contenttype = response.headers.get("content-type", b"").decode('utf-8').lower()
        data = response.body.decode('utf-8')
        for word in wordlist:
            for pos in find_all_substrings(data, word):
                append_write = 'a' if os.path.exists('1url-to-key.csv') else 'w'
                with open('1url-to-key.csv', append_write) as url_f:
                    url_f.write(word + "&,&" + url + "\n")
        return Item()

    def fetch_links(self, response):
        links = LinkExtractor(canonicalize=True, unique=True).extract_links(response)
        for link in links:
            append_write = 'a' if os.path.exists('1url-to-url.csv') else 'w'
            with open('1url-to-url.csv', append_write) as url_f:
                url_f.write(response.url + "&,&" + link.url + "\n")
        return Item()

    def _requests_to_follow(self, response):
        if getattr(response, "encoding", None) is not None:
            return CrawlSpider._requests_to_follow(self, response)
        return []

    def response_downloaded(self, response):
        filename = response.url.split("/")[-1] + '.html'
        with open(filename, 'wb') as f:
            f.write(response.body)
        rule = self._rules[response.meta['rule']]  # meta, not Meta
        return self._parse_response(response, rule.callback, rule.cb_kwargs, rule.follow)
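One likely contributor to rows going missing is the ad-hoc `"&,&"` separator, which most CSV tools will not parse back cleanly. Below is a minimal standalone sketch of the keyword search plus the stdlib `csv` writer, which quotes fields for you (the demo filename, keyword, and example URL here are made up for illustration, not taken from the spider):

```python
import csv
import re

def find_all_substrings(string, sub):
    # re.finditer yields non-overlapping matches of the escaped keyword
    return [m.start() for m in re.finditer(re.escape(sub), string)]

positions = find_all_substrings("covid and covid-19", "covid")
print(positions)  # [0, 10]

# csv.writer quotes fields that contain commas, so a plain comma
# delimiter stays unambiguous when the file is read back
with open("url-to-key-demo.csv", "w", newline="") as f:
    writer = csv.writer(f)
    for _ in positions:
        writer.writerow(["covid", "https://example.com/page?a=1,b=2"])

with open("url-to-key-demo.csv", newline="") as f:
    rows = list(csv.reader(f))
print(rows[0])  # ['covid', 'https://example.com/page?a=1,b=2']
```

Opening the file once in append mode per keyword hit also avoids the exists-check dance in the spider, since `'a'` creates the file if it is missing.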
Any suggestions on how to fix the url-to-key CSV file so it contains all the combinations I expect to see? Thanks!