How do I iterate over a list of URLs to scrape data in Scrapy?

Problem description

import scrapy
class oneplus_spider(scrapy.Spider):
    name='one_plus'
    page_number=0
    start_urls=[
        'https://www.amazon.com/s?k=samsung+mobile&page=3&qid=1600763713&ref=sr_pg_3'
    ]
     
    def parse(self,response):
        all_links=[]
        total_links=[]
        domain='https://www.amazon.com'
        href=[]
        link_set=set()
        
        href=response.css('a.a-link-normal.a-text-normal').xpath('@href').extract()
        for x in href:
            link_set.add(domain+x)
        for x in link_set:
            next_page=x
            yield response.follow(next_page,callback=self.parse_page1)


    def parse_page1(self,response):
        title=response.css('span.a-size-large product-title-word-break::text').extract()
        print(title)

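When I run this code I get an error: "(failed 2 times): 503 Service Unavailable". I have tried many approaches, but none of them worked. Please help. Thanks in advance!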

Solution

First, check the URL with curl, for example:

curl -I "https://www.amazon.com/s?k=samsung+mobile&page=3&qid=1600763713&ref=sr_pg_3"

You will then see a 503 response:

HTTP/2 503

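In other words, the server rejects the request as it is sent, so the problem lies in the request itself rather than in the spider's parsing logic.

You need to find a request that the server accepts.

Chrome DevTools can help here: open the Network tab, load the page in the browser, and compare the headers the browser sends with the ones your request sends.

I think a browser-like User-Agent header is required: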

curl 'https://www.amazon.com/s?k=samsung+mobile&page=3&qid=1600763713&ref=sr_pg_3' \
  -H 'user-agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_6) AppleWebKit/537.36 (KHTML,like Gecko) Chrome/85.0.4183.102 Safari/537.36' \
   --compressed
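The same check can be sketched in Python with only the standard library. This is a minimal sketch, not part of the original answer: the helper names `build_request` and `fetch_status` are made up for illustration, and whether the site answers 200 or 503 for a given User-Agent is something to verify by running it, not something the code guarantees.

```python
import urllib.error
import urllib.request

# Browser-like User-Agent copied from the curl command above.
BROWSER_UA = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_6) AppleWebKit/537.36 "
    "(KHTML,like Gecko) Chrome/85.0.4183.102 Safari/537.36"
)


def build_request(url, user_agent=None):
    """Build a GET request, optionally with an explicit User-Agent header.

    Hypothetical helper for illustration. Without user_agent, urllib
    falls back to its own default "Python-urllib/x.y" value when sending.
    """
    headers = {"User-Agent": user_agent} if user_agent else {}
    return urllib.request.Request(url, headers=headers)


def fetch_status(request, timeout=10):
    """Send the request and return the HTTP status code.

    urlopen raises HTTPError for 4xx/5xx responses, so the code is read
    from the exception in that case (this is where a 503 would show up).
    """
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        return error.code
```

Calling `fetch_status(build_request(url))` and `fetch_status(build_request(url, BROWSER_UA))` with the search URL from the question lets you compare the two status codes side by side.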

So something like this may work:

import scrapy


class oneplus_spider(scrapy.Spider):
    name = 'one_plus'
    # Scrapy sends this value as the User-Agent header for this spider's requests.
    user_agent = "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML,like Gecko) Chrome/44.0.2403.157 Safari/537.36"
    start_urls = [
        'https://www.amazon.com/s?k=samsung+mobile&page=3&qid=1600763713&ref=sr_pg_3'
    ]

    def parse(self, response):
        # Collect the product links from the search results page.
        href = response.css('a.a-link-normal.a-text-normal').xpath('@href').extract()
        # A set drops duplicate links; urljoin resolves relative hrefs
        # against the URL of the current page.
        link_set = {response.urljoin(x) for x in href}
        for next_page in link_set:
            yield response.follow(next_page, callback=self.parse_page1)

    def parse_page1(self, response):
        # Both class names sit on the same span, so they are chained with a dot.
        # With a space, the selector would look for a nested
        # <product-title-word-break> element and match nothing.
        title = response.css('span.a-size-large.product-title-word-break::text').extract()
        print(title)
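As for iterating over a list of URLs: Scrapy requests every entry in `start_urls`, so one option is to generate that list up front. Below is a minimal sketch using only the standard library. The helper names `build_search_urls` and `absolute_links` are hypothetical, and the sketch assumes the `qid` and `ref` query parameters from the original URL can be left out, which I have not verified.

```python
from urllib.parse import urlencode, urljoin

BASE_URL = "https://www.amazon.com"


def build_search_urls(keyword, first_page, last_page):
    """Return one search URL per page number in [first_page, last_page].

    Assumption: only the "k" and "page" parameters are needed.
    """
    urls = []
    for page in range(first_page, last_page + 1):
        # urlencode escapes the space in the keyword as "+".
        query = urlencode({"k": keyword, "page": page})
        urls.append(f"{BASE_URL}/s?{query}")
    return urls


def absolute_links(hrefs):
    """Resolve hrefs against BASE_URL, dropping duplicates but keeping order."""
    seen = set()
    result = []
    for href in hrefs:
        url = urljoin(BASE_URL, href)
        if url not in seen:
            seen.add(url)
            result.append(url)
    return result
```

In the spider, `start_urls = build_search_urls("samsung mobile", 1, 3)` would then queue three search pages, each handled by `parse`. Unlike a plain set, `absolute_links` keeps the links in page order, which makes the crawl output easier to follow.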
