Scrapy request not going through

Problem description

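I'm not sure how to describe this problem precisely. I'm a beginner at web scraping, and I'm trying to scrape a website with Python Scrapy. The site is dynamic and relies on JavaScript, so basic XPath and CSS selectors don't retrieve any data.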

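I tried to mimic the site's API request from my spider by requesting the URL that returns the data as a JSON object. That request fails with the error "HTTP status code is not handled or not allowed". I think I'm calling the wrong URL. Calling the JSON URL directly like this has worked for me 9 times out of 10. What can I do differently? The request shows parameters and form data items in its headers section, and the URL doesn't even look like a valid website URL: it starts with https://ih3kc909gb-dsn.algolia.net/1/indexes... I know this is a long question, but I could really use some help with it.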

Solution

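Use the start_requests() method instead of the start_urls attribute. URLs listed in start_urls are fetched with GET requests, whereas this endpoint expects a POST request. The Scrapy documentation covers start_requests() in more detail. After that, all you need to do is send a POST request with the same headers and body that the browser sends.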

Code

import scrapy

class carswitch(scrapy.Spider):
    name = 'car'

    headers = {
        "Connection": "keep-alive",
        "Pragma": "no-cache",
        "Cache-Control": "no-cache",
        "sec-ch-ua": "\" Not;A Brand\";v=\"99\",\"Google Chrome\";v=\"91\",\"Chromium\";v=\"91\"",
        "accept": "application/json",
        "sec-ch-ua-mobile": "?0",
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML,like Gecko) Chrome/91.0.4472.114 Safari/537.36",
        "content-type": "application/x-www-form-urlencoded",
        "Origin": "https://carswitch.com",
        "Sec-Fetch-Site": "cross-site",
        "Sec-Fetch-Mode": "cors",
        "Sec-Fetch-Dest": "empty",
        "Referer": "https://carswitch.com/",
        "Accept-Language": "en-US,en;q=0.9",
    }

    body = '{"params":"query=&hitsPerPage=24&page=0&numericFilters=%5B%22country_id%3D1%22%2C%22used_car%20%3D%201%22%5D&facetFilters=&typoTolerance=&tagFilters=%5B%5D&attributesToHighlight=%5B%5D&attributesToRetrieve=%5B%22make%22%2C%22make_ar%22%2C%22model%22%2C%22model_ar%22%2C%22year%22%2C%22trim%22%2C%22displayTrim%22%2C%22colorPaint%22%2C%22bodyType%22%2C%22salePrice%22%2C%22transmissionType%22%2C%22GPS%22%2C%22carID%22%2C%22inspectionID%22%2C%22inspectionStatus%22%2C%22rate%22%2C%22certified_dealer_id%22%2C%22dealer_category%22%2C%22used_car%22%2C%22new%22%2C%22top_condition%22%2C%22featured%22%2C%22photo%22%2C%22modifiedPlace%22%2C%22city%22%2C%22mileage%22%2C%22urgent_sales%22%2C%22price_dropped%22%2C%22urgent_sales_days%22%2C%22urgent_sales_end_date%22%2C%22date%22%2C%22negotiable%22%2C%22oldPrice%22%2C%22zero_downpayment%22%2C%22cashOnly%22%2C%22hasPriceGuidance%22%2C%22dealerOffer%22%2C%22maxPrice%22%2C%22fairPrice%22%2C%22pricey_deal%22%2C%22fair_deal%22%2C%22good_deal%22%2C%22great_deal%22%2C%22dealership_info%22%2C%22logo_small%22%2C%22GCCspecs%22%2C%22country%22%2C%22export%22%2C%22monthly_price%22%5D"}'

    def start_requests(self):
        url = 'https://ih3kc909gb-dsn.algolia.net/1/indexes/All_Carswitch_Cars/query?x-algolia-agent=Algolia%20for%20JavaScript%20(3.33.0)%3B%20Browser&x-algolia-application-id=IH3KC909GB&x-algolia-api-key=493a9bbc57331df3b278fa39c1dd8f2d'    

        # Request is referenced through the scrapy module, which is the only import above.
        yield scrapy.Request(url=url, method='POST', headers=self.headers, body=self.body, callback=self.parse)


    def parse(self, response):
        # Print the raw JSON payload returned by the endpoint.
        print(response.body)
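
The body above is a hand-encoded string, which makes it awkward to request any page other than page 0. As a rough sketch only, the same kind of body could be built with the standard library, and the JSON reply could be unpacked in a small helper. The names build_body and extract_hits are hypothetical, the sketch sends only a few of the original parameters (it is an assumption that the endpoint accepts this reduced set), and the "hits", "page" and "nbPages" keys are assumed to be present in the reply.

```python
import json
from urllib.parse import quote, urlencode


def build_body(page, hits_per_page=24):
    """Build the JSON request body for one results page (hypothetical helper)."""
    params = {
        "query": "",
        "hitsPerPage": hits_per_page,
        "page": page,
        # List-valued parameters are sent as compact JSON-encoded strings.
        "numericFilters": json.dumps(
            ["country_id=1", "used_car = 1"], separators=(",", ":")
        ),
    }
    # quote_via=quote encodes spaces as %20, as in the original body, not '+'.
    return json.dumps({"params": urlencode(params, quote_via=quote)})


def extract_hits(response_text):
    """Return (hits, has_next_page) from a JSON reply (hypothetical helper).

    Assumes the reply carries "hits", "page" and "nbPages" keys.
    """
    data = json.loads(response_text)
    hits = data.get("hits", [])
    has_next_page = data.get("page", 0) + 1 < data.get("nbPages", 0)
    return hits, has_next_page
```

Inside the spider, body=build_body(0) could stand in for self.body, and parse() could call extract_hits(response.text) and yield a follow-up POST request for the next page while has_next_page is true.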