How to call back to the same URL multiple times when it uses a URL fragment

Problem description

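I am currently developing an application with Scrapy and Scrapy-Splash.

I want to write scraping code for a website that builds its pages dynamically with JavaScript. However, when I request the same page repeatedly to fetch its data, it does not work as expected...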

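The site I want to scrape behaves as follows:

The page data I want is created dynamically by JavaScript.
When a pagination button is pressed, its onclick handler is called and the next page's content is created. The displayed data is updated, but the URL stays the same as on the previous page.
The URL of the page showing the updated data uses a URL fragment (https://sample.com/category/info.php?parent=1&child=1#).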
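
As a side note on the last point: the fragment (the part after `#`) is handled only by the browser and is not sent to the server, so the URL above and the same URL without `#` refer to the same resource. The short sketch below uses Python's standard `urllib.parse.urldefrag` to show the split; the variable names are only illustrative.

```python
from urllib.parse import urldefrag

page_url = "https://sample.com/category/info.php?parent=1&child=1#"

# urldefrag() separates the part of the URL that is sent to the server
# from the fragment, which only the browser sees.
base_url, fragment = urldefrag(page_url)

print(base_url)        # the URL without the trailing '#'
print(repr(fragment))  # the fragment itself, empty in this case
```

Every pagination state therefore shares one URL, which is presumably why the spider further down passes `dont_filter=True` when it requests the page again.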

Current problem

When I start scraping, the console output looks like this:

From page 2 onward, every request returns the data of page 2:

Active_No -> 1

Active_No -> 2

...

The code is as follows:

HTML

<ul class="a-pagination__list">
        <li>
            <a class="a-pagination--active" current="1" href="#" onclick="getCategoryList( $('#parent').val(),$('#child').val(),1,$('#sorter').val());">
            1
            </a>
        </li>
        <li>
            <a class="a-pagination" current="2" href="#" onclick="getCategoryList( $('#parent').val(),2,$('#sorter').val());">
            2
            </a>
        </li>
        <li>
            <a class="a-pagination" current="3" href="#" onclick="getCategoryList( $('#parent').val(),3,$('#sorter').val());">
            3
            </a>
        </li>
        <li>
            <a class="a-pagination" current="4" href="#" onclick="getCategoryList( $('#parent').val(),4,$('#sorter').val());">
            4
            </a>
        </li>
        <li>
            <a class="a-pagination" current="5" href="#" onclick="getCategoryList( $('#parent').val(),$('#child').val(),5,$('#sorter').val());">
            5
            </a>
        </li>
</ul>
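
To decide whether there is a next page at all, the `current` attribute in this markup can be read instead of the link text, which carries surrounding whitespace. The sketch below does this with the standard library's `html.parser` on a shortened copy of the list; `PaginationParser` and `next_page` are hypothetical names, and inside the spider a CSS selector on the response would play the same role.

```python
from html.parser import HTMLParser


class PaginationParser(HTMLParser):
    """Collect the `current` attribute of every pagination link."""

    def __init__(self):
        super().__init__()
        self.active = None  # page number of the a-pagination--active link
        self.pages = []     # every page number found in the list

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr = dict(attrs)
        classes = set((attr.get("class") or "").split())
        if not classes & {"a-pagination", "a-pagination--active"}:
            return
        number = attr.get("current")
        if number is None or not number.isdigit():
            return
        self.pages.append(int(number))
        if "a-pagination--active" in classes:
            self.active = int(number)


def next_page(html):
    """Return the page number after the active one, or None on the last page."""
    parser = PaginationParser()
    parser.feed(html)
    if parser.active is None:
        return None
    candidate = parser.active + 1
    return candidate if candidate in parser.pages else None


# Shortened copy of the pagination list, for illustration only.
sample = """
<ul class="a-pagination__list">
  <li><a class="a-pagination--active" current="1" href="#">1</a></li>
  <li><a class="a-pagination" current="2" href="#">2</a></li>
</ul>
"""

print(next_page(sample))
```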

spider/spider.py

import scrapy
from scrapy_splash import SplashRequest


class SplashTestSpider(scrapy.Spider):
    name = 'splash_test'
    allowed_domains = ['sample.com']
    start_urls = ['https://sample.com/category/info.php?parent=1&child=1']

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(url,self.parse,args={'wait': 0.5})

    def parse(self,response):

        current_page_no = response.css('.a-pagination--active::text').get()
        print('Active_No ->' + str(current_page_no))

        script = """function main(splash)
        assert(splash:go(splash.args.url))
        splash:wait(1)
        local active = splash:select('.a-pagination--active')
        local active_no = active:text()
        local next_no = active_no + 1
        local button = splash:select('.a-pagination[current="'..next_no..'"]')
        button:click()
        splash:wait(1)
        return {url = splash:url(), html = splash:html()}
        end"""

        yield SplashRequest(url=response.url,callback=self.parse,endpoint='execute',args={'lua_source': script},dont_filter=True)

macOS Catalina 10.15.7

Python 3.8.5

Scrapy 2.5.0

scrapy-splash 0.7.2
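
One possible reading of the console output, offered as an assumption rather than a confirmed diagnosis: every `execute` request starts again with `splash:go`, so the browser reloads page 1, clicks once, and always ends on page 2. A minimal sketch of one way around this is to tell the Lua script which page to reach and let it click forward until it gets there. The names `LUA_GOTO_PAGE`, `splash_args_for_page`, and the `target_page` argument are hypothetical. The selectors are copied from the HTML above, and the sketch has not been run against the real site.

```python
# Lua script for Splash's `execute` endpoint. It reloads the URL, then clicks
# the pagination links 2, 3, ... in order until `target_page` is reached.
LUA_GOTO_PAGE = """
function main(splash, args)
    assert(splash:go(args.url))
    assert(splash:wait(args.wait))
    for page = 2, args.target_page do
        local button = splash:select('.a-pagination[current="' .. page .. '"]')
        if not button then break end
        button:click()
        assert(splash:wait(args.wait))
    end
    return {url = splash:url(), html = splash:html()}
end
"""


def splash_args_for_page(target_page, wait=1.0):
    """Build the `args` dict for a request that should end on `target_page`."""
    if target_page < 1:
        raise ValueError("target_page must be >= 1")
    return {"lua_source": LUA_GOTO_PAGE, "target_page": target_page, "wait": wait}


args = splash_args_for_page(3)
print(args["target_page"], args["wait"])

# Inside parse(), the dict would replace the fixed script, for example:
#   active = response.css('.a-pagination--active::attr(current)').get()
#   next_no = int(active) + 1
#   yield SplashRequest(response.url, self.parse, endpoint='execute',
#                       args=splash_args_for_page(next_no), dont_filter=True)
```

Clicking through from page 1 on every request is slow for long lists. Calling the page's own `getCategoryList` function from the script might be quicker, but its arguments are not fully visible in the snippet above, so the click-through version makes fewer assumptions.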

python scrapy scrapy-splash