Getting the current page and the last page from a URL with jsoup

Problem description

I'm trying to get the current page and the last page. Can anyone help me?

My code:

WITH categories_and_subcategories AS (
   SELECT id FROM category 
   WHERE id = 1 
   UNION ALL 
   SELECT c.id 
   FROM category c 
   INNER JOIN categories_and_subcategories cs 
         ON c.parentid = cs.id),filtered_products AS (
   SELECT p.id,p.name,p.catid,p.brandid 
   FROM products p
   INNER JOIN categories_and_subcategories c
         ON p.catid = c.id
   )
SELECT b.id,b.logo,b.brand,count(p.id) total
FROM brand b
LEFT JOIN filtered_products p ON p.brandid = b.id
GROUP BY b.id,b.brand

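For the last page, this code gives me "0". I've attached a screenshot of the HTML. As you can see, the current page is in the element with the "page active" class and its value is "1", while the last page has the value "16".

Thank you all!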

Solution

Useful: https://jsoup.org/cookbook/extracting-data/selector-syntax (further down the page)

I took the liberty of using a simpler string as the HTML input, but you can do it with something like this:


import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class PageExtraction {

    public static void main(String... args) {
        // Inline HTML standing in for the real page markup.
        String html = "<ul>"
                + "<li class=\"prev\">"
                +   "<a href=\"#\">&lt;</a>"
                + "</li>"
                + "<li class=\"page active\">"
                +   "<a href=\"#\">1</a>"
                + "</li>"
                + "<li class=\"page\">2</li>"
                + "<li class=\"page\">3</li>"
                + "<li class=\"page\">4</li>"
                + "<li class=\"page\">5</li>"
                + "<li class=\"page\">6</li>"
                + "<li class=\"page\">...</li>"
                + "<li class=\"page\">"
                +   "<a href=\"#\">16</a>"
                + "</li>"
                + "</ul>";

        Document doc = Jsoup.parse(html);

        // Current page: the link inside the <li> that has both the "page" and "active" classes.
        String activePage = doc.select("li.page.active a").text();

        // Last page: the final link among the list items with class "page".
        Element lastLink = doc.select("li.page a").last(); // null when nothing matches
        String lastPage = lastLink != null ? lastLink.text() : "";

        System.out.println("Active page: " + activePage);
        System.out.println("Last page: " + lastPage);

        // TODO what if last page > some max threshold?
    }
}
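
The selectors above return the page numbers as text, and a pagination item such as "..." is not a number. One possible way to turn that text into integers, and to cap the last page as the TODO hints, is sketched below. The class name `PageNumbers`, the helper names, and the `maxPages` limit are all hypothetical and not part of jsoup; the sketch uses only the standard library.

```java
final class PageNumbers {

    // Parses pagination text; returns fallback for null, blank or non-numeric text such as "...".
    static int parsePage(String text, int fallback) {
        if (text == null) {
            return fallback;
        }
        try {
            return Integer.parseInt(text.trim());
        } catch (NumberFormatException e) {
            return fallback;
        }
    }

    // Caps the last page at maxPages (an assumed limit) and never returns less than 1.
    static int cappedLastPage(String lastPageText, int maxPages) {
        int last = parsePage(lastPageText, 1);
        return Math.max(1, Math.min(last, maxPages));
    }

    public static void main(String[] args) {
        System.out.println(parsePage("1", -1));
        System.out.println(cappedLastPage("16", 10));
        System.out.println(cappedLastPage("...", 10));
    }
}
```

In the example above you would pass `activePage` and `lastPage` to these helpers. Whether to cap the value or to reject it outright depends on how the crawl is meant to behave.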
