HtmlUnit WebClient method getPage always returns the same page

Problem description

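I'm currently working on a solution for scraping data from a specific website (using CSS selectors to pull a list of prices out of an HTML table on the site). I chose the HtmlUnit library for this because it supports a lot of features. After finishing the code and testing it against a single page (with the same search parameters), I thought I was done, but everything changed once I started multiple threads for multiple pages. The problem is that the line below always returns the same old page for every thread, and I don't understand this behavior at all: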

page = client.getPage(webPageURL); // always returns the same old page source

I'm using the same website each time; I only change some of the search parameters.

Here are some parts of my code:

final WebClient client = new WebClient(BrowserVersion.CHROME);
client.getOptions().setCssEnabled(false);
client.getOptions().setJavaScriptEnabled(true);
client.setAjaxController(new com.gargoylesoftware.htmlunit.NicelyResynchronizingAjaxController());
client.getOptions().setThrowExceptionOnFailingStatusCode(false);
client.getOptions().setThrowExceptionOnScriptError(false);

// ask servers and intermediaries not to serve cached responses
client.addRequestHeader("Cache-Control", "no-cache,no-store,must-revalidate");
client.addRequestHeader("Pragma", "no-cache");
client.addRequestHeader("Expires", "0");

// empty HtmlUnit's own cache and set its size to zero
client.getCache().clear();
client.getCache().clearOutdated();
client.getCache().setMaxSize(0);

// enable sessions
client.getCookieManager().setCookiesEnabled(true);

Where the page is retrieved:

HtmlPage page = null;

try {
    //client.closeAllWindows();
    page = client.getPage(webPageURL);
    WebResponse response = page.getWebResponse();
    pageAsString = response.getContentAsString();

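As you may have noticed, I tried hard to disable caching (because I assumed that was the problem, right?) and did some debugging by printing the page as a string (pageAsString). No matter how many times I changed the search parameters in the page URL, nothing changed. I always get the same old page.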
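When comparing output from several threads, printing whole pages gets hard to read. A minimal sketch of an alternative, assuming the page source is already available as a string (as with pageAsString above): log a short SHA-256 fingerprint per URL, so identical responses stand out at a glance. The class and method names here are hypothetical and are not part of HtmlUnit.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical debugging helper, not part of HtmlUnit.
class PageFingerprint {

    /** Returns the SHA-256 digest of the page source as lowercase hex. */
    static String sha256Hex(String pageSource) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            byte[] hash = digest.digest(pageSource.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(hash.length * 2);
            for (byte b : hash) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required to be available on every Java platform
            throw new IllegalStateException(e);
        }
    }

    /** True when both page sources are byte-for-byte identical in UTF-8. */
    static boolean sameContent(String first, String second) {
        return sha256Hex(first).equals(sha256Hex(second));
    }

    public static void main(String[] args) {
        // Placeholder strings standing in for two fetched pages
        String pageA = "<html><td>10.00</td></html>";
        String pageB = "<html><td>12.50</td></html>";
        System.out.println("A: " + sha256Hex(pageA));
        System.out.println("B: " + sha256Hex(pageB));
        System.out.println("identical: " + sameContent(pageA, pageB));
    }
}
```

If two different search URLs produce the same fingerprint, the responses really are identical, which points at something between the client and the site rather than at the parsing code.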

I also tried removing all jobs and cleaning up the page. Needless to say, none of it worked:

finally {
    client.getCurrentWindow().getJobManager().removeAllJobs();
    page.cleanUp();
    client.close();
    client.getCurrentWindow().getJobManager().shutdown();
    //client.closeAllWindows();
    //System.gc();
}

Do you have any idea how I messed up the code so that I always get the same cached page?

Thanks in advance, n23

Solution

This appears to be a proxy problem unrelated to HtmlUnit (see https://github.com/HtmlUnit/htmlunit/issues/327 for more details).
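One possible way to check whether an intermediary is serving cached responses is to append a unique throwaway query parameter to each request URL and see whether the returned page changes. The sketch below is only an illustration under assumptions: the parameter name _cb and the class CacheBuster are hypothetical, it assumes the target site ignores unknown query parameters, and it assumes the intermediary keys its cache on the full URL.

```java
import java.util.UUID;

// Hypothetical diagnostic helper, not part of HtmlUnit.
class CacheBuster {

    /** Appends "_cb=<token>" to the query string, keeping any fragment at the end. */
    static String withCacheBuster(String url, String token) {
        int hashIndex = url.indexOf('#');
        String base = hashIndex >= 0 ? url.substring(0, hashIndex) : url;
        String fragment = hashIndex >= 0 ? url.substring(hashIndex) : "";
        // Use '&' when the URL already has a query string, '?' otherwise
        String separator = base.contains("?") ? "&" : "?";
        return base + separator + "_cb=" + token + fragment;
    }

    /** Convenience overload that generates a random token for each call. */
    static String withCacheBuster(String url) {
        return withCacheBuster(url, UUID.randomUUID().toString());
    }

    public static void main(String[] args) {
        // example.com is a placeholder, not the site from the question
        System.out.println(withCacheBuster("http://example.com/search?q=1"));
        System.out.println(withCacheBuster("http://example.com/search"));
    }
}
```

The resulting string could be passed to client.getPage in place of webPageURL. If the page content starts to differ once the parameter is added, that suggests a cache somewhere between the client and the server; if nothing changes, the cause lies elsewhere.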