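Problem description
Overview
I am working on a project that scrapes the movies now playing from my local theater's website. My goal is to eventually embed this information (movie title, description, and so on) as JSON in an email sent every morning, so we know what is playing without having to visit the site or download the app.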
Base URL for this project: https://www.landmarktheatres.com/albany-ny/spectrum-8-theatres
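Problem
Using HtmlUnit, I have successfully extracted the movie titles from the base URL. However, the results also include upcoming movies, which are served in the same base URL HTML.
I need help targeting the correct HTML. My current code uses a list of HtmlElement: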
List<HtmlElement> itemList = page.getByXPath("//li[@class='gridCol-s-12 gridCol-m-4 gridCol-l-4']");
String title = ((HtmlElement) htmlItem.getFirstByXPath(".//div[@class='filmItemcopy']")).asText();
String titleOnly = title.substring(0,title.indexOf("\n"));
I have been inspecting the HTML and know that I need to target:
<section class="gridRow section content">
<div class="navTabs">
<div class="navTabItem active" data-tab-item="#showing">
To do that, I am fairly sure I need to change my List<HtmlElement> to reflect this, but I have not been able to get it to work. I tried the following without success:
List<HtmlElement> itemList = page.getByXPath("//div[@class='navTabItem active']");
Expected output
{"title":"FOUR GOOD DAYS"}
{"title":"LIMBO"}
{"title":"DEMON SLAYER THE MOVIE: MUGEN TRAIN (SUBTITLED)"}
{"title":"DEMON SLAYER THE MOVIE: MUGEN TRAIN (dubbed)"}
{"title":"STREET GANG: HOW WE GOT TO SESAME STREET"}
{"title":"TOGETHER TOGETHER"}
{"title":"NOMADLAND"}
{"title":"THE TRUFFLE HUNTERS"}
{"title":"THE FATHER"}
Current output
{"title":"FOUR GOOD DAYS"}
{"title":"LIMBO"}
{"title":"DEMON SLAYER THE MOVIE: MUGEN TRAIN (SUBTITLED)"}
{"title":"DEMON SLAYER THE MOVIE: MUGEN TRAIN (dubbed)"}
{"title":"STREET GANG: HOW WE GOT TO SESAME STREET"}
{"title":"TOGETHER TOGETHER"}
{"title":"NOMADLAND"}
{"title":"THE TRUFFLE HUNTERS"}
{"title":"THE FATHER"}
{"title":"DREAM HORSE"}
{"title":"FINAL ACCOUNT"}
{"title":"FINDING YOU"}
{"title":"THE DRY"}
{"title":"THE HUMAN FACTOR"}
{"title":"WRATH OF MAN"}
SpectrumFilmItems.java
package org.example;
public class SpectrumFilmItems {
private String title;
public SpectrumFilmItems(String title) {
super();
this.title = title;
}
public String getTitle(){
return title;
}
public void setTitle(String title){
this.title = title;
}
}
SpectrumScraper.java
package org.example;
import java.util.List;
import com.fasterxml.jackson.databind.ObjectMapper;
import com.gargoylesoftware.htmlunit.SilentCssErrorHandler;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlElement;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
public class SpectrumScraper
{
public static void main( String[] args )
{
// GET request to obtain HTML content from the web server.
String baseUrl = "https://www.landmarktheatres.com/albany-ny/spectrum-8-theatres";
WebClient client = new WebClient();
client.setCssErrorHandler(new SilentCssErrorHandler());
client.getOptions().setCssEnabled(false);
client.getOptions().setJavaScriptEnabled(false);
try {
HtmlPage page = client.getPage(baseUrl);
List<HtmlElement> itemList = page.getByXPath("//li[@class='gridCol-s-12 gridCol-m-4 gridCol-l-4']");
if(itemList.isEmpty()){
System.out.println("No item found.");
}else {
for (HtmlElement htmlItem : itemList) {
String title = ((HtmlElement) htmlItem.getFirstByXPath(".//div[@class='filmItemcopy']")).asText();
String titleOnly = title.substring(0,title.indexOf("\n"));
SpectrumFilmItems filmItem = new SpectrumFilmItems(titleOnly);
ObjectMapper mapper = new ObjectMapper();
String jsonString = mapper.writeValueAsString(filmItem);
System.out.println(jsonString);
}
}
}
catch(Exception e) {
e.printStackTrace();
}
}
}
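One fragile spot in the loop above: title.indexOf("\n") returns -1 when the text contains no line break, and substring(0, -1) then throws StringIndexOutOfBoundsException. A minimal sketch of a safer first-line extraction follows. The class name TitleText, the helper firstLine, and the sample strings in main are hypothetical and not part of the original project.

```java
final class TitleText {
    private TitleText() {}

    // Returns the text before the first line break, trimmed.
    // Falls back to the whole text when there is no line break,
    // and to an empty string when the input is null.
    static String firstLine(String text) {
        if (text == null) {
            return "";
        }
        int end = text.indexOf('\n');
        String line = (end >= 0) ? text.substring(0, end) : text;
        return line.trim(); // trim() also drops a trailing '\r'
    }

    public static void main(String[] args) {
        // Hypothetical sample text; the real block content may differ.
        System.out.println(firstLine("LIMBO\nsecond line of the film block"));
        System.out.println(firstLine("NOMADLAND"));
    }
}
```

In the scraper, the substring call could then become firstLine(title), assuming the title is always the first line of the filmItemcopy text.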
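Solution
The consistent difference between movies now showing and movies not yet released is the attributes data-film-session and data-film-exp. Add an entry to the list only if it has one or both of these attributes. This is untested and may not work, but it is a step in the right direction.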
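// Requires: import com.gargoylesoftware.htmlunit.html.DomElement;
for (HtmlElement htmlItem : itemList) {
// getAttribute returns DomElement.ATTRIBUTE_NOT_DEFINED when the attribute is absent.
String dataFilmSession = htmlItem.getAttribute("data-film-session");
if (dataFilmSession.equals(DomElement.ATTRIBUTE_NOT_DEFINED) || dataFilmSession.equals(DomElement.ATTRIBUTE_VALUE_EMPTY)) {
// No session data: treat the entry as not currently showing and skip it.
continue;
}
// your original code
}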
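The same filter can also be written as an XPath predicate, so that only matching entries are selected in the first place. The sketch below is equally untested against the real site. It runs the expression with the JDK's built-in XPath engine on a hypothetical, heavily simplified fragment. The attribute values, their placement on the li element, and the class name NowShowingXPath are all assumptions.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class NowShowingXPath {
    private NowShowingXPath() {}

    // Hypothetical markup: the real page structure and attribute values may differ.
    static final String SAMPLE =
        "<ul>"
        + "<li class='gridCol-s-12 gridCol-m-4 gridCol-l-4' data-film-session='123'>"
        + "<div class='filmItemcopy'>LIMBO</div></li>"
        + "<li class='gridCol-s-12 gridCol-m-4 gridCol-l-4' data-film-exp='456'>"
        + "<div class='filmItemcopy'>NOMADLAND</div></li>"
        + "<li class='gridCol-s-12 gridCol-m-4 gridCol-l-4'>"
        + "<div class='filmItemcopy'>DREAM HORSE</div></li>"
        + "</ul>";

    // Keep only <li> items with a non-empty data-film-session or data-film-exp.
    static final String QUERY =
        "//li[@data-film-session != '' or @data-film-exp != '']"
        + "//div[@class='filmItemcopy']";

    // Evaluates QUERY against the given XML and returns the matched text values.
    static List<String> titles(String xml) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate(
                QUERY, new InputSource(new StringReader(xml)), XPathConstants.NODESET);
            List<String> result = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                result.add(nodes.item(i).getTextContent().trim());
            }
            return result;
        } catch (XPathExpressionException e) {
            throw new IllegalStateException("XPath evaluation failed", e);
        }
    }

    public static void main(String[] args) {
        for (String title : titles(SAMPLE)) {
            System.out.println(title);
        }
    }
}
```

If the attributes really do sit on the li elements, the li part of this expression could probably be passed to page.getByXPath in place of the class-based selector, though that has not been verified against the live page.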