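Problem description
Overview
I'm working on a project that scrapes the movies currently playing from my local theater's website. My goal is eventually to embed this information (movie title, movie description, etc.) as JSON in an email sent every morning, so that we know what's playing without having to visit their website or download their app.
Base URL for this project: https://www.landmarktheatres.com/albany-ny/spectrum-8-theatres
Problem
Using HtmlUnit, I have successfully extracted the movie titles from the base URL. However, the results also include upcoming movies, which are present in the base URL's HTML as well.
I need help targeting the correct HTML. My current code uses a list of HtmlElement: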
List<HtmlElement> itemList = page.getByXPath("//li[@class='gridCol-s-12 gridCol-m-4 gridCol-l-4']");
I then loop over that list to extract the title:
String title = ((HtmlElement) htmlItem.getFirstByXPath(".//div[@class='filmItemCopy']")).asText();
String titleOnly = title.substring(0,title.indexOf("\n"));
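One thing to note about the extraction above: indexOf returns -1 when the text contains no newline, and substring(0, -1) then throws a StringIndexOutOfBoundsException. A minimal sketch of a more defensive first-line helper follows. The helper name firstLine and the sample strings are hypothetical, chosen only for illustration, and the real text returned by asText() may look different.

```java
public class TitleLineSketch {

    // Hypothetical helper: returns the first line of a block of text,
    // or the whole (trimmed) text when there is no newline at all.
    static String firstLine(String text) {
        if (text == null) {
            return "";
        }
        String trimmed = text.trim();
        int newline = trimmed.indexOf('\n');
        // indexOf returns -1 when no newline is present, so guard before substring.
        String line = newline >= 0 ? trimmed.substring(0, newline) : trimmed;
        return line.trim(); // also drops a trailing '\r' from "\r\n" line endings
    }

    public static void main(String[] args) {
        // Sample inputs are made up for illustration.
        System.out.println(firstLine("LIMBO\nR | 1h 44m"));
        System.out.println(firstLine("NOMADLAND"));
    }
}
```

In the loop, titleOnly could then be computed as firstLine(title) instead of calling substring directly.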
I've been inspecting the HTML and know that I need to target:
<section class="gridRow section content">
<div class="navTabs">
<div class="navTabItem active" data-tab-item="#showing">
To do this, I'm fairly sure I need to change my List<HtmlElement> to reflect it, but I just can't get it to work. I tried the following to no avail:
List<HtmlElement> itemList = page.getByXPath("//div[@class='navTabItem active']");
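A possible reason that attempt returns nothing useful: the navTabItem div looks like the tab label only, and its data-tab-item="#showing" attribute suggests the films live in a separate element with id="showing". Under that assumption, scoping the original li query to that container may be what is needed. The sketch below demonstrates the idea with the JDK's built-in XPath engine on a simplified, hypothetical fragment. The id values showing and comingSoon and the nesting are guesses, not the site's verified markup.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class ShowingXPathSketch {

    // Hypothetical, simplified markup. Assumes the tab's data-tab-item="#showing"
    // points at a container with id="showing" and that upcoming films sit in a sibling.
    static final String SAMPLE =
          "<section class='gridRow section content'>"
        + "<div class='navTabs'>"
        + "<div class='navTabItem active' data-tab-item='#showing'>Now Showing</div>"
        + "<div class='navTabItem' data-tab-item='#comingSoon'>Coming Soon</div>"
        + "</div>"
        + "<div id='showing'><ul>"
        + "<li class='gridCol-s-12 gridCol-m-4 gridCol-l-4'><div class='filmItemCopy'>LIMBO</div></li>"
        + "<li class='gridCol-s-12 gridCol-m-4 gridCol-l-4'><div class='filmItemCopy'>NOMADLAND</div></li>"
        + "</ul></div>"
        + "<div id='comingSoon'><ul>"
        + "<li class='gridCol-s-12 gridCol-m-4 gridCol-l-4'><div class='filmItemCopy'>DREAM HORSE</div></li>"
        + "</ul></div>"
        + "</section>";

    // The original li query, restricted to descendants of the assumed #showing container.
    static final String SCOPED =
        "//*[@id='showing']//li[@class='gridCol-s-12 gridCol-m-4 gridCol-l-4']"
        + "//div[@class='filmItemCopy']";

    // Evaluates an XPath expression against an XML string and returns the matched text.
    static List<String> select(String xml, String expression) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate(
                expression, new InputSource(new StringReader(xml)), XPathConstants.NODESET);
            List<String> result = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                result.add(nodes.item(i).getTextContent().trim());
            }
            return result;
        } catch (Exception e) {
            throw new IllegalStateException("XPath evaluation failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(select(SAMPLE, SCOPED)); // prints [LIMBO, NOMADLAND]
    }
}
```

If the real page is structured this way, the same idea would carry over to HtmlUnit as page.getByXPath("//*[@id='showing']//li[@class='gridCol-s-12 gridCol-m-4 gridCol-l-4']"), leaving the rest of the loop unchanged. If the container uses a different id or attribute, the first step of the expression would need to match that instead.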
Expected output
{"title":"FOUR GOOD DAYS"}
{"title":"LIMBO"}
{"title":"DEMON SLAYER THE MOVIE: MUGEN TRAIN (SUBTITLED)"}
{"title":"DEMON SLAYER THE MOVIE: MUGEN TRAIN (DUBBED)"}
{"title":"STREET GANG: HOW WE GOT TO SESAME STREET"}
{"title":"TOGETHER TOGETHER"}
{"title":"NOMADLAND"}
{"title":"THE TRUFFLE HUNTERS"}
{"title":"THE FATHER"}
Current output
{"title":"FOUR GOOD DAYS"}
{"title":"LIMBO"}
{"title":"DEMON SLAYER THE MOVIE: MUGEN TRAIN (SUBTITLED)"}
{"title":"DEMON SLAYER THE MOVIE: MUGEN TRAIN (DUBBED)"}
{"title":"STREET GANG: HOW WE GOT TO SESAME STREET"}
{"title":"TOGETHER TOGETHER"}
{"title":"NOMADLAND"}
{"title":"THE TRUFFLE HUNTERS"}
{"title":"THE FATHER"}
{"title":"DREAM HORSE"}
{"title":"FINAL ACCOUNT"}
{"title":"FINDING YOU"}
{"title":"THE DRY"}
{"title":"THE HUMAN FACTOR"}
{"title":"WRATH OF MAN"}
Code
SpectrumFilmItems.java
package org.example;

public class SpectrumFilmItems {
    private String title;

    public SpectrumFilmItems(String title) {
        super();
        this.title = title;
    }

    public String getTitle() {
        return title;
    }

    public void setTitle(String title) {
        this.title = title;
    }
}
SpectrumScraper.java
package org.example;

import java.util.List;

import com.fasterxml.jackson.databind.ObjectMapper;
import com.gargoylesoftware.htmlunit.SilentCssErrorHandler;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlElement;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class SpectrumScraper
{
    public static void main( String[] args )
    {
        // GET request to obtain HTML content from the web server.
        String baseUrl = "https://www.landmarktheatres.com/albany-ny/spectrum-8-theatres";
        WebClient client = new WebClient();
        client.setCssErrorHandler(new SilentCssErrorHandler());
        client.getOptions().setCssEnabled(false);
        client.getOptions().setJavaScriptEnabled(false);
        try {
            HtmlPage page = client.getPage(baseUrl);
            List<HtmlElement> itemList = page.getByXPath("//li[@class='gridCol-s-12 gridCol-m-4 gridCol-l-4']");
            if (itemList.isEmpty()) {
                System.out.println("No item found.");
            } else {
                for (HtmlElement htmlItem : itemList) {
                    String title = ((HtmlElement) htmlItem.getFirstByXPath(".//div[@class='filmItemCopy']")).asText();
                    String titleOnly = title.substring(0, title.indexOf("\n"));
                    SpectrumFilmItems filmItem = new SpectrumFilmItems(titleOnly);
                    ObjectMapper mapper = new ObjectMapper();
                    String jsonString = mapper.writeValueAsString(filmItem);
                    System.out.println(jsonString);
                }
            }
        }
        catch (Exception e) {
            e.printStackTrace();
        }
    }
}