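Getting org.xml.sax.SAXParseException while parsing XML

Problem

I wrote a scheduler in a Java Spring Boot application that runs every hour, and it had been working fine for a month. Today it started throwing an exception while parsing. I suspect the XML I fetch the data from is either corrupted or has changed slightly, but I can't figure out which.

Please note: I cannot change the source data.

Here is my code: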

    @Scheduled(fixedRate = 1 * 60 * 60 * 1000, initialDelay = 10 * 1000)
    public String updateNewsFeed() {

        try {
            DocumentBuilder db = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            String URL = "https://nation.com.pk/RSS/coronavirus";
            Document doc = db.parse(URL);
            List<NewsFeed> newsFeedList = parseNewsItemsToList(doc);

            return "Works fine";

        } catch (Exception ex) {
            return ex.getMessage();
        }
    }

    public List<NewsFeed> parseNewsItemsToList(Document doc) throws Exception {
        doc.getDocumentElement().normalize();
        NodeList nodes = doc.getElementsByTagName("item");
        List<NewsFeed> newsFeedList = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            Element element = (Element) nodes.item(i);

            NodeList title = element.getElementsByTagName("title");
            NodeList link = element.getElementsByTagName("link");
            NodeList description = element.getElementsByTagName("description");
            NodeList pubDate = element.getElementsByTagName("pubDate");
            NodeList guid = element.getElementsByTagName("guid");

            org.jsoup.nodes.Document htmlDoc = Jsoup.connect(link.item(0).getTextContent().trim()).get();
            /*Elements pngs = htmlDoc.select("picture");
            System.out.println("\nimg link:"+pngs.toString());*/

            String image = htmlDoc.select("picture").select("img[src~=(?i)\\.(png|jpe?g)]").attr("src").trim();
            newsFeedList.add(new NewsFeed(
                    title.item(0).getTextContent().trim(),
                    description.item(0).getTextContent().trim(),
                    pubDate.item(0).getTextContent().trim(),
                    guid.item(0).getTextContent().trim(),
                    image,
                    link.item(0).getTextContent().trim()
            ));
        }
        return newsFeedList;
    }

Here is the error message:

    [Fatal error] coronavirus:195:32: The entity name must immediately follow the '&' in the entity reference.
    org.xml.sax.SAXParseException; systemId: https://nation.com.pk/RSS/coronavirus; lineNumber: 195; columnNumber: 32; The entity name must immediately follow the '&' in the entity reference.
        at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(DOMParser.java:258)
        at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(DocumentBuilderImpl.java:339)
        at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:177)
        at com.i2p.covid19.service.NewsFeedService.updateNewsFeed(NewsFeedService.java:87)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.springframework.scheduling.support.ScheduledMethodRunnable.run(ScheduledMethodRunnable.java:84)
        at org.springframework.scheduling.support.DelegatingErrorHandlingRunnable.run(DelegatingErrorHandlingRunnable.java:54)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)

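Solution

The problem is the ampersand (&) character in the XML:

    <category>Lifestyle & Entertainment</category>

A bare & is illegal in an XML document outside of a CDATA section. It must be written as &amp;, but the producer of this XML document did not escape the & character. The parser therefore treats & as the start of an entity reference, finds a space instead of an entity name, and fails at line 195, column 32.

If you replace & with &amp;, it will work.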
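Since the source feed cannot be changed, one way to apply that replacement is to download the feed as text, escape the bare ampersands, and only then hand it to the parser. The following is a minimal sketch rather than a definitive fix: the class name AmpersandFix and its helper methods are hypothetical, the feed is assumed to be UTF-8, and the regex would also rewrite a bare & inside a CDATA section (where it is actually legal), so check that this is acceptable for your feed. References to undefined named entities such as &nbsp; are left untouched and would still fail.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.StringReader;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of the original question or answer.
class AmpersandFix {

    // Replaces every '&' that does not already start an entity reference
    // (named like &amp;, decimal like &#38;, or hex like &#x26;) with "&amp;".
    static String escapeBareAmpersands(String xml) {
        return xml.replaceAll(
                "&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);)",
                "&amp;");
    }

    // Downloads the resource at the given URL as text (assumes UTF-8).
    static String fetch(String url) throws IOException {
        try (InputStream in = new URL(url).openStream();
             ByteArrayOutputStream out = new ByteArrayOutputStream()) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        }
    }

    // Repairs the text first, then parses it into a DOM document.
    static Document parse(String xml) throws Exception {
        DocumentBuilder db = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        return db.parse(new InputSource(new StringReader(escapeBareAmpersands(xml))));
    }

    public static void main(String[] args) throws Exception {
        // Offline sample that reproduces the problem from the feed.
        String broken = "<rss><channel><item>"
                + "<category>Lifestyle & Entertainment</category>"
                + "<title>A &amp; B</title>"
                + "</item></channel></rss>";
        Document doc = parse(broken);
        System.out.println(doc.getElementsByTagName("category").item(0).getTextContent());
        System.out.println(doc.getElementsByTagName("title").item(0).getTextContent());
    }
}
```

In the scheduler from the question, this would amount to replacing db.parse(URL) with something like AmpersandFix.parse(AmpersandFix.fetch(URL)); the rest of parseNewsItemsToList can stay as it is.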

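Using the ROME tools library (https://rometools.github.io/rome/)

If your goal is to process RSS feeds, I would suggest using the ROME library, which takes care of special characters such as & and is simple to use. See https://www.baeldung.com/rome-rss

The snippet below prints the <title> from the International News tag of the RSS feed: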

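import java.net.URL;
import com.rometools.rome.feed.synd.SyndFeed;
import com.rometools.rome.io.SyndFeedInput;
import com.rometools.rome.io.XmlReader;
URL feedSource = new URL("https://nation.com.pk/rss/coronavirus");
SyndFeedInput input = new SyndFeedInput();
SyndFeed feed = input.build(new XmlReader(feedSource));
System.out.println(feed.getTitle());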
