Problem description
I am trying to parse an XML file from an external source that contains invalid UTF-8 bytes, using the Java code below:
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
factory.setValidating(false);
factory.setIgnoringComments(true);
factory.setNamespaceAware(false);
DocumentBuilder documentBuilder = factory.newDocumentBuilder();
try (InputStream in = getMyInputStream()) {
Document doc = documentBuilder.parse(new InputSource(in));
...
}
I get the following exception:
Caused by: org.xml.sax.SAXParseException: Invalid byte 2 of 3-byte UTF-8 sequence.
at java.xml/com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(DOMParser.java:262)
at java.xml/com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(DocumentBuilderImpl.java:339)
... 10 common frames omitted
Caused by: com.sun.org.apache.xerces.internal.impl.io.MalformedByteSequenceException: Invalid byte 2 of 3-byte UTF-8 sequence.
at java.xml/com.sun.org.apache.xerces.internal.impl.io.UTF8Reader.invalidByte(UTF8Reader.java:702)
at java.xml/com.sun.org.apache.xerces.internal.impl.io.UTF8Reader.read(UTF8Reader.java:409)
at java.xml/com.sun.org.apache.xerces.internal.impl.XMLEntityScanner.load(XMLEntityScanner.java:1904)
at java.xml/com.sun.org.apache.xerces.internal.impl.XMLEntityScanner.peekChar(XMLEntityScanner.java:508)
at java.xml/com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(XMLDocumentFragmentScannerImpl.java:2649)
at java.xml/com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:605)
at java.xml/com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:534)
at java.xml/com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:888)
at java.xml/com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:824)
at java.xml/com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:141)
at java.xml/com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(DOMParser.java:246)
I realize the XML contains invalid UTF-8 byte sequences, but I would like the XML parser to handle this gracefully instead of throwing an exception.
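For reference, a minimal reproduction of the failure might look like the sketch below. The class name `InvalidUtf8Repro`, the helper `parseFromBytes`, and the sample bytes are hypothetical stand-ins for the real external source; the bytes `0xE2 0x28` are chosen because `0x28` is not a valid second byte of a 3-byte UTF-8 sequence.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class InvalidUtf8Repro {

    // Hypothetical sample: "<a>x", then 0xE2 0x28 0xA1 (malformed UTF-8), then "y</a>".
    static byte[] sampleBytes() {
        return new byte[] {
            '<', 'a', '>', 'x',
            (byte) 0xE2, (byte) 0x28, (byte) 0xA1,
            'y', '<', '/', 'a', '>'
        };
    }

    // Parses from a byte stream, so the parser decodes the bytes itself.
    // Returns the exception that was raised, or null if parsing succeeded.
    static Exception parseFromBytes(byte[] xml) throws ParserConfigurationException {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        try (InputStream in = new ByteArrayInputStream(xml)) {
            builder.parse(new InputSource(in));
            return null;
        } catch (SAXException | IOException e) {
            return e;
        }
    }

    public static void main(String[] args) throws Exception {
        // With the JDK's built-in parser this is expected to report a malformed UTF-8 sequence.
        System.out.println(parseFromBytes(sampleBytes()));
    }
}
```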
Solution
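I solved this by passing a java.io.Reader to the DocumentBuilder instead of a java.io.InputStream. The DocumentBuilder now works on a character stream rather than a byte stream, so it no longer decodes and validates the bytes itself and no exception is thrown. The byte-to-character conversion is now done by InputStreamReader, which substitutes the replacement character U+FFFD for malformed byte sequences instead of failing.
So I changed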
try (InputStream in = getMyInputStream()) {
Document doc = documentBuilder.parse(new InputSource(in));
...
}
to
try (Reader reader = new InputStreamReader(getMyInputStream(), StandardCharsets.UTF_8)) {
Document doc = documentBuilder.parse(new InputSource(reader));
...
}
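To make the replacement behaviour explicit rather than relying on the defaults of InputStreamReader, the same idea can be sketched with a configured CharsetDecoder. This is only an illustration under assumptions: the class name `LenientXmlDemo`, the helper `parseLeniently`, and the sample bytes are hypothetical and not part of the original code.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class LenientXmlDemo {

    // Decodes the stream as UTF-8, replacing malformed bytes with U+FFFD,
    // and hands the resulting character stream to the XML parser.
    static Document parseLeniently(InputStream in) throws Exception {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        try (Reader reader = new InputStreamReader(in, decoder)) {
            return builder.parse(new InputSource(reader));
        }
    }

    // Hypothetical sample: "<a>x", then 0xE2 0x28 0xA1 (malformed UTF-8), then "y</a>".
    static byte[] sampleBytes() {
        return new byte[] {
            '<', 'a', '>', 'x',
            (byte) 0xE2, (byte) 0x28, (byte) 0xA1,
            'y', '<', '/', 'a', '>'
        };
    }

    public static void main(String[] args) throws Exception {
        Document doc = parseLeniently(new ByteArrayInputStream(sampleBytes()));
        // The malformed bytes should appear as U+FFFD in the element text.
        System.out.println(doc.getDocumentElement().getTextContent());
    }
}
```

One caveat with this approach: when an InputSource carries a character stream, the parser ignores any encoding declaration in the XML prolog, so this only fits input that is actually meant to be UTF-8. The replaced bytes are also lost for good, which may or may not be acceptable depending on the data.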