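I have an XML file in which some symbols are incorrectly encoded because UTF-16 and UTF-8 were mixed.
For example, the 📞 symbol (U+1F4DE) is encoded as the surrogate pair &#55357;&#56542; (&#xD83D;&#xDCDE;) instead of &#128222; (&#x1F4DE;).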
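To make the mismatch concrete, here is a small sketch of how the two UTF-16 code units relate to the single code point. The class and method names (SurrogatePairDemo, combine) are made up for illustration; only the standard Character.toCodePoint call is relied on.

```java
final class SurrogatePairDemo {

    /**
     * Combines a high and a low UTF-16 surrogate (given as the numbers that
     * appear in the two character references) into one Unicode code point.
     */
    static int combine(int high, int low) {
        if (high < 0xD800 || high > 0xDBFF || low < 0xDC00 || low > 0xDFFF) {
            throw new IllegalArgumentException("not a surrogate pair: " + high + ", " + low);
        }
        return Character.toCodePoint((char) high, (char) low);
    }

    public static void main(String[] args) {
        int codePoint = combine(55357, 56542); // 0xD83D, 0xDCDE
        // prints "128222 = U+1F4DE"
        System.out.println(codePoint + " = U+" + Integer.toHexString(codePoint).toUpperCase());
    }
}
```

XML 1.0 does not allow character references to surrogate code points, which is why a parser rejects the pair while it accepts the single reference to 128222.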
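I want to unmarshal this XML file, but the unmarshaller fails when it encounters these incorrect symbols. If I first decode them with StringEscapeUtils#unescapeHtml4 (or StringEscapeUtils#unescapeXml), everything works.
However, I don't want to read the XML into a string, decode it, and only then unmarshal it.
How can I do the same thing during unmarshalling, without first reading the XML file into a string?
I created a simple test to reproduce this: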
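// The imports are assumptions: JAXB under javax.xml.bind (jakarta.xml.bind on newer
// stacks), JUnit 4, and StringEscapeUtils from Apache Commons Text.
import java.io.StringReader;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.bind.JAXBContext;
import javax.xml.bind.Unmarshaller;
import javax.xml.bind.annotation.XmlElement;
import javax.xml.bind.annotation.XmlRootElement;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import org.apache.commons.text.StringEscapeUtils;
import org.junit.Test;

public class XmlReaderTest {
    // Matches two adjacent numeric character references, such as a surrogate pair.
    private static final Pattern HTML_UNICODE_REGEX = Pattern.compile("&#[a-zA-Z0-9]+;&#[a-zA-Z0-9]+;");

    @Test
    public void test() throws Exception {
        final Unmarshaller unmarshaller = JAXBContext.newInstance(Value.class).createUnmarshaller();
        final XMLInputFactory factory = createXmlInputFactory();
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><value><name>&#55357;&#56542; &amp; &#128222; Õ</name></value>";

        // Works: the surrogate-pair references are decoded before parsing.
        XMLEventReader xmlReader = factory.createXMLEventReader(new StringReader(decodeHtmlEntities(xml)));
        Value result = (Value) unmarshaller.unmarshal(xmlReader);
        assert result.name.equals("\uD83D\uDCDE & \uD83D\uDCDE Õ");

        // Fails: the parser rejects the references to surrogate code points.
        XMLEventReader xmlReader2 = factory.createXMLEventReader(new StringReader(xml));
        Value result2 = (Value) unmarshaller.unmarshal(xmlReader2); // throws an exception here
        assert result2.name.equals("\uD83D\uDCDE & \uD83D\uDCDE Õ");
    }

    @XmlRootElement(name = "value")
    private static class Value {
        @XmlElement
        public String name;
    }

    // Decodes only pairs of adjacent numeric references; everything else stays escaped.
    private String decodeHtmlEntities(String readerString) {
        StringBuffer unescapedString = new StringBuffer();
        Matcher regexMatcher = HTML_UNICODE_REGEX.matcher(readerString);
        while (regexMatcher.find()) {
            // quoteReplacement keeps '$' and '\' in the decoded text from being read as group references.
            regexMatcher.appendReplacement(unescapedString,
                    Matcher.quoteReplacement(StringEscapeUtils.unescapeHtml4(regexMatcher.group())));
        }
        regexMatcher.appendTail(unescapedString);
        return unescapedString.toString();
    }

    private XMLInputFactory createXmlInputFactory() {
        XMLInputFactory factory = XMLInputFactory.newFactory();
        factory.setProperty(XMLInputFactory.IS_SUPPORTING_EXTERNAL_ENTITIES, false);
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, false);
        return factory;
    }
}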
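One direction that might avoid the intermediate string is to wrap the input in a Reader that rewrites surrogate-pair references on the fly and hand that Reader to XMLInputFactory#createXMLEventReader. The following is only a sketch under assumptions, not a tested solution for the real file: the class SurrogateReferenceReader, its rewrite helper and the MAX_REF limit are hypothetical names invented here, and the reader rewrites every matching pair it sees, including ones inside comments or CDATA sections where references are not normally interpreted.

```java
import java.io.IOException;
import java.io.PushbackReader;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

/** Hypothetical streaming filter: "&#55357;&#56542;" becomes "&#x1F4DE;". */
final class SurrogateReferenceReader extends Reader {
    private static final int MAX_REF = 12; // longest reference inspected, e.g. "&#x0000D83D;"
    private final PushbackReader source;
    private final StringBuilder pending = new StringBuilder(); // output not yet handed out

    SurrogateReferenceReader(Reader in) {
        this.source = new PushbackReader(in, 2 * MAX_REF);
    }

    @Override
    public int read(char[] buf, int off, int len) throws IOException {
        if (len == 0) return 0;
        int n = 0;
        while (n < len) {
            if (pending.length() == 0) fill();
            if (pending.length() == 0) break; // end of input
            buf[off + n++] = pending.charAt(0);
            pending.deleteCharAt(0);
        }
        return n == 0 ? -1 : n;
    }

    /** Moves the next character, or the next (possibly merged) reference, into pending. */
    private void fill() throws IOException {
        int c = source.read();
        if (c < 0) return;
        if (c != '&') { pending.append((char) c); return; }
        source.unread(c);
        String first = readToken();
        int high = surrogateValue(first);
        if (high < 0 || !Character.isHighSurrogate((char) high)) {
            pending.append(first); // ordinary reference such as &amp; passes through
            return;
        }
        String second = readToken();
        int low = surrogateValue(second);
        if (low >= 0 && Character.isLowSurrogate((char) low)) {
            int codePoint = Character.toCodePoint((char) high, (char) low);
            pending.append("&#x").append(Integer.toHexString(codePoint).toUpperCase()).append(';');
        } else {
            pending.append(first);                // lone surrogate: left untouched
            source.unread(second.toCharArray()); // re-scan what followed it
        }
    }

    /** Reads up to and including ';', stopping early at MAX_REF chars or before another '&'. */
    private String readToken() throws IOException {
        StringBuilder sb = new StringBuilder();
        int c;
        while (sb.length() < MAX_REF && (c = source.read()) >= 0) {
            if (c == '&' && sb.length() > 0) { source.unread(c); break; }
            sb.append((char) c);
            if (c == ';') break;
        }
        return sb.toString();
    }

    /** Returns the referenced value if token is a numeric reference to a surrogate, else -1. */
    private static int surrogateValue(String token) {
        if (!token.startsWith("&#") || !token.endsWith(";")) return -1;
        String body = token.substring(2, token.length() - 1);
        try {
            int v = body.startsWith("x") ? Integer.parseInt(body.substring(1), 16) : Integer.parseInt(body);
            return (v >= 0xD800 && v <= 0xDFFF) ? v : -1;
        } catch (NumberFormatException e) {
            return -1;
        }
    }

    @Override
    public void close() throws IOException {
        source.close();
    }

    /** Convenience for trying the filter on a small string. */
    static String rewrite(String xml) {
        try (Reader r = new SurrogateReferenceReader(new StringReader(xml))) {
            StringBuilder out = new StringBuilder();
            char[] buf = new char[64];
            for (int n; (n = r.read(buf, 0, buf.length)) != -1; ) out.append(buf, 0, n);
            return out.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

In the test above, the second case could then be tried as factory.createXMLEventReader(new SurrogateReferenceReader(new StringReader(xml))); for a real file, the inner reader would be an InputStreamReader opened with the file's actual encoding. Emitting a merged reference rather than the raw character keeps the filter's output plain XML, so the parser still does all the decoding.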