Gracefully handling invalid XML characters when parsing documents
Hi everyone,

I need to process XML documents that aren't entirely valid, because they contain ASCII control characters, such as the vertical tab (&#11;), which the XML 1.0 specification does not allow. The invalid characters themselves are not important to me at all; I'm fine with throwing them away from the input stream and moving on. Other than those characters, the XML documents are valid.

When I try to parse such a document with lxml using its default settings, I unsurprisingly get an error::

    lxml.etree.XMLSyntaxError: xmlParseCharRef: invalid xmlChar value 11, line 6, column 17

To silence this error and try to recover from it, I can use a custom parser with the “recover” option. This gets the job done in the sense that the error is no longer raised, but it has a significant side effect: apparently, once the first invalid XML character is encountered, the parser ignores *all* XML entities in the rest of the document.

Here's a brief code sample that demonstrates the problem::

    from lxml import etree

    broken_xml = """<?xml version="1.0"?>
    <root>
        <child>
            &lt;something&gt; &amp;
        </child>
        <child>&#11;</child>
        <child>
            &lt;something&gt; &amp;
        </child>
    </root>
    """

    recovering_parser = etree.XMLParser(recover=True)
    broken_tree = etree.fromstring(broken_xml, parser=recovering_parser)
    print(etree.tostring(broken_tree, pretty_print=True, encoding="unicode"))

The output I get from this is the following::

    <root>
        <child>
            &lt;something&gt; &amp;
        </child>
        <child/>
        <child>
            something
        </child>
    </root>

I've scoured the docs for anything that would give me more fine-grained control over which errors are handled, and how, but I haven't found anything useful. What I need is a tree that contains all XML entities properly; I don't really care about the invalid control characters.
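A possible workaround is to throw the invalid characters away before the parser ever sees them. The following is only a minimal sketch, not something taken from the lxml docs: `strip_invalid_xml_chars` and its helpers are hypothetical names, and it assumes the document is available as a decoded string and that the offending sequences never occur inside CDATA sections or comments, where `&#11;` would be literal text rather than a character reference.

```python
import re

# Raw characters that XML 1.0 forbids: C0 controls other than tab, LF and CR,
# plus U+FFFE and U+FFFF. (Lone surrogates are ignored here for simplicity.)
_INVALID_RAW = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\ufffe\uffff]")

# Numeric character references, decimal (&#11;) or hexadecimal (&#xB;).
_CHAR_REF = re.compile(r"&#(?:x([0-9a-fA-F]+)|([0-9]+));")


def _is_valid_xml_char(codepoint):
    """Return True if the code point matches the XML 1.0 'Char' production."""
    return (
        codepoint in (0x9, 0xA, 0xD)
        or 0x20 <= codepoint <= 0xD7FF
        or 0xE000 <= codepoint <= 0xFFFD
        or 0x10000 <= codepoint <= 0x10FFFF
    )


def _drop_invalid_ref(match):
    """Keep valid character references as they are, drop invalid ones."""
    hex_digits, dec_digits = match.groups()
    codepoint = int(hex_digits, 16) if hex_digits else int(dec_digits)
    return match.group(0) if _is_valid_xml_char(codepoint) else ""


def strip_invalid_xml_chars(text):
    """Remove raw invalid characters and references to invalid characters."""
    text = _INVALID_RAW.sub("", text)
    return _CHAR_REF.sub(_drop_invalid_ref, text)


# Illustrative input modelled on the sample above.
broken_xml = """<?xml version="1.0"?>
<root>
    <child>&lt;something&gt; &amp;</child>
    <child>&#11;</child>
    <child>&lt;something&gt; &amp;</child>
</root>
"""

cleaned = strip_invalid_xml_chars(broken_xml)
print(cleaned)
```

With the `&#11;` reference gone, the cleaned string should parse with the default parser, for example `etree.fromstring(cleaned)`, so `recover=True` would not be needed and the entities later in the document should stay intact.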
The use case is that we're getting these invalid XML documents from MS Exchange, where some emails happen to contain control characters in their bodies. Ignoring all remaining entities means that all HTML bodies turn into garbage.

Does anyone have any pointers on how I can get this to work?

Cheers,
Michal
Michal Petrucha