Re: [lxml] Gracefully handling invalid XML characters when parsing documents
On Wed, Apr 05, 2017 at 11:02:14AM -0700, Peter Van Epp wrote:
On Wed, Apr 05, 2017 at 02:47:26PM +0200, Michal Petrucha wrote:
Hi everyone,
I'm finding myself in a situation where I need to process XML documents that aren't entirely valid, because they contain ASCII control characters, such as the vertical tab (U+000B), which are not allowed by the XML 1.0 specification. The invalid characters themselves are not important to me at all, and I'm fine with just throwing them away from the input stream and moving on. Other than those characters, the XML documents are valid.
While I'm not at all an XML or lxml expert, is it not possible to run the file through a preprocessor script (Perl, Python, sed, or whatever you have on your platform) and strip out the invalid characters before the parser sees them, to keep the parser happy?
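A minimal sketch of such a preprocessor in Python, assuming the document has already been decoded to a string. The name `strip_invalid_chars` is made up for illustration, and the character class is derived from the `Char` production of XML 1.0 (everything outside tab, LF, CR, and the allowed ranges). It only handles raw characters, not character references:

```python
import re
import xml.etree.ElementTree as ET

# Code points outside the XML 1.0 "Char" production: C0 controls other than
# tab, LF and CR, lone surrogates, and the noncharacters U+FFFE and U+FFFF.
_INVALID_XML_CHARS = re.compile(
    "[\x00-\x08\x0b\x0c\x0e-\x1f\ud800-\udfff\ufffe\uffff]"
)


def strip_invalid_chars(text):
    """Drop raw characters that XML 1.0 does not allow (hypothetical helper)."""
    return _INVALID_XML_CHARS.sub("", text)


# A document with a raw vertical tab in its text content.
raw = "<doc>before\x0bafter</doc>"
root = ET.fromstring(strip_invalid_chars(raw))
```

The standard-library parser is used here only to keep the sketch self-contained; the cleaned string could be handed to `lxml.etree.fromstring` in the same way.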
This has crossed my mind, and it may be what I'll try to do if I don't discover a better solution, but it makes me uneasy, because there are different ways the same character can be encoded in an XML document (a decimal character reference such as &#11;, a hexadecimal one such as &#x0B;, and the raw character itself are just the ones I can immediately think of, but I suspect there's more magic you can do with XML entities). This is why I was hoping to handle this with a proper XML parser rather than just a regular expression.

Cheers,
Michal
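To illustrate the concern about encoded forms, here is a rough sketch that also drops numeric character references pointing at disallowed code points. The helper names are hypothetical, and this is still plain text substitution rather than real parsing: it would also rewrite look-alike text inside CDATA sections and comments, which is the kind of edge case that makes the regex approach uneasy.

```python
import re

# Decimal (&#11;) and hexadecimal (&#x0B;) numeric character references.
_CHAR_REF = re.compile(r"&#(?:x([0-9A-Fa-f]+)|([0-9]+));")


def _is_valid_xml_char(cp):
    """True if the code point matches the XML 1.0 Char production."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def strip_invalid_char_refs(text):
    """Remove character references to code points XML 1.0 forbids."""
    def _replace(match):
        hex_digits, dec_digits = match.groups()
        cp = int(hex_digits, 16) if hex_digits else int(dec_digits)
        # Keep valid references untouched; drop the invalid ones entirely.
        return match.group(0) if _is_valid_xml_char(cp) else ""
    return _CHAR_REF.sub(_replace, text)


cleaned = strip_invalid_char_refs("<a>x&#11;y&#x0B;z&#65;</a>")
```

Here `cleaned` keeps the valid reference `&#65;` and loses both spellings of the vertical tab.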