Gracefully handling invalid XML characters when parsing documents - lxml - The Python XML Toolkit

5 Apr 2017

Hi everyone,

I'm finding myself in a situation where I need to process XML
documents that aren't entirely valid, because they contain ASCII
control characters, such as the vertical tab (&#11;), which are not
allowed by the XML 1.0 specification. The invalid characters
themselves are not important to me at all, and I'm fine with just
throwing them away from the input stream, and moving on. Other than
those characters, the XML documents are valid.
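
For reference, the rule being violated here is the ``Char`` production of
XML 1.0, which allows only tab, line feed, carriage return, and the ranges
U+0020 to U+D7FF, U+E000 to U+FFFD and U+10000 to U+10FFFF. A minimal sketch
of that check follows; the function name is made up for illustration and is
not part of lxml:

```python
def is_xml10_char(code_point):
    """Return True if the code point is allowed by the XML 1.0 Char production.

    Hypothetical helper for illustration only; not an lxml API.
    """
    return (
        code_point in (0x9, 0xA, 0xD)            # tab, line feed, carriage return
        or 0x20 <= code_point <= 0xD7FF          # most of the BMP, below surrogates
        or 0xE000 <= code_point <= 0xFFFD        # rest of the BMP, minus FFFE/FFFF
        or 0x10000 <= code_point <= 0x10FFFF     # supplementary planes
    )


# The vertical tab (code point 11) from the error message is rejected,
# while an ordinary tab (code point 9) is accepted.
print(is_xml10_char(11), is_xml10_char(9))
```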

When I try to parse such an XML document with lxml with its default
settings, obviously, I get an error::

    lxml.etree.XMLSyntaxError: xmlParseCharRef: invalid xmlChar value 11, line 6, column 17

In order to silence this error, and try to recover from it, I can use
a custom parser with the “recover” option. This does get the job done
in the sense that the error no longer gets raised, but it has
significant side effects. Apparently, after the first invalid XML
character is encountered, from that point on, the parser ignores *all*
XML entities in the rest of the document.

Here's a brief code sample that demonstrates the problem::

    from lxml import etree

    broken_xml = """<?xml version="1.0"?>
    <root>
        <child>
            &lt;something&gt; &amp;
        </child>
        <child>&#11;</child>
        <child>
            &lt;something&gt; &amp;
        </child>
    </root>
    """

    recovering_parser = etree.XMLParser(recover=True)
    broken_tree = etree.fromstring(broken_xml, parser=recovering_parser)
    print(etree.tostring(broken_tree, pretty_print=True, encoding="unicode"))

The output I get from this is the following::

    <root>
        <child>
            &lt;something&gt; &amp;
        </child>
        <child/>
        <child>
            something 
        </child>
    </root>

I've scoured the docs for anything that would give me more
fine-grained control over what errors should be handled, and how, but
I haven't found anything useful. 

What I need is a tree that contains all XML entities properly, and I
don't really care about the invalid control characters. The use case
is that we're getting these invalid XML documents from MS Exchange,
where some emails happen to contain control characters in their
bodies, and ignoring all remaining entities means that all HTML bodies
turn into garbage.
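
To make the "just throw them away" idea concrete, here is a rough sketch of a
pre-processing pass that drops the offending characters before the parser
ever sees them. It uses only the standard ``re`` module; the function and
pattern names are hypothetical, and it is a workaround under stated
assumptions rather than anything lxml provides. It assumes the document is
available as a ``str``, and it does not understand CDATA sections or
comments, so a literal ``&#11;`` inside those would be removed as well:

```python
import re

# Any character outside the XML 1.0 "Char" production.
_INVALID_CHAR = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)

# Numeric character references: decimal (&#11;) or hexadecimal (&#xB;).
_CHAR_REF = re.compile(r"&#(?:x([0-9A-Fa-f]+)|([0-9]+));")


def strip_invalid_xml_chars(text):
    """Remove invalid raw characters and references to them (hypothetical helper)."""

    def _drop_bad_ref(match):
        hex_digits, dec_digits = match.groups()
        code_point = int(hex_digits, 16) if hex_digits else int(dec_digits)
        # Out-of-range values are checked first so chr() is never given one.
        if code_point > 0x10FFFF or _INVALID_CHAR.match(chr(code_point)):
            return ""            # drop the whole reference
        return match.group(0)    # keep valid references untouched

    text = _CHAR_REF.sub(_drop_bad_ref, text)
    return _INVALID_CHAR.sub("", text)
```

Named entities such as ``&amp;`` and valid numeric references are left
alone, so the cleaned string can then be handed to ``etree.fromstring``
with the default parser instead of a recovering one.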

Does anyone have any pointers on how I can get this to work?

Cheers,

Michal

Michal Petrucha
