How to get xml.etree.ElementTree not bomb on invalid characters in XML file ?

Tue May 4 20:37:15 EDT 2010

On May 5, 12:11 am, "Barak, Ron" <Ron.Ba... at lsi.com> wrote:
> > -----Original Message-----
> > From: Stefan Behnel [mailto:stefan... at behnel.de]
> > Sent: Tuesday, May 04, 2010 10:24 AM
> > To: python-l... at python.org
> > Subject: Re: How to get xml.etree.ElementTree not bomb on
> > invalid characters in XML file ?
>
> > Barak, Ron, 04.05.2010 09:01:
> > >  I'm parsing XML files using ElementTree from xml.etree (see code
> > > below (and attached xml_parse_example.py)).
>
> > > However, I'm coming across input XML files (attached an example:
> > > tmp.xml) which include invalid characters, that produce the
> > > following traceback:
>
> > > $ python xml_parse_example.py
> > > Traceback (most recent call last):
> > > xml.parsers.expat.ExpatError: not well-formed (invalid token): line 6, column 34
>
> > I hope you are aware that this means that the input you are
> > parsing is not XML. It's best to reject the file and tell the
> > producers that they are writing broken output files. You
> > should always fix the source, instead of trying to make sense
> > out of broken input in fragile ways.
>
> > > I read the documentation for xml.etree.ElementTree and see that it
> > > may take an optional parser parameter, but I don't know what this
> > > parser should be - to ignore the invalid characters.
>
> > > Could you suggest a way to call ElementTree, so it won't bomb on
> > > these invalid characters ?
>
> > No. The parser in lxml.etree has a 'recover' option that lets
> > it try to recover from input errors, but in general, XML
> > parsers are required to reject non well-formed input.
>
> > Stefan
>
> Hi Stefan,
> The XML file seems to be valid XML (all XML viewers I tried were able to read it).
> You can verify this by trying to read the XML example I attached to the original message (attached again here).
> Actually, when trying to view the file with an XML viewer, these offensive characters are not shown.
> It's just that some of the fields include characters that the parser used by ElementTree seems to choke on.
> Bye,
> Ron.
>
>  tmp_small.xml

Have a look at your file with e.g. a hex editor or just Python repr()
-- see below. You will see that there are four cases of
    <tag>good_data\x00garbage</tag>
where "garbage" is repeated \x00 or just random line noise or
uninitialised memory.

<m_sanApiName1>"MainStorage_snap\x00\x00*SNIP*\x00\x00"</m_sanApiName1>

<m_detail>"BROLB21\x00\xee"\x00\x00\x00\x90,\x02G\xdc\xfb\x04P\xdc\xfb\x04\x01a\xfc>(\xe8\xfb\x04"</m_detail>

It's a toss-up whether the > in there is accidental or a deliberate
attempt to sanitise the garbage !-)

<m_detail>"Alstom\x00\x00o\x00m\x00\x00*SNIP*\x00\x00"</m_detail>

<m_sanApiVersion>"V5R1.28.1 [R - LA]\x00\x00*SNIP*\x00\x00"</m_sanApiVersion>
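
For reference, a minimal sketch of how lines like the ones above could be
located. It assumes Python 3 and works on raw bytes; the name
find_illegal_xml_bytes and the sample data are illustrative only, not taken
from the attached file. It flags the C0 control bytes that XML 1.0 forbids
(everything below 0x20 except tab, LF and CR) and does not check the encoding.

```python
import re

# C0 control bytes other than tab (0x09), LF (0x0A) and CR (0x0D)
# are not allowed anywhere in an XML 1.0 document.
ILLEGAL_XML_BYTES = re.compile(rb"[\x00-\x08\x0b\x0c\x0e-\x1f]")

def find_illegal_xml_bytes(data):
    """Return (line_number, column, line) for each line holding an illegal byte.

    line_number is 1-based, column is the 0-based offset of the first
    offending byte within that line.
    """
    hits = []
    for lineno, line in enumerate(data.split(b"\n"), start=1):
        match = ILLEGAL_XML_BYTES.search(line)
        if match:
            hits.append((lineno, match.start(), line))
    return hits

# Hypothetical sample, modelled on the third case shown above.
sample = b'<a>ok</a>\n<m_detail>"Alstom\x00\x00o\x00m"</m_detail>\n'
hits = find_illegal_xml_bytes(sample)
for lineno, col, line in hits:
    print(lineno, col, repr(line))
```

To scan a real file, pass in open(path, "rb").read() instead of the sample.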

The garbage in the 2nd case (for example \xee followed directly by a
double quote, which is not a valid UTF-8 sequence) is such as to make
the initial declaration
    encoding="UTF-8"
an outright lie and I'm curious as to how the XML parser managed to
get as far as it did -- it must decode a line at a time.
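
The encoding claim can be checked directly. The sketch below (Python 3
assumed; first_utf8_error is an illustrative helper name) feeds the bytes of
the 2nd case, copied from the repr() dump above, to the UTF-8 decoder and
reports where decoding first fails.

```python
# Bytes of the second <m_detail> element, as shown in the repr() dump.
fragment = (b'<m_detail>"BROLB21\x00\xee"\x00\x00\x00\x90,\x02G\xdc\xfb\x04P\xdc'
            b'\xfb\x04\x01a\xfc>(\xe8\xfb\x04"</m_detail>')

def first_utf8_error(data):
    """Return the offset of the first byte that breaks UTF-8 decoding, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        return exc.start
    return None

offset = first_utf8_error(fragment)
# Show the offending byte together with the byte that follows it.
print(offset, repr(fragment[offset:offset + 2]))
```

The first failure is at the \xee byte: it announces a three-byte sequence,
but the next byte is a plain double quote rather than a continuation byte.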

As already advised: it's much better to reject that rubbish outright
than to attempt to repair it. Repair should be contemplated only if
it's a one-off exercise AND you can't get a fixed copy from the
source.
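
If it really is a one-off, something along these lines might do. It is only
a sketch under stated assumptions: a current Python 3, where ElementTree
raises ParseError (the traceback quoted above shows the older ExpatError);
garbage that always starts at the first \x00 in a field and runs up to the
closing quote before the end tag; and a hypothetical helper name,
parse_or_repair. By default it rejects the input, and it only attempts the
repair when asked to.

```python
import re
import xml.etree.ElementTree as ET

# Assumption: garbage starts at the first NUL in the element text and runs
# up to the optional closing quote that sits right before the end tag.
GARBAGE = re.compile(rb'\x00.*?(?="?</)', re.DOTALL)

def parse_or_repair(data, allow_repair=False):
    """Parse XML bytes; optionally retry once after cutting NUL-led garbage."""
    try:
        return ET.fromstring(data)
    except ET.ParseError:
        if not allow_repair:
            raise  # default behaviour: reject the broken file
        return ET.fromstring(GARBAGE.sub(b"", data))

# Hypothetical input, modelled on the third case shown earlier.
broken = (b'<?xml version="1.0" encoding="UTF-8"?>\n'
          b'<root><m_detail>"Alstom\x00\x00o\x00m\x00\x00"</m_detail></root>')

root = parse_or_repair(broken, allow_repair=True)
print(root.findtext("m_detail"))
```

Anything cut this way is gone for good, so keep the original file and check
the repaired fields by eye.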

And while we're on the subject of rubbish: """The XML file seems to be
valid XML (all XML viewers I tried were able to read it).""" The
conclusion from that is that all XML viewers that you tried are
rubbish.