How to get xml.etree.ElementTree not to bomb on invalid characters in XML file?

Stefan Behnel stefan_ml at behnel.de
Tue May 4 03:23:56 EDT 2010


Barak, Ron, 04.05.2010 09:01:
>  I'm parsing XML files using ElementTree from xml.etree (see code below
> (and attached xml_parse_example.py)).
>
> However, I'm coming across input XML files (attached an example:
> tmp.xml) which include invalid characters, that produce the following
> traceback:
>
> $ python xml_parse_example.py
> Traceback (most recent call last):
> xml.parsers.expat.ExpatError: not well-formed (invalid token): line 6, column 34

I hope you are aware that this means the input you are parsing is not
XML. It's best to reject the file and tell the producers that they are
writing broken output files. You should always fix the source instead of
trying to make sense of broken input in fragile ways.
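
As a rough sketch of what "reject the file" could look like in code: the
helper below catches the parse error and turns it into a message you can
send back to whoever produced the file. The name parse_or_reject is made up
for illustration, and the sketch assumes a Python version where ElementTree
raises ET.ParseError (older versions raise ExpatError directly, as in the
traceback above).

```python
import xml.etree.ElementTree as ET


def parse_or_reject(xml_text):
    """Return (root, None) on success, or (None, message) if parsing fails.

    Hypothetical helper: it does not try to repair anything, it only
    reports where the parser gave up.
    """
    try:
        return ET.fromstring(xml_text), None
    except ET.ParseError as exc:
        # exc.position is the (line, column) pair reported by expat.
        line, column = exc.position
        message = "rejected: not well-formed at line %d, column %d" % (line, column)
        return None, message


if __name__ == "__main__":
    # A raw control character such as \x01 is not allowed in XML 1.0.
    root, error = parse_or_reject("<root><a>bad \x01 char</a></root>")
    print(error)
```

The position in the message is usually enough for the producers to find the
offending byte in their output.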


> I read the documentation for xml.etree.ElementTree and see that it may
> take an optional parser parameter, but I don't know what this parser
> should be - to ignore the invalid characters.
>
> Could you suggest a way to call ElementTree, so it won't bomb on these
> invalid characters ?

No. The parser in lxml.etree has a 'recover' option that lets it try to
recover from input errors, but in general, XML parsers are required to
reject input that is not well-formed.
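
If you do go the lxml route, a minimal sketch might look like the following.
It assumes lxml is installed (it is a third-party package, not part of the
standard library), and the function name parse_with_recovery is invented for
this example. Treat the result with suspicion: a recovering parser may
silently drop or truncate content around the broken spot.

```python
try:
    from lxml import etree  # third-party; not in the standard library
except ImportError:
    etree = None


def parse_with_recovery(xml_bytes):
    """Parse bytes with lxml's recovering parser.

    Hypothetical helper: returns None if lxml is not installed or if
    nothing could be recovered from the input.
    """
    if etree is None:
        return None
    # recover=True asks libxml2 to keep going after well-formedness errors.
    parser = etree.XMLParser(recover=True)
    return etree.fromstring(xml_bytes, parser)


if __name__ == "__main__":
    root = parse_with_recovery(b"<root><a>ok</a><b>bad \x01 char</b></root>")
    if root is not None:
        # Inspect what survived; do not assume the tree is complete.
        print(etree.tostring(root))
```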

Stefan



