Parsing XML with ElementTree (unicode problem?)

Stefan Behnel stefan.behnel-n05pAM at web.de
Tue Jul 24 17:21:13 CEST 2007


oren.tsur at gmail.com wrote:
>> How about trying
>> root = ElementTree.parse(urlopen(query), encoding ='utf-8')

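That doesn't work: ElementTree.parse() has no encoding argument. The parser
takes the encoding from the XML declaration of the document itself.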
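
A minimal sketch of the point, assuming a current Python 3 and the standard
library's xml.etree.ElementTree. The byte string stands in for whatever
urlopen(query) would return, and the element names are made up for the example:

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the response body of urlopen(query).
data = (
    '<?xml version="1.0" encoding="utf-8"?>'
    "<book><name>Sauni\u00e8re</name></book>"
).encode("utf-8")

# parse() accepts a file name or a file-like object. The encoding is read
# from the XML declaration, so nothing else needs to be passed in.
tree = ET.parse(io.BytesIO(data))
name = tree.getroot().findtext("name")

# Passing an encoding keyword is rejected, because parse() has no such parameter.
try:
    ET.parse(io.BytesIO(data), encoding="utf-8")
    encoding_accepted = True
except TypeError:
    encoding_accepted = False
```

Here name comes back as a proper unicode string, without any encoding hint
from the caller.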


> this specific thing is not working, however, parsing the url is not
> problematic.

So you tried parsing the complete XML file and it works? Then it's the way you
stripped it down to the interesting parts that broke it. Not ElementTree's fault.


> the problem is that after parsing the xml at the url I
> save some of the fields to a local file and the local file is not
> being parsed properly due to the non-ascii characters Sauni\xc3\xa8re
> (french name: Saunière).

That looks like UTF-8 data that was parsed as some single-byte encoding, such
as iso-8859-1. Check whether the file you saved retained the XML declaration:

  <?xml version="1.0" encoding="utf-8" ?>
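
To illustrate what a wrong declaration does, here is a small sketch (again
assuming Python 3 and xml.etree.ElementTree; the <name> element is only an
example). The same UTF-8 bytes are parsed twice, once declared as utf-8 and
once declared as iso-8859-1:

```python
import xml.etree.ElementTree as ET

# The same UTF-8 encoded payload is used for both documents.
body = "<name>Sauni\u00e8re</name>".encode("utf-8")

utf8_doc = b'<?xml version="1.0" encoding="utf-8"?>' + body
latin1_doc = b'<?xml version="1.0" encoding="iso-8859-1"?>' + body

# Declared correctly: the two bytes \xc3\xa8 decode to one character.
good = ET.fromstring(utf8_doc).text

# Declared as a single-byte encoding: each byte becomes its own character.
bad = ET.fromstring(latin1_doc).text
```

With the wrong declaration the name has one character too many, because the
two bytes of the UTF-8 sequence are read as two separate Latin-1 characters.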


> I'm quite new to xml and python so I guess there must be something
> wrong or dumb in the way I save the file (maybe I miss some important
> tags?) or in the way I re-open it but I can't find whats wrong.

As I said, try to extract the interesting portions of the XML file
programmatically (especially if you want to do it more than once), or use an
editor that supports UTF-8 and/or XML when you edit it by hand.
Make sure the XML file is well-formed (check it with e.g. xmllint) when you
save it. Otherwise, no XML parser will accept it.
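
One way the programmatic route could look, as a sketch only: it assumes
Python 3 and xml.etree.ElementTree, and the element names (results, item,
name, names) and the temporary file are invented for the example, not taken
from your data.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the full document fetched from the URL.
source = (
    '<?xml version="1.0" encoding="utf-8"?>'
    "<results><item><name>Sauni\u00e8re</name><other>x</other></item></results>"
).encode("utf-8")

root = ET.fromstring(source)

# Build a smaller tree that holds only the fields of interest.
subset = ET.Element("names")
for item in root.iter("item"):
    ET.SubElement(subset, "name").text = item.findtext("name")

# Let ElementTree serialise it, so the output is well-formed and the
# XML declaration matches the bytes that are actually written.
path = os.path.join(tempfile.mkdtemp(), "subset.xml")
ET.ElementTree(subset).write(path, encoding="utf-8", xml_declaration=True)

# Reading the saved file back gives the original text again.
reparsed = ET.parse(path).getroot().findtext("name")

with open(path, "rb") as f:
    first_bytes = f.read(5)
```

Writing the file through the library instead of by string manipulation means
the declaration and the escaping are taken care of for you.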

Stefan


