Parsing XML with ElementTree (unicode problem?)
stefan.behnel-n05pAM at web.de
Tue Jul 24 17:21:13 CEST 2007
oren.tsur at gmail.com wrote:
>> How about trying
>> root = ElementTree.parse(urlopen(query), encoding ='utf-8')
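That doesn't work. ElementTree.parse() takes no "encoding" argument; the
parser reads the encoding from the XML declaration of the document itself.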
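A minimal sketch of what I mean, assuming Python 3 and the standard library's xml.etree.ElementTree. The io.BytesIO object is only a stand-in for the result of urlopen(query), and the <result>/<name> document is made up for illustration:

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical payload; in the real script this would come from urlopen(query).
payload = '<?xml version="1.0" encoding="utf-8"?><result><name>Saunière</name></result>'
response = io.BytesIO(payload.encode("utf-8"))

# parse() accepts any binary file-like object and takes the encoding
# from the XML declaration, so no encoding argument is needed.
tree = ET.parse(response)
name = tree.getroot().findtext("name")

# Passing encoding= is rejected, because parse() has no such parameter.
try:
    ET.parse(io.BytesIO(payload.encode("utf-8")), encoding="utf-8")
    rejected = False
except TypeError:
    rejected = True
```

Here "name" comes back as a proper unicode string, without any help from the caller.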
> this specific thing is not working, however, parsing the url is not
So you tried parsing the complete XML file and it works? Then it's the way you
stripped it down to the interesting parts that broke it. Not ElementTree's fault.
> the problem is that after parsing the xml at the url I
> save some of the fields to a local file and the local file is not
> being parsed properly due to the non-ascii characters Sauni\xc3\xa8re
> (french name: Saunière).
That looks like UTF-8 that was parsed as some single-byte encoding, such as
iso-8859-1. Check whether the file you saved retained the XML declaration
<?xml version="1.0" encoding="utf-8" ?>
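To illustrate the effect, here is a small sketch (Python 3, standard library only; the <name> element is invented for the example). It feeds the same UTF-8 bytes to the parser twice, once with a correct declaration and once with a declaration that claims iso-8859-1:

```python
import xml.etree.ElementTree as ET

utf8_bytes = "Saunière".encode("utf-8")  # b'Sauni\xc3\xa8re'

# Correct declaration: the two bytes \xc3\xa8 are decoded as one character.
good = b'<?xml version="1.0" encoding="utf-8"?><name>' + utf8_bytes + b"</name>"
good_text = ET.fromstring(good).text

# Wrong declaration: each byte is decoded as a separate iso-8859-1 character.
bad = b'<?xml version="1.0" encoding="iso-8859-1"?><name>' + utf8_bytes + b"</name>"
bad_text = ET.fromstring(bad).text

# The same garbling, reproduced without any XML involved.
misread = utf8_bytes.decode("iso-8859-1")
```

If the name in your parsed data has two odd characters where the "è" should be, this kind of mismatch is the likely cause.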
> I'm quite new to xml and python so I guess there must be something
> wrong or dumb in the way I save the file (maybe I miss some important
> tags?) or in the way I re-open it but I can't find whats wrong.
As I said, either extract the interesting portions of the XML file
programmatically (especially if you want to do it more than once), or, if you
edit it by hand, use an editor that supports UTF-8 and/or XML.
Make sure the XML file is well-formed (use e.g. xmllint) when you save it.
Otherwise, no XML parser will accept it.
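A rough sketch of the programmatic route, assuming Python 3 and xml.etree.ElementTree. The element names (books, book, character, characters) are hypothetical since I don't know your data, and the io.BytesIO object stands in for the local file you would really write to:

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the XML fetched from the URL.
source = (
    '<?xml version="1.0" encoding="utf-8"?>'
    "<books><book><title>Example</title>"
    "<character>Saunière</character></book></books>"
).encode("utf-8")
root = ET.fromstring(source)

# Copy only the interesting elements into a new tree.
subset = ET.Element("characters")
for character in root.iter("character"):
    subset.append(character)

# Let ElementTree serialise the result, including the XML declaration,
# instead of writing the text by hand.
out = io.BytesIO()  # a real script would pass a file name here
ET.ElementTree(subset).write(out, encoding="utf-8", xml_declaration=True)
saved = out.getvalue()

# Re-parsing is a cheap well-formedness check: malformed input
# raises xml.etree.ElementTree.ParseError.
reparsed = ET.fromstring(saved)
names = [c.text for c in reparsed.iter("character")]
```

Because the serialiser writes both the declaration and the bytes, the two cannot get out of sync the way they can when the file is assembled by hand.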