encoding="utf8" ignored when parsing XML

Peter Otten __peter__ at web.de
Tue Dec 27 11:10:41 EST 2016


Skip Montanaro wrote:

> Peter> Isn't UTF-8 the default?
> 
> Apparently not. 

Sorry, I meant the default for XML.
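
To illustrate what I mean, here is a minimal sketch (the sample data is made up): when the parser is handed bytes with no XML declaration and no byte order mark, it assumes UTF-8.

```python
import xml.etree.ElementTree as ET

# Made-up sample: UTF-8 bytes with no <?xml ...?> declaration and no BOM.
data = u"<greeting>Gr\u00fc\u00dfe</greeting>".encode("utf-8")

# The parser is given bytes and falls back to UTF-8 on its own.
root = ET.fromstring(data)
text = root.text
```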

> I believe in my reading it said that it used whatever
> locale.getpreferredencoding() returned. That's problematic when you
> live in a country that thinks ASCII is everything. Personally, I think
> UTF-8 should be the default, but that train's long left the station,
> at least for Python 2.x.
> 
>> Try opening the file in binary mode then:
>>
>> with io.open(fname, "rb") as f:
>>     root = xml.etree.ElementTree.parse(f).getroot()
> 
> Thanks, that worked. Would appreciate an explanation of why binary
> mode was necessary. It would seem that since the file contents are
> text, just in a non-ASCII encoding, that specifying the encoding when
> opening the file should do the trick.
> 
> Skip

My tentative explanation would be: If you open the file as text it will be
successfully decoded, i.e.

io.open(fname, encoding="UTF-8").read()

works, but the XML parser needs bytes, and to get back from the text to
bytes the "preferred encoding", in your case ASCII, will be used.

Since there are non-ASCII characters you get a UnicodeEncodeError.
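
A rough sketch of the two steps, in case it helps. The sample document and the temporary file are made up for illustration, and the explicit encode("ascii") only stands in for the implicit conversion I suspect is happening; it is not what ElementTree literally calls. Note also that Python 3 handles text-mode files differently, so this shows the encoding step rather than reproducing your exact traceback.

```python
import io
import os
import tempfile
import xml.etree.ElementTree as ET

# Made-up sample document containing one non-ASCII character.
xml_text = u'<?xml version="1.0" encoding="UTF-8"?><root>caf\u00e9</root>'

# Decoding is fine, just like io.open(fname, encoding="UTF-8").read() ...
decoded = xml_text.encode("utf-8").decode("utf-8")

# ... but converting the text back to bytes as ASCII is not.
try:
    decoded.encode("ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# Binary mode avoids the round trip: the parser gets the raw bytes and
# decodes them itself according to the XML declaration.
fd, fname = tempfile.mkstemp(suffix=".xml")
try:
    with os.fdopen(fd, "wb") as f:
        f.write(xml_text.encode("utf-8"))
    with io.open(fname, "rb") as f:
        root = ET.parse(f).getroot()
finally:
    os.remove(fname)
```

So in binary mode Python never has to guess an encoding; that job is left entirely to the XML parser.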




