encoding="utf8" ignored when parsing XML
Peter Otten
__peter__ at web.de
Tue Dec 27 11:10:41 EST 2016
Skip Montanaro wrote:
> Peter> Isn't UTF-8 the default?
>
> Apparently not.
Sorry, I meant the default for XML.
> I believe in my reading it said that it used whatever
> locale.getpreferredencoding() returned. That's problematic when you
> live in a country that thinks ASCII is everything. Personally, I think
> UTF-8 should be the default, but that train's long left the station,
> at least for Python 2.x.
>
>> Try opening the file in binary mode then:
>>
>> with io.open(fname, "rb") as f:
>>     root = xml.etree.ElementTree.parse(f).getroot()
>
> Thanks, that worked. Would appreciate an explanation of why binary
> mode was necessary. It would seem that since the file contents are
> text, just in a non-ASCII encoding, that specifying the encoding when
> opening the file should do the trick.
>
> Skip
My tentative explanation: if you open the file as text, it is decoded
successfully, i.e.

    io.open(fname, encoding="UTF-8").read()

works. The XML parser, however, needs bytes, so the decoded text has to be
encoded again, and for that step the "preferred encoding", in your case
ASCII, is used. Since the text contains non-ASCII characters, you get a
UnicodeEncodeError.
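
A minimal sketch of the two steps, in case it helps. The sample document and
the temporary file are made up for illustration (the original file is not
shown in this thread), and the explicit encode("ascii") call only imitates the
re-encoding step I suspect; it is not what the parser literally runs.

```python
import io
import os
import tempfile
import xml.etree.ElementTree as ET

# Hypothetical sample document containing one non-ASCII character.
xml_text = u'<?xml version="1.0" encoding="UTF-8"?>\n<name>Zo\xeb</name>\n'

# Write it to a temporary file as UTF-8 (the file name is an assumption).
fd, fname = tempfile.mkstemp(suffix=".xml")
os.close(fd)
with io.open(fname, "w", encoding="UTF-8") as f:
    f.write(xml_text)

# Binary mode: the parser receives the raw bytes and decodes them itself,
# guided by the encoding given in the XML declaration.
with io.open(fname, "rb") as f:
    root = ET.parse(f).getroot()
os.remove(fname)

# Imitate the suspected failure: encoding the decoded text back to bytes
# with ASCII cannot represent the non-ASCII character.
encode_failed = False
try:
    root.text.encode("ascii")
except UnicodeEncodeError:
    encode_failed = True

print(repr(root.text), encode_failed)
```

Opening the file in binary mode avoids the round trip entirely, which is why
the parser copes with it regardless of the locale.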