[Python-3000] XML as bytes or unicode?

Wed Aug 27 19:30:06 CEST 2008

Guido van Rossum wrote:
> 2008/8/24 "Martin v. Löwis" <martin at v.loewis.de>:
>> Parsing Unicode XML strings isn't quite that meaningful.
> 
> Maybe not according to the XML standard, but I can see lots of
> practical situations where the encoding is always known and applied by
> some other layer, i.e. the I/O library or a database wrapper. Forcing
> XML to be interpreted as binary isn't always the best idea. E.g.
> consider storing XML in a SVN repository. Or consider storing XML
> fragments in Python string literals.

lxml handles XML data in unicode strings nicely. The reasoning is that the XML
spec says in 4.3.3:

"""
In the absence of information provided by an external transport protocol (e.g.
HTTP or MIME), it is a fatal error for an entity including an encoding
declaration to be presented to the XML processor in an encoding other than
that named in the declaration [...]
"""

On a given platform, the internal encoding of a Python unicode string is well
defined, which means it is as good as an encoding provided by a transport
protocol. So this works as long as the XML content of the unicode string does
not specify a wrong encoding itself (in which case the parser must reject it).
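
To make the distinction concrete, here is a minimal sketch. It uses the standard library's xml.etree.ElementTree rather than lxml, so it only illustrates the general idea (decoded text needs no encoding detection, bytes do) and not lxml's exact behaviour. The sample element name and content are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Already-decoded text with no encoding declaration (hypothetical sample).
text = "<greeting>Grüße</greeting>"

# Parsing the decoded string: the encoding question was settled by whatever
# layer produced the string (I/O library, database wrapper, source literal).
root_from_text = ET.fromstring(text)

# Parsing bytes: the parser has to work out the encoding itself,
# here from the encoding named in the XML declaration.
data = b'<?xml version="1.0" encoding="iso-8859-1"?>' + text.encode("iso-8859-1")
root_from_bytes = ET.fromstring(data)

# Both routes should yield the same character content.
print(root_from_text.text == root_from_bytes.text)
```

The point of the comparison is that in the first case the parser never sees a byte encoding at all, while in the second it must trust the declaration inside the document.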

Another reason why lxml handles this is that it also has great support for
HTML. In the HTML world, unicode data is a lot easier to handle than the
average byte-encoded page that doesn't provide any encoding information at all.

Stefan