[Python-3000] XML as bytes or unicode?

"Martin v. Löwis" martin at v.loewis.de
Sun Sep 7 18:01:53 CEST 2008


>> Parsing Unicode XML strings isn't quite that meaningful.
> 
> Maybe not according to the XML standard, but I can see lots of
> practical situations where the encoding is always known and applied by
> some other layer, i.e. the I/O library or a database wrapper. Forcing
> XML to be interpreted as binary isn't always the best idea. E.g.
> consider storing XML in a SVN repository. Or consider storing XML
> fragments in Python string literals.

Stefan got it right - a "higher-level protocol" may override the
encoding declaration in the XML data. In the case of Python Unicode
strings, the data is already decoded and stored as 16-bit (or 32-bit)
Unicode, "obviously" overriding the declared encoding (although
technically, that protocol needs to state explicitly which encoding
takes precedence).
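For comparison, here is a minimal sketch of the byte-string case, where
no higher-level protocol is involved and the parser reads the encoding
declaration itself. The sample character and the variable names are
illustrative assumptions, not taken from the thread.

```python
import xml.dom.minidom

# Bytes input: the parser has to rely on the encoding declaration.
# The byte 0xE9 is "e with acute accent" in ISO-8859-1.
raw = b"<?xml version='1.0' encoding='iso-8859-1'?><hallo>\xe9</hallo>"

doc = xml.dom.minidom.parseString(raw)
text = doc.documentElement.childNodes[0].data

# The single byte is decoded to the single character U+00E9.
print(ascii(text))
```

Here the declaration and the actual byte encoding agree, so the text
node holds the intended character.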

So let me rephrase: "Parsing Unicode XML strings may easily lead
to parsing problems" (i.e. if the parser hasn't been told that a
higher-level protocol is in place and still trusts the encoding
declaration). This is currently the case in 3.0:

py> import xml.dom.minidom
py> d=xml.dom.minidom.parseString("<?xml version='1.0' encoding='iso-8859-1'?><hallo>\u20ac</hallo>")
py> d.documentElement.childNodes[0].data
'â\x82¬'
py> list(map(ord,d.documentElement.childNodes[0].data))
[226, 130, 172]
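
The three code points above are exactly the UTF-8 bytes of U+20AC. A
minimal sketch of how such a result can arise, assuming (this is an
assumption, not something stated above) that the string is serialized
to UTF-8 on its way to the parser and then decoded according to the
declared iso-8859-1:

```python
euro = "\u20ac"

# Assumed step 1: the Unicode string is serialized as UTF-8.
utf8_bytes = euro.encode("utf-8")

# Assumed step 2: the bytes are decoded using the declared encoding,
# so each byte turns into one separate character.
misread = utf8_bytes.decode("iso-8859-1")

print(list(utf8_bytes))
print(list(map(ord, misread)))
```

Both lists contain the same three values, matching the code points
seen in the parsed text node.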

Regards,
Martin

