[Python-3000] XML as bytes or unicode?

Stefan Behnel stefan_ml at behnel.de
Mon Aug 18 20:47:26 CEST 2008


Antoine Pitrou wrote:
> I took a look at test_sax and it seems sax.parser expects all (XML) input as
> unicode rather than bytes. Apparently ElementTree does the same. Is there any
> rationale for this decision?

There can't be. Serialised XML is about bytes, not characters.

Taking lxml as a reference, there is only one case in which it allows a
unicode string as parser input, and that is when the string contains no
encoding declaration at all. The rationale is that the XML specification allows
external transport protocols to provide this information instead, which in the
case of a unicode string is the platform-specific encoding that Python uses.
Therefore, I would not mind having support for unicode strings in the stdlib's
XML support as long as you get an encoding error for this:

    parser.feed("<?xml version='1.0' encoding='utf-8'?><root/>") # Py3

I doubt that that's currently the case, though...
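
To make the rule concrete, here is a rough sketch of the kind of check I
have in mind. The helper name check_parser_input and the regular expression
are made up for illustration; they are not part of the stdlib or of lxml, and
a real parser would do this check internally rather than with a regex.

```python
import re

# Hypothetical pattern: an XML declaration at the very start of the input
# that carries an encoding pseudo-attribute.
_ENCODING_DECL = re.compile(r"<\?xml\s[^>]*?\bencoding\s*=\s*['\"]")


def check_parser_input(data):
    """Reject unicode input that declares its own byte encoding.

    Hypothetical helper: bytes pass through untouched, and so do unicode
    strings without an encoding declaration.
    """
    if isinstance(data, str) and _ENCODING_DECL.match(data):
        raise ValueError(
            "unicode input must not contain an encoding declaration")
    return data


# Accepted: unicode without a declaration, and bytes with one.
check_parser_input("<root/>")
check_parser_input(b"<?xml version='1.0' encoding='utf-8'?><root/>")

# Rejected: unicode that claims to be UTF-8 encoded bytes.
try:
    check_parser_input("<?xml version='1.0' encoding='utf-8'?><root/>")
except ValueError as exc:
    print("rejected:", exc)
```

Whether the error should be a ValueError or something more specific is an
open choice; the sketch only shows where the line would be drawn.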

I also saw in the test that the XMLGenerator can serialise into a StringIO. At
least serialising to a byte encoding should fail when the target is a
StringIO, right?
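
For comparison, this is how the io module itself behaves in Py3 when encoded
bytes are written to a text target. The write_encoded function is a
hypothetical stand-in for a serialiser, not the actual XMLGenerator code.

```python
import io


def write_encoded(target, text, encoding="utf-8"):
    """Hypothetical serialiser step: encode the text, then write the bytes."""
    target.write(text.encode(encoding))


# A byte target accepts the encoded output.
buf = io.BytesIO()
write_encoded(buf, "<root/>")

# A text target refuses bytes: io.StringIO.write() raises TypeError.
try:
    write_encoded(io.StringIO(), "<root/>")
except TypeError as exc:
    print("refused:", exc)
```

So a serialiser that really writes encoded bytes would fail on a StringIO
without any extra checks.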

Stefan


