[Python-Dev] XML codec?

Walter Dörwald walter at livinglogic.de
Fri Nov 9 11:41:28 CET 2007


Adam Olsen wrote:

> On 11/8/07, Walter Dörwald <walter at livinglogic.de> wrote:
>> [...]
>>>> Furthermore encoding-detection might be part of the responsibility of
>>>> the XML parser, but this decoding phase is totally distinct from the
>>>> parsing phase, so why not put the decoding into a common library?
>>> I would not object to that - just to expose it as a codec. Adding it
>>> to the XML library is fine, IMO.
>> But it does make sense as a codec. The decoding phase of an XML parser
>> has to turn a byte stream into a unicode stream. That's the job of a codec.
> 
> Yes, an XML parser should be able to use UTF-8, UTF-16, UTF-32, etc
> codecs to do the encoding.  There's no need to create a magical
> mystery codec to pick out which though.

So the code is good if it is inside an XML parser, but bad if it is
inside a codec?
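
To make the "decoding phase" concrete, here is a minimal sketch of the kind of detection being discussed. It is not the code from the proposed codec: the names sniff_xml_encoding and decode_xml are made up for illustration, it is written for Python 3, and it only looks at a byte order mark or an ASCII-compatible XML declaration (a UTF-16 document without a BOM is not handled).

```python
import codecs
import re

# Byte order marks, UTF-32 first so that its BOM is not mistaken for
# the UTF-16 one (the UTF-32-LE BOM starts with the UTF-16-LE BOM).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]

# Matches the encoding pseudo-attribute of an XML declaration at the
# very start of the byte string.
_DECL = re.compile(
    br"""<\?xml[^>]*?encoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["']"""
)


def sniff_xml_encoding(data, default="utf-8"):
    """Guess the encoding of an XML byte string (hypothetical helper)."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    match = _DECL.match(data)
    if match:
        return match.group(1).decode("ascii")
    return default


def decode_xml(data, encoding=None):
    """Decode XML bytes, preferring an externally supplied encoding."""
    return data.decode(encoding or sniff_xml_encoding(data))
```

Nothing in this sketch depends on the rest of an XML parser, which is the point made above: the step only turns bytes into unicode.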

> It's not even sufficient for
> XML:
> 
> 1) round-tripping a file should be done in the original encoding.
> Containing the auto-detected encoding within a codec doesn't let you
> see what it picked.

The chosen encoding is available from the incremental encoder:

import codecs

e = codecs.getincrementalencoder("xml-auto-detect")()
e.encode(u"<?xml version='1.0' encoding='utf-32'?><foo/>", True)
print e.encoding

This prints utf-32.
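
For readers without the patch, the following is a rough sketch of how an incremental encoder could expose the encoding it picked. It is an assumption-laden stand-in, not the actual "xml-auto-detect" implementation: the class name SniffingXMLEncoder is hypothetical, it is written for Python 3, it falls back to UTF-8 when no declaration is found, and it does not implement reset().

```python
import codecs
import re

# Matches an XML declaration that names an encoding, at the start of the text.
_DECL = re.compile(
    r"""<\?xml[^>]*?encoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["']"""
)


class SniffingXMLEncoder(codecs.IncrementalEncoder):
    """Hypothetical encoder that picks its encoding from the XML declaration."""

    def __init__(self, errors="strict"):
        codecs.IncrementalEncoder.__init__(self, errors)
        self.encoding = None   # set once the encoding has been decided
        self._encoder = None   # the real encoder we delegate to
        self._pending = ""     # text buffered while still undecided

    def encode(self, input, final=False):
        if self._encoder is None:
            self._pending += input
            match = _DECL.match(self._pending)
            if match:
                self.encoding = match.group(1)
            elif (final or ">" in self._pending
                  or not "<?xml".startswith(self._pending[:5])):
                # No declaration with an encoding can follow any more.
                self.encoding = "utf-8"
            else:
                # Possibly an incomplete declaration: wait for more input.
                return b""
            self._encoder = codecs.getincrementalencoder(self.encoding)(self.errors)
            input, self._pending = self._pending, ""
        return self._encoder.encode(input, final)
```

After the first call that sees the complete declaration, the encoding attribute holds the name taken from it, so a caller that wants to round-trip a file can read it back.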

> 2) the encoding may be specified externally from the file/stream[1].
> The xml parser needs to handle these out-of-band encodings anyway.

It does. You can pass an encoding to the stateless decoder, the
incremental decoder and the stream reader. The codec will then use this
encoding instead of the one detected from the byte stream. It will even
put the correct encoding into the XML declaration (if there is one):

import codecs

d = codecs.getdecoder("xml-auto-detect")
print d("<?xml version='1.0' encoding='iso-8859-1'?><foo/>",
        encoding="utf-8")[0]

This prints:
<?xml version='1.0' encoding='utf-8'?><foo/>
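
The behaviour shown above can be approximated without the patch. The sketch below assumes Python 3 and uses a hypothetical helper name, decode_with_override. It decodes with the externally supplied encoding and then rewrites the name in the declaration, if there is one. It is not the codec's actual implementation.

```python
import re

# Group 2 is the encoding name inside an XML declaration at the start
# of the decoded text.
_DECL_ENC = re.compile(
    r"""(<\?xml[^>]*?encoding\s*=\s*["'])([A-Za-z][A-Za-z0-9._-]*)(["'])"""
)


def decode_with_override(data, encoding):
    """Decode XML bytes with an out-of-band encoding (hypothetical helper).

    The encoding name in the XML declaration, if present, is replaced
    so that the returned text agrees with the encoding actually used.
    """
    text = data.decode(encoding)
    match = _DECL_ENC.match(text)
    if match is None:
        return text
    return text[:match.start(2)] + encoding + text[match.end(2):]


print(decode_with_override(
    b"<?xml version='1.0' encoding='iso-8859-1'?><foo/>", "utf-8"))
```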

Servus,
   Walter

