
Adam Olsen wrote:
On 11/8/07, Walter Dörwald <walter@livinglogic.de> wrote:
[...]
Furthermore encoding-detection might be part of the responsibility of the XML parser, but this decoding phase is totally distinct from the parsing phase, so why not put the decoding into a common library?
I would not object to that - just to expose it as a codec. Adding it to the XML library is fine, IMO.
But it does make sense as a codec. The decoding phase of an XML parser has to turn a byte stream into a unicode stream. That's the job of a codec.
Yes, an XML parser should be able to use UTF-8, UTF-16, UTF-32, etc. codecs to do the encoding. There's no need to create a magical mystery codec to pick out which one, though.
So the code is good, if it is inside an XML parser, and it's bad if it is inside a codec?
It's not even sufficient for XML:
1) round-tripping a file should be done in the original encoding. Containing the auto-detected encoding within a codec doesn't let you see what it picked.
The chosen encoding is available from the incremental encoder:

   import codecs
   e = codecs.getincrementalencoder("xml-auto-detect")()
   e.encode(u"<?xml version='1.0' encoding='utf-32'?><foo/>", True)
   print e.encoding

This prints utf-32.
2) the encoding may be specified externally from the file/stream[1]. The xml parser needs to handle these out-of-band encodings anyway.
It does. You can pass an encoding to the stateless decoder, the incremental decoder and the streamreader. It will then use this encoding instead of the one detected from the byte stream. It will even put the correct encoding into the XML declaration (if there is one):

   import codecs
   d = codecs.getdecoder("xml-auto-detect")
   print d("<?xml version='1.0' encoding='iso-8859-1'?><foo/>", encoding="utf-8")[0]

This prints:

   <?xml version='1.0' encoding='utf-8'?><foo/>

Servus,
   Walter
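For readers without the patch at hand, here is a rough sketch of the two behaviours discussed above: reporting which encoding was picked, and letting an externally supplied encoding win over detection. It is not the code of the proposed "xml-auto-detect" codec. The names detect_xml_encoding and decode_xml are made up for illustration, the sketch is written for Python 3 rather than the Python 2 used in the examples above, and it only looks at a byte order mark and an ASCII-compatible XML declaration. It does not rewrite the declaration on an override, which the codec described above does.

```python
import codecs
import re

# (BOM, codec name) pairs. The UTF-32 BOMs are tested before the UTF-16
# ones because the UTF-32-LE BOM begins with the UTF-16-LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]

# Matches the encoding pseudo-attribute of an XML declaration at the very
# start of the data. Assumption: the declaration is in an ASCII-compatible
# encoding; BOM-less UTF-16/UTF-32 input is not handled by this sketch.
_DECL = re.compile(
    rb'<\?xml\s[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def detect_xml_encoding(data, default="utf-8"):
    """Guess the encoding of an XML byte string (hypothetical helper)."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    match = _DECL.match(data)
    if match:
        return match.group(1).decode("ascii").lower()
    return default


def decode_xml(data, encoding=None):
    """Return (text, encoding_used); an explicit encoding overrides detection."""
    used = encoding or detect_xml_encoding(data)
    return data.decode(used), used
```

Used like this, the caller can see what was picked and can pass an out-of-band encoding:

   text, used = decode_xml(raw_bytes)                    # detection
   text, used = decode_xml(raw_bytes, encoding="utf-8")  # external override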