On Nov 9, 2007 6:10 AM, Walter Dörwald <walter@livinglogic.de> wrote:
> Martin v. Löwis wrote:
>>>> Yes, an XML parser should be able to use UTF-8, UTF-16, UTF-32, etc
>>>> codecs to do the encoding.  There's no need to create a magical
>>>> mystery codec to pick out which though.
>>>
>>> So the code is good, if it is inside an XML parser, and it's bad if it
>>> is inside a codec?
>>
>> Exactly so. This functionality just *isn't* a codec - there is no
>> encoding. Instead, it is an algorithm for *detecting* an encoding.
>
> And what do you do once you've detected the encoding? You decode the
> input, so why not combine both into an XML decoder?
It seems to me that parsing XML requires 3 steps:

1) determine encoding
2) decode byte stream
3) parse XML (including handling of character references)

All an xml codec does is make the first part a side-effect of the
second part.  Rather than this:

encoding = detect_encoding(raw_data)
decoded_data = raw_data.decode(encoding)
tree = parse_xml(decoded_data, encoding)  # Verifies encoding

You'd have this:

e = codecs.getincrementaldecoder("xml-auto-detect")()
decoded_data = e.decode(raw_data, True)
tree = parse_xml(decoded_data, e.encoding)  # Verifies encoding

It's clear to me that detecting an encoding is actually the simplest
part of all this (so long as there's an API to do it!)  Putting it
inside a codec seems like the wrong subdivision of responsibility.

(An example using streams would end up closer, but it still seems wrong
to me.  Encoding detection is always one way, while codecs are always
two way (even if lossy.))

--
Adam Olsen, aka Rhamphoryncus
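For concreteness, the detect_encoding() used in the first snippet is only
illustrative; it is not an existing stdlib function.  A minimal sketch of
such a helper, following the auto-detection rules described in Appendix F
of the XML 1.0 spec (check for a BOM, then the byte layout of "<?xml",
then read the encoding declaration itself), might look like this; the
EBCDIC cases and BOM stripping are left out for brevity:

import codecs
import re

def detect_encoding(raw_data):
    # 1) A byte order mark pins the encoding down directly.
    #    UTF-32 BOMs must be tested before UTF-16, since the UTF-32-LE
    #    BOM starts with the UTF-16-LE BOM bytes.
    boms = [
        (codecs.BOM_UTF32_BE, "utf-32-be"),
        (codecs.BOM_UTF32_LE, "utf-32-le"),
        (codecs.BOM_UTF8, "utf-8"),
        (codecs.BOM_UTF16_BE, "utf-16-be"),
        (codecs.BOM_UTF16_LE, "utf-16-le"),
    ]
    for bom, name in boms:
        if raw_data.startswith(bom):
            return name

    # 2) No BOM: guess the encoding family from how the bytes of
    #    "<?xml" are laid out.
    prefixes = [
        (b"\x00\x00\x00<", "utf-32-be"),
        (b"<\x00\x00\x00", "utf-32-le"),
        (b"\x00<\x00?", "utf-16-be"),
        (b"<\x00?\x00", "utf-16-le"),
    ]
    for prefix, name in prefixes:
        if raw_data.startswith(prefix):
            return name

    # 3) ASCII-compatible family: honour an explicit encoding
    #    declaration if the document starts with one.
    if raw_data.startswith(b"<?xml"):
        m = re.match(rb"<\?xml[^>]*encoding=['\"]([A-Za-z0-9._-]+)",
                     raw_data)
        if m:
            return m.group(1).decode("ascii")

    # 4) Fall back to the spec's default.
    return "utf-8"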