[Python-Dev] XML codec?

Fri Nov 9 21:35:11 CET 2007

On Nov 9, 2007 6:10 AM, Walter Dörwald <walter at livinglogic.de> wrote:
>
> Martin v. Löwis wrote:
> >>> Yes, an XML parser should be able to use UTF-8, UTF-16, UTF-32, etc
> >>> codecs to do the encoding.  There's no need to create a magical
> >>> mystery codec to pick out which though.
> >> So the code is good, if it is inside an XML parser, and it's bad if it
> >> is inside a codec?
> >
> > Exactly so. This functionality just *isn't* a codec - there is no
> > encoding. Instead, it is an algorithm for *detecting* an encoding.
>
> And what do you do once you've detected the encoding? You decode the
> input, so why not combine both into an XML decoder?

It seems to me that parsing XML requires 3 steps:
1) determine encoding
2) decode byte stream
3) parse XML (including handling of character references)

All an XML codec does is make the first step a side effect of the
second.  Rather than this:

encoding = detect_encoding(raw_data)
decoded_data = raw_data.decode(encoding)
tree = parse_xml(decoded_data, encoding)  # Verifies encoding

You'd have this:

decoder = codecs.getincrementaldecoder("xml-auto-detect")()
decoded_data = decoder.decode(raw_data, final=True)
tree = parse_xml(decoded_data, decoder.encoding)  # Verifies encoding

It's clear to me that detecting an encoding is actually the simplest
part of all this (so long as there's an API to do it!)  Putting it
inside a codec seems like the wrong subdivision of responsibility.
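
To illustrate how small a standalone detection API could be, here is a
minimal sketch.  It is an assumption, not an existing stdlib function:
the name detect_encoding, the default argument, and the BOM table are
all made up for this example.  It only looks at a byte order mark and
at an ASCII-compatible XML declaration; it does not handle BOM-less
UTF-16/UTF-32 or EBCDIC input, which a complete detector would need to.

```python
import codecs
import re

# Hypothetical BOM table.  UTF-32 is checked before UTF-16 because the
# UTF-32 LE BOM starts with the same two bytes as the UTF-16 LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_BE, "utf-16"),
    (codecs.BOM_UTF16_LE, "utf-16"),
]

# Matches encoding="..." inside a leading XML declaration, assuming the
# declaration itself is written in an ASCII-compatible encoding.
_DECL = re.compile(
    rb"""^<\?xml[^>]*?encoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["']"""
)


def detect_encoding(raw_data, default="utf-8"):
    """Guess the encoding of an XML byte string (simplified sketch)."""
    for bom, name in _BOMS:
        if raw_data.startswith(bom):
            # These codec names consume the BOM when decoding.
            return name
    match = _DECL.match(raw_data)
    if match:
        return match.group(1).decode("ascii").lower()
    # No BOM and no declared encoding: fall back to the default.
    return default


raw_data = b'<?xml version="1.0" encoding="ISO-8859-1"?><a>\xe9</a>'
encoding = detect_encoding(raw_data)
decoded_data = raw_data.decode(encoding)
```

With something like this available, the decode step stays an ordinary
codec lookup and the parser can still verify the declared encoding
itself.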

(An example using streams would end up closer, but it still seems
wrong to me.  Encoding detection is always one-way, while codecs are
always two-way, even if lossy.)
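
For comparison, a stream-based version that keeps detection as a
separate step might look like the sketch below.  The helper name
open_decoded, its detect callback, and the probe_size parameter are
hypothetical, and the sketch assumes the stream is seekable so the
probed bytes can be re-read.

```python
import codecs
import io


def open_decoded(stream, detect, probe_size=1024):
    """Return (text_reader, encoding) for a seekable binary stream.

    `detect` is any callable mapping the first bytes to a codec name.
    """
    start = stream.tell()
    head = stream.read(probe_size)  # peek at the start of the document
    stream.seek(start)              # rewind so the reader sees everything
    encoding = detect(head)
    # An ordinary StreamReader does the decoding; detection stays outside.
    return codecs.getreader(encoding)(stream), encoding


def bom_only_detect(head):
    # Toy detector used just for this example: UTF-8 BOM or plain UTF-8.
    return "utf-8-sig" if head.startswith(codecs.BOM_UTF8) else "utf-8"


raw_stream = io.BytesIO(codecs.BOM_UTF8 + "<a>\u00e9</a>".encode("utf-8"))
reader, stream_encoding = open_decoded(raw_stream, bom_only_detect)
text = reader.read()
```

The detected name is still available to hand to the parser, and the
codec machinery remains a plain two-way codec.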

-- 
Adam Olsen, aka Rhamphoryncus