Re: [Python-Dev] XML codec?

8 Nov 2007

      On 11/8/07, Walter Dörwald <walter@livinglogic.de> wrote:
...
Martin v. Löwis wrote:
...
...
Then how about the suggested "xml-auto-detect"?
That is better.
OK.
...
...
...
Then, I'd claim that the problem that the codec solves doesn't really
exist. IOW, most XML parsers implement the auto-detection of encodings,
anyway, and this is where architecturally this functionality belongs.
But not all XML parsers support all encodings. The XML codec makes it
trivial to add this support to an existing parser.
I would like to question this claim. Can you give an example of a parser
that doesn't support a specific encoding
It seems that e.g. expat doesn't support UTF-32:
from xml.parsers import expat
p = expat.ParserCreate()
e = "utf-32"
s = (u"<?xml version='1.0' encoding=%r?><foo/>" % e).encode(e)
p.Parse(s, True)
This fails with:
Traceback (most recent call last):
   File "gurk.py", line 6, in <module>
     p.Parse(s, True)
xml.parsers.expat.ExpatError: not well-formed (invalid token): line 1, column 1
Replace "utf-32" with "utf-16" and the problem goes away.
...
and where adding such a codec
solves that problem?
In particular, why would that parser know how to process Python Unicode
strings?
It doesn't have to. You can use an XML encoder to reencode the unicode
string into bytes (forcing an encoding that the parser knows):
import codecs
from xml.parsers import expat
ci = codecs.lookup("xml-auto-detect")
p = expat.ParserCreate()
e = "utf-32"
s = (u"<?xml version='1.0' encoding=%r?><foo/>" % e).encode(e)
s = ci.encode(ci.decode(s)[0], encoding="utf-8")[0]
p.Parse(s, True)
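
For comparison, the same re-encoding step can be sketched with only the standard library, without the proposed "xml-auto-detect" codec. This is a minimal sketch, assuming current Python 3 and assuming the input is already known to be UTF-32 with a BOM; the helper name reencode_utf32_to_utf8 is hypothetical and not part of any library.

```python
import re
from xml.parsers import expat

def reencode_utf32_to_utf8(data):
    """Hypothetical helper: turn UTF-32 XML bytes into UTF-8 XML bytes."""
    # Decode the bytes; the "utf-32" codec consumes the leading BOM.
    text = data.decode("utf-32")
    # Rewrite the encoding pseudo-attribute so the declaration matches
    # the bytes that are handed to the parser.
    text = re.sub(r"encoding=(['\"])[^'\"]*\1",
                  r"encoding=\g<1>utf-8\g<1>", text, count=1)
    return text.encode("utf-8")

e = "utf-32"
s = ("<?xml version='1.0' encoding=%r?><foo/>" % e).encode(e)

seen = []  # element names reported by the parser
p = expat.ParserCreate()
p.StartElementHandler = lambda name, attrs: seen.append(name)
p.Parse(reencode_utf32_to_utf8(s), True)
```

The parser only ever sees UTF-8 here, so whether it supports UTF-32 no longer matters.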
...
...
Furthermore encoding-detection might be part of the responsibility of
the XML parser, but this decoding phase is totally distinct from the
parsing phase, so why not put the decoding into a common library?
I would not object to that - just to expose it as a codec. Adding it
to the XML library is fine, IMO.
But it does make sense as a codec. The decoding phase of an XML parser
has to turn a byte stream into a unicode stream. That's the job of a codec.
Yes, an XML parser should be able to use UTF-8, UTF-16, UTF-32, etc.
codecs to do the encoding.  There's no need to create a magical
mystery codec to pick out which though.  It's not even sufficient for
XML:

1) round-tripping a file should be done in the original encoding.
Containing the auto-detected encoding within a codec doesn't let you
see what it picked.
2) the encoding may be specified externally from the file/stream[1].
The xml parser needs to handle these out-of-band encodings anyway.
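
Both points can be illustrated with a detection function that lives outside any codec and hands the chosen encoding back to the caller. This is a rough sketch in Python 3, not the proposed codec and not a full implementation of the XML specification's detection rules: it only checks for a BOM and an ASCII-compatible XML declaration, and the names detect_xml_encoding and decode_xml are hypothetical.

```python
import codecs
import re

# BOMs, UTF-32 first: the UTF-32-LE BOM starts with the UTF-16-LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]

# Encoding name inside an XML declaration written in an ASCII-compatible encoding.
_DECL = re.compile(
    br"""<\?xml[^>]*?encoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["']""")

def detect_xml_encoding(data, external=None):
    """Return the name of the encoding to use for the XML bytes in data."""
    if external is not None:
        # Out-of-band information (e.g. a transport header) takes priority.
        return external
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    m = _DECL.match(data)
    if m:
        return m.group(1).decode("ascii")
    return "utf-8"  # XML default when nothing else is specified

def decode_xml(data, external=None):
    """Decode data and return (text, encoding) so the caller sees the choice."""
    encoding = detect_xml_encoding(data, external)
    return data.decode(encoding), encoding

text, enc = decode_xml("<foo/>".encode("utf-32"))
roundtrip = text.encode(enc)  # written back in the original encoding
```

Because the encoding name is returned alongside the text, the caller can write the document back in the encoding it arrived in, and an externally supplied encoding simply bypasses detection.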

[1] http://mail.python.org/pipermail/xml-sig/2004-October/010649.html

-- 
Adam Olsen, aka Rhamphoryncus