[Python-Dev] XML codec?

Thu Nov 8 23:45:51 CET 2007

On 11/8/07, Walter Dörwald <walter at livinglogic.de> wrote:
> Martin v. Löwis wrote:
>
> >> Then how about the suggested "xml-auto-detect"?
> >
> > That is better.
>
> OK.
>
> >>> Then, I'd claim that the problem that the codec solves doesn't really
> >>> exist. IOW, most XML parsers implement the auto-detection of encodings,
> >>> anyway, and this is where architecturally this functionality belongs.
> >> But not all XML parsers support all encodings. The XML codec makes it
> >> trivial to add this support to an existing parser.
> >
> > I would like to question this claim. Can you give an example of a parser
> > that doesn't support a specific encoding
>
> It seems that e.g. expat doesn't support UTF-32:
>
> from xml.parsers import expat
>
> p = expat.ParserCreate()
> e = "utf-32"
> s = (u"<?xml version='1.0' encoding=%r?><foo/>" % e).encode(e)
> p.Parse(s, True)
>
> This fails with:
>
> Traceback (most recent call last):
>    File "gurk.py", line 6, in <module>
>      p.Parse(s, True)
> xml.parsers.expat.ExpatError: not well-formed (invalid token): line 1,
> column 1
>
> Replace "utf-32" with "utf-16" and the problem goes away.
>
> > and where adding such a codec
> > solves that problem?
> >
> > In particular, why would that parser know how to process Python Unicode
> > strings?
>
> It doesn't have to. You can use an XML encoder to reencode the unicode
> string into bytes (forcing an encoding that the parser knows):
>
> import codecs
> from xml.parsers import expat
>
> ci = codecs.lookup("xml-auto-detect")
> p = expat.ParserCreate()
> e = "utf-32"
> s = (u"<?xml version='1.0' encoding=%r?><foo/>" % e).encode(e)
> s = ci.encode(ci.decode(s)[0], encoding="utf-8")[0]
> p.Parse(s, True)
>
> >> Furthermore encoding-detection might be part of the responsibility of
> >> the XML parser, but this decoding phase is totally distinct from the
> >> parsing phase, so why not put the decoding into a common library?
> >
> > I would not object to that - just to expose it as a codec. Adding it
> > to the XML library is fine, IMO.
>
> But it does make sense as a codec. The decoding phase of an XML parser
> has to turn a byte stream into a unicode stream. That's the job of a codec.

Yes, an XML parser should be able to use UTF-8, UTF-16, UTF-32, etc
codecs to do the encoding.  There's no need to create a magical
mystery codec to pick out which though.  It's not even sufficient for
XML:

1) round-tripping a file should be done in the original encoding.
Containing the auto-detected encoding within a codec doesn't let you
see what it picked.
2) the encoding may be specified externally from the file/stream[1].
The xml parser needs to handle these out-of-band encodings anyway.

[2] http://mail.python.org/pipermail/xml-sig/2004-October/010649.html

-- 
Adam Olsen, aka Rhamphoryncus