On 11/8/07, Walter Dörwald <walter@livinglogic.de> wrote:
Martin v. Löwis wrote:
Then how about the suggested "xml-auto-detect"?
That is better.
OK.
Then, I'd claim that the problem that the codec solves doesn't really exist. IOW, most XML parsers implement the auto-detection of encodings, anyway, and this is where architecturally this functionality belongs.

But not all XML parsers support all encodings. The XML codec makes it trivial to add this support to an existing parser.
I would like to question this claim. Can you give an example of a parser that doesn't support a specific encoding
It seems that e.g. expat doesn't support UTF-32:
from xml.parsers import expat

p = expat.ParserCreate()
e = "utf-32"
s = (u"<?xml version='1.0' encoding=%r?><foo/>" % e).encode(e)
p.Parse(s, True)
This fails with:
Traceback (most recent call last):
  File "gurk.py", line 6, in <module>
    p.Parse(s, True)
xml.parsers.expat.ExpatError: not well-formed (invalid token): line 1, column 1
Replace "utf-32" with "utf-16" and the problem goes away.
and where adding such a codec solves that problem?
In particular, why would that parser know how to process Python Unicode strings?
It doesn't have to. You can use an XML encoder to reencode the unicode string into bytes (forcing an encoding that the parser knows):
import codecs
from xml.parsers import expat

ci = codecs.lookup("xml-auto-detect")
p = expat.ParserCreate()
e = "utf-32"
s = (u"<?xml version='1.0' encoding=%r?><foo/>" % e).encode(e)
s = ci.encode(ci.decode(s)[0], encoding="utf-8")[0]
p.Parse(s, True)
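For comparison, here is a rough sketch of the same re-encoding step using only what is already in the standard library, without the proposed "xml-auto-detect" codec. The helper names detect_by_bom and reencode_as_utf8 are made up for this illustration, and the detection is deliberately simplistic: it only looks at a byte order mark and otherwise assumes UTF-8, which is much less than what the XML specification describes.

```python
import codecs
import re
from xml.parsers import expat

# Hypothetical helper table: UTF-32 BOMs come first because the
# UTF-32-LE BOM begins with the same bytes as the UTF-16-LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]

# Matches the encoding pseudo-attribute of an XML declaration.
_DECL = re.compile(r"""(<\?xml[^>]*?encoding=)(["'])[^"']*\2""")


def detect_by_bom(data, default="utf-8"):
    """Guess a codec name from a leading BOM (assumption: no BOM means UTF-8)."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return default


def reencode_as_utf8(data):
    """Decode the bytes, rewrite the declared encoding, and encode as UTF-8."""
    text = data.decode(detect_by_bom(data))
    text = _DECL.sub(r"\g<1>\g<2>utf-8\g<2>", text, count=1)
    return text.encode("utf-8")


s = "<?xml version='1.0' encoding='utf-32'?><foo/>".encode("utf-32")
p = expat.ParserCreate()
p.Parse(reencode_as_utf8(s), True)  # expat now sees UTF-8 input
```

The declaration has to be rewritten along with the bytes, since the parser would otherwise still be told that the input is UTF-32.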
Furthermore encoding-detection might be part of the responsibility of the XML parser, but this decoding phase is totally distinct from the parsing phase, so why not put the decoding into a common library?
I would not object to that - just to expose it as a codec. Adding it to the XML library is fine, IMO.
But it does make sense as a codec. The decoding phase of an XML parser has to turn a byte stream into a unicode stream. That's the job of a codec.
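As a small illustration of what "turning a byte stream into a unicode stream" looks like with the existing codec machinery, here is a sketch using the standard incremental decoder interface. The chunk size of 3 bytes is arbitrary and chosen only to show that the decoder copes with input split in the middle of a character.

```python
import codecs

# Encode a tiny document and cut it into chunks that do not line up
# with the 2-byte UTF-16 code units.
data = "<foo/>".encode("utf-16")
chunks = [data[i:i + 3] for i in range(0, len(data), 3)]

# The incremental decoder buffers incomplete code units between calls.
decoder = codecs.getincrementaldecoder("utf-16")()
pieces = [decoder.decode(chunk) for chunk in chunks]
pieces.append(decoder.decode(b"", final=True))  # flush any buffered state
text = "".join(pieces)
```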
Yes, an XML parser should be able to use UTF-8, UTF-16, UTF-32, etc. codecs to do the encoding. There's no need to create a magical mystery codec to pick out which, though. It's not even sufficient for XML:

1) Round-tripping a file should be done in the original encoding. Containing the auto-detected encoding within a codec doesn't let you see what it picked.
2) The encoding may be specified externally from the file/stream[1]. The XML parser needs to handle these out-of-band encodings anyway.

[1] http://mail.python.org/pipermail/xml-sig/2004-October/010649.html

-- 
Adam Olsen, aka Rhamphoryncus
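To make the two points above concrete, here is a minimal sketch of a plain library function (not a codec) that reports which encoding it picked and accepts an externally supplied one. The name decode_xml and its external_encoding parameter are hypothetical, and the detection logic is a simplification: it checks a BOM, then an ASCII-compatible XML declaration, then falls back to UTF-8.

```python
import codecs
import re

# UTF-32 BOMs first, since the UTF-32-LE BOM starts like the UTF-16-LE one.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]

# Only matches declarations in ASCII-compatible encodings (an assumption).
_DECL = re.compile(
    br"""<\?xml[^>]*?encoding=["']([A-Za-z][A-Za-z0-9._-]*)["']"""
)


def decode_xml(data, external_encoding=None):
    """Return (text, encoding) so the caller can see what was used."""
    if external_encoding is not None:
        # Out-of-band information (e.g. from a transport header) wins.
        encoding = external_encoding
    else:
        encoding = None
        for bom, name in _BOMS:
            if data.startswith(bom):
                encoding = name
                break
        if encoding is None:
            match = _DECL.match(data)
            encoding = match.group(1).decode("ascii") if match else "utf-8"
    return data.decode(encoding), encoding


# Round trip in the original encoding, using the reported name.
original = "<?xml version='1.0' encoding='iso-8859-1'?><foo>\xe9</foo>".encode("iso-8859-1")
text, encoding = decode_xml(original)
roundtripped = text.encode(encoding)
```

Because the encoding comes back as a return value, the caller can write the file out again in the same encoding, which a codec that hides its choice would not allow.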