[Python-Dev] XML codec?
"Martin v. Löwis"
martin at v.loewis.de
Thu Nov 8 19:39:26 CET 2007
> Then how about the suggested "xml-auto-detect"?
That is better.
>> Then, I'd claim that the problem that the codec solves doesn't really
>> exist. IOW, most XML parsers implement the auto-detection of encodings,
>> anyway, and this is where architecturally this functionality belongs.
>
> But not all XML parsers support all encodings. The XML codec makes it
> trivial to add this support to an existing parser.
I would like to question this claim. Can you give an example of a parser
that doesn't support a specific encoding and where adding such a codec
solves that problem?
In particular, why would that parser know how to process Python Unicode
strings?
> Furthermore encoding-detection might be part of the responsibility of
> the XML parser, but this decoding phase is totally distinct from the
> parsing phase, so why not put the decoding into a common library?
I would not object to that, only to exposing it as a codec. Adding it
to the XML library is fine, IMO.
> There's a (currently undocumented) codecs.detect_xml_encoding() in the
> patch. We could document this function and make it public. But if
> there's no codec that uses it, this function IMHO doesn't belong in the
> codecs module. Should this function be available from xml/__init__.py or
> should we put it into something like xml/utils.py?
Either would be fine.
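For illustration, the detection step can live in a plain function with no
codec wrapped around it. The sketch below is not the
codecs.detect_xml_encoding() from the patch: sniff_xml_encoding and its
default argument are hypothetical names, and it only looks at byte-order
marks and at XML declarations written in an ASCII-compatible encoding.

```python
import codecs
import re

# Matches an XML declaration at the very start of the data and captures the
# value of its encoding pseudo-attribute (ASCII-compatible encodings only).
_XML_DECL = re.compile(
    rb'<\?xml\s[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def sniff_xml_encoding(data: bytes, default: str = "utf-8") -> str:
    """Guess the encoding of an XML byte string (hypothetical helper)."""
    # UTF-32 BOMs are checked first because the UTF-32-LE BOM begins with
    # the same two bytes as the UTF-16-LE BOM.
    boms = (
        (codecs.BOM_UTF32_LE, "utf-32"),
        (codecs.BOM_UTF32_BE, "utf-32"),
        (codecs.BOM_UTF8, "utf-8-sig"),
        (codecs.BOM_UTF16_LE, "utf-16"),
        (codecs.BOM_UTF16_BE, "utf-16"),
    )
    for bom, name in boms:
        if data.startswith(bom):
            return name
    match = _XML_DECL.match(data)
    if match:
        return match.group(1).decode("ascii").lower()
    # No BOM and no encoding declaration: fall back to the caller's default.
    return default


doc = b'<?xml version="1.0" encoding="ISO-8859-1"?><a>\xe9</a>'
text = doc.decode(sniff_xml_encoding(doc))
```

A parser front end could call such a function and then decode the bytes
itself, which keeps the detection reusable without registering a codec.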
>> Finally, I think the codec is incorrect. When saving XML to a file
>> (e.g. in a text editor), there should rarely be encoding errors, since
>> one could use character references in many cases.
>
> This requires some intelligent fiddling with the errors attribute of the
> encoder.
Much more than that, I think - you cannot use a character reference
in an XML Name. So the codec would have to parse the output stream
to know whether or not a character reference could be used.
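A small sketch of the problem, using the standard "xmlcharrefreplace" error
handler and xml.etree.ElementTree as the well-formedness check. The helper
names encode_ascii and is_well_formed are made up for this example.

```python
import xml.etree.ElementTree as ET


def encode_ascii(xml_text: str) -> bytes:
    """Encode to ASCII, replacing unencodable characters with &#NNN;."""
    return xml_text.encode("ascii", "xmlcharrefreplace")


def is_well_formed(data: bytes) -> bool:
    """Return True if the parser accepts the data as XML."""
    try:
        ET.fromstring(data)
    except ET.ParseError:
        return False
    return True


# Non-ASCII character in character data: the reference is legal there.
in_content = encode_ascii("<p>caf\u00e9</p>")

# Same character in an element name: the error handler cannot tell the
# difference and emits a reference where the grammar does not allow one.
in_name = encode_ascii("<caf\u00e9/>")

print(in_content, is_well_formed(in_content))
print(in_name, is_well_formed(in_name))
```

The error handler sees only the unencodable character, not its position in
the markup, which is why the codec would need to track the XML structure.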
> Correct, but as long as Python doesn't have an EBCDIC codec, that won't
> help much. Adding *detection* of EBCDIC to detect_xml_encoding() is
> rather simple though.
But it does! cp037 is EBCDIC, and supported by Python.
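As a quick check, encoding the start of an XML declaration with cp037 gives
the EBCDIC byte pattern that the XML specification's autodetection appendix
lists. The function looks_like_ebcdic_xml is a hypothetical helper for
illustration, not part of the patch.

```python
# First four characters of an XML declaration, encoded as EBCDIC (cp037).
EBCDIC_XML_PREFIX = "<?xm".encode("cp037")


def looks_like_ebcdic_xml(data: bytes) -> bool:
    """Return True if data starts like an EBCDIC-encoded XML declaration."""
    return data.startswith(EBCDIC_XML_PREFIX)


declaration = '<?xml version="1.0" encoding="cp037"?><a/>'
encoded = declaration.encode("cp037")

print(EBCDIC_XML_PREFIX.hex())         # 4c6fa794
print(looks_like_ebcdic_xml(encoded))  # True
print(encoded.decode("cp037") == declaration)
```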
Regards,
Martin