[Python-Dev] XML codec?

Walter Dörwald walter at livinglogic.de
Thu Nov 8 12:54:18 CET 2007


Martin v. Löwis wrote:
>> Any comments?
> 
> -1. First, (as already discussed on the tracker,) "xml" is a bad name
> for an encoding. How would you encode "Hello" "in xml"?

Then how about the suggested "xml-auto-detect"?

> Then, I'd claim that the problem that the codec solves doesn't really
> exist. IOW, most XML parsers implement the auto-detection of encodings,
> anyway, and this is where architecturally this functionality belongs.

But not all XML parsers support all encodings. The XML codec makes it
trivial to add this support to an existing parser.

Furthermore, encoding detection might be part of the XML parser's
responsibility, but the decoding phase is totally distinct from the
parsing phase, so why not put the decoding into a common library?

> For a text editor, much more useful than a codec would be a routine
> (say, xml.detect_encoding) which performs the auto-detection.

There's a (currently undocumented) codecs.detect_xml_encoding() in the
patch. We could document this function and make it public. But if
there's no codec that uses it, this function IMHO doesn't belong in the
codecs module. Should this function be available from xml/__init__.py or
should we put it into something like xml/utils.py?
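
To make the discussion concrete, here is a minimal sketch of what such a
detection routine might look like. It is not the function from the patch:
the name sketch_detect_xml_encoding, the default of "utf-8" and the
simplified regular expression are all assumptions made for illustration.
It only handles byte order marks and ASCII-compatible declarations.

```python
import codecs
import re

# Hypothetical helper for illustration; not the implementation in the patch.
# UTF-32 BOMs are checked first because the UTF-32-LE BOM (FF FE 00 00)
# begins with the UTF-16-LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]

# Simplified pattern for the encoding pseudo-attribute of an XML declaration.
_DECL = re.compile(
    rb"""<\?xml[^>]*?\bencoding\s*=\s*(["'])([A-Za-z][A-Za-z0-9._-]*)\1"""
)


def sketch_detect_xml_encoding(data, default="utf-8"):
    """Guess the encoding of the XML document starting with the bytes `data`."""
    # 1. A byte order mark decides the encoding family directly.
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    # 2. Otherwise look for an ASCII-compatible XML declaration.
    match = _DECL.match(data)
    if match:
        return match.group(2).decode("ascii").lower()
    # 3. Without a BOM or declaration, fall back to the default.
    return default
```

A text editor could call this on the first few hundred bytes of a file and
pass the result to bytes.decode(); a complete routine would also have to
cover UTF-16/UTF-32 input without a BOM and incomplete input.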

> Finally, I think the codec is incorrect. When saving XML to a file
> (e.g. in a text editor), there should rarely be encoding errors, since
> one could use character references in many cases.

This requires some intelligent fiddling with the errors attribute of the
encoder.
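
As a rough illustration of the idea (not of what the patch does), Python's
built-in "xmlcharrefreplace" error handler already turns unencodable
characters into numeric character references. The helper name
encode_xml_text below is hypothetical. The "intelligent" part that this
sketch leaves out is that character references are only valid in character
data and attribute values, not in names, comments or the XML declaration.

```python
def encode_xml_text(text, encoding):
    """Encode character data, replacing unencodable characters with
    numeric character references such as &#8364;.

    Hypothetical helper: only safe for text content and attribute
    values, where XML allows character references.
    """
    return text.encode(encoding, errors="xmlcharrefreplace")


# Everything outside ASCII becomes a character reference.
ascii_bytes = encode_xml_text("Grüße €", "ascii")

# With Latin-1 only the euro sign needs a reference.
latin1_bytes = encode_xml_text("Grüße €", "iso-8859-1")
```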

> Also, the XML
> spec talks about detecting EBCDIC, which I believe your implementation
> doesn't.

Correct, but as long as Python doesn't have an EBCDIC codec, that won't
help much. Adding *detection* of EBCDIC to detect_xml_encoding() is
rather simple though.
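
For what it's worth, the detection part could be as small as the following
sketch. It relies on the byte signature 4C 6F A7 94 that the XML
specification lists for "<?xm" in EBCDIC; the function name
looks_like_ebcdic_xml is made up here, and the check says nothing about
which EBCDIC code page is in use.

```python
# "<?xm" in EBCDIC, as listed in the XML spec's autodetection appendix.
EBCDIC_XML_DECL_PREFIX = bytes([0x4C, 0x6F, 0xA7, 0x94])


def looks_like_ebcdic_xml(data):
    """Return True if `data` starts like an EBCDIC-encoded XML declaration.

    Hypothetical helper: it only recognizes the family; the declaration
    itself must still be read to find out the exact code page.
    """
    return data.startswith(EBCDIC_XML_DECL_PREFIX)
```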

Servus,
   Walter


