[Python-Dev] XML codec?

Walter Dörwald walter at livinglogic.de
Fri Nov 9 11:51:38 CET 2007


Martin v. Löwis wrote:

>> ci = codecs.lookup("xml-auto-detect")
>> p = expat.ParserCreate()
>> e = "utf-32"
>> s = (u"<?xml version='1.0' encoding=%r?><foo/>" % e).encode(e)
>> s = ci.encode(ci.decode(s)[0], encoding="utf-8")[0]
>> p.Parse(s, True)
> 
> So how come the document being parsed is recognized as UTF-8?

Because you can force the encoder to use a specified encoding. If you do
this and the unicode string starts with an XML declaration, the encoder
will put the specified encoding into the declaration:

import codecs

e = codecs.getencoder("xml-auto-detect")
print e(u"<?xml version='1.0' encoding='iso-8859-1'?><foo/>",
encoding="utf-8")[0]

This prints:
<?xml version='1.0' encoding='utf-8'?><foo/>

>> OK, so should I put the C code into a _xml module?
> 
> I don't see the need for C code at all.

Doing the bit fiddling for
Modules/_codecsmodule.c::detect_xml_encoding_str() in C felt like the
right thing to do.

Servus,
   Walter



More information about the Python-Dev mailing list