[XML-SIG] Character encodings and expat

Martin v. Loewis martin@loewis.home.cs.tu-berlin.de
Fri, 27 Oct 2000 23:24:56 +0200


> Yup.  I plan to teach xmlproc the IANA registry, so that this should
> not be a problem with xmlproc.

With all due respect, I hope this is not the way it is done. Instead,
I think codecs.lookup should know the IANA registry. It may be that
this information comes only with PyXML for now, but it should be
available to all Python applications. E.g. xml/__init__.py could
do

codecs.register(iana_lookup)

where iana_lookup simply maps encodings to the "normalized" form.
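
A minimal sketch of such a search function (the alias table and names
below are illustrative only; a real table would be generated from the
IANA character set registry and would be far larger):

    import codecs

    # Hypothetical alias table -- a real one would be generated from
    # the IANA character set registry.
    IANA_TO_PYTHON = {
        "us_ascii":    "ascii",
        "iso_8859_1":  "latin-1",
        "csisolatin1": "latin-1",
    }

    def iana_lookup(encoding):
        # Normalize the name the way the encodings package does
        # (lower case, hyphens to underscores), map the IANA alias
        # to a codec Python already knows, and delegate to the
        # standard lookup machinery.
        key = encoding.replace("-", "_").lower()
        name = IANA_TO_PYTHON.get(key)
        if name is None:
            return None        # let other search functions try
        return codecs.lookup(name)

    codecs.register(iana_lookup)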

I agree with MAL that this should eventually end up in Python proper.
In any case, knowing the official aliases should not be restricted to
xmlproc.

> However, it is a problem that Python does not support any of the Far
> East encodings yet.  Does anyone know if there are any plans to change
> that? 

Again, I'd see no problem including Tamito Kajiyama's code in PyXML,
if he wants us to ship it - or we could recommend JapaneseCodecs as a
valuable addition to PyXML; this package also uses the distutils, so
it is quite easy to install.

[using Python codecs in expat]
> I don't think it's really all that difficult.
[...]
> The only possible stumbling block is when expat discovers an XML
> declaration that says something other than "utf-16"...

Wouldn't that be the normal case where encodings other than UTF-8
become interesting? I'd assume that most XML documents that don't use
UTF-8 declare the encoding in the XML declaration, rather than
relying on some higher-level protocol to transmit the encoding
information correctly.

So I'd rather see an approach where expat itself eventually finds out
what the encoding is, and then goes to the application (i.e. the
Python SAX driver) and asks it to convert the input.
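
The conversion itself is cheap once the encoding is known. As an
illustration of the driver-side half (the helper name and the naive
declaration sniffing below are mine, not an existing API), a driver
could recode the document to UTF-8 and tell expat so:

    import re
    import xml.parsers.expat

    # Naive sniffing of the encoding named in the XML declaration.
    # Assumes an ASCII-compatible prolog; a real driver would also
    # handle byte order marks, UTF-16 prologs, etc.
    DECLARED = re.compile(
        rb'^<\?xml[^>]*encoding=["\']([A-Za-z0-9._-]+)["\']')

    def parse_converted(data, char_handler):
        m = DECLARED.match(data)
        if m:
            declared = m.group(1).decode("ascii")
            # Recode to UTF-8 through Python's codec machinery, so
            # expat itself never has to understand the declared
            # encoding.
            data = data.decode(declared).encode("utf-8")
        # An explicit encoding passed to ParserCreate overrides the
        # (now stale) encoding named in the XML declaration.
        parser = xml.parsers.expat.ParserCreate(encoding="utf-8")
        parser.CharacterDataHandler = char_handler
        parser.Parse(data, True)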

Regards,
Martin