[I18n-sig] IANA names for character set encodings?

Bill Janssen janssen@parc.xerox.com
Fri, 8 Feb 2002 15:05:34 PST


Folks,

I've been playing with the charset support in Python 2.x, and I want
to congratulate you on a great addition to the language.  It should
really be more widely advertised!  I think it makes Python the premier
language for string processing.

One thing that puzzles me, though, is the lack of support for the
standard IANA-registered names for the various charsets, as given in
http://www.iana.org/assignments/character-sets.  I notice that the file
encodings/aliases.py (in Python 2.2) does contain a few of these, but
other charsets, such as windows-1256, cannot be referred to by their
standard names -- it's cp1256 in Python.  This is highly counter-intuitive
when parsing HTML, for instance, with "text/plain; charset=windows-1256"
as the media type.
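
To make the problem concrete, here is a minimal sketch of the kind of
workaround a parser ends up carrying around.  The names IANA_FALLBACKS,
charset_from_media_type and lookup_iana are hypothetical helpers of mine,
not part of Python, and the fallback table holds only the one mapping
mentioned above:

```python
import codecs

# Hypothetical fallback table: IANA charset name -> Python codec name.
# Only the mapping discussed in this message is listed.
IANA_FALLBACKS = {
    "windows-1256": "cp1256",
}

def charset_from_media_type(media_type):
    """Return the charset parameter of a media type string, or None."""
    for param in media_type.split(";")[1:]:
        key, _, value = param.partition("=")
        if key.strip().lower() == "charset":
            return value.strip().strip('"').lower()
    return None

def lookup_iana(name):
    """Look up a codec by IANA name, falling back to the table above."""
    try:
        return codecs.lookup(name)
    except LookupError:
        fallback = IANA_FALLBACKS.get(name.lower())
        if fallback is None:
            raise
        return codecs.lookup(fallback)

charset = charset_from_media_type("text/plain; charset=windows-1256")
codec = lookup_iana(charset)
```

If the codec registry already knows the IANA name, the first lookup
succeeds and the table is never consulted; the table only matters on
installations where the name is missing.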

The IANA charset table is fairly easy to parse automatically; see the
tail end of
http://cvs.plkr.org/index.cgi/parser/python/PyPlucker/helper/CharsetMapping.py?rev=HEAD&content-type=text/vnd.viewcvs-markup
for code which does so.
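
As a rough illustration (this is not the PyPlucker code itself), a parser
along these lines would do.  It assumes the plain-text layout of "Name:"
and "Alias:" records; SAMPLE is an abbreviated, illustrative excerpt in
that layout rather than the full registry, and parse_iana_table is a
hypothetical name:

```python
# Abbreviated, illustrative excerpt in the assumed "Name:/Alias:" layout.
SAMPLE = """\
Name: ANSI_X3.4-1968                                   [RFC1345,KXS2]
MIBenum: 3
Source: ECMA registry
Alias: iso-ir-6
Alias: US-ASCII (preferred MIME name)
Alias: csASCII

Name: windows-1256
MIBenum: 2256
Alias: None
"""

def parse_iana_table(text):
    """Map each registered name to the list of its aliases."""
    table = {}
    current = None
    for line in text.splitlines():
        # Split at the first colon only; some names contain colons too.
        field, sep, value = line.partition(":")
        if not sep or not value.strip():
            continue
        # Keep the first token; trailing references or notes are dropped.
        token = value.split()[0]
        field = field.strip()
        if field == "Name":
            current = token
            table[current] = []
        elif field == "Alias" and current is not None and token != "None":
            table[current].append(token)
    return table

table = parse_iana_table(SAMPLE)
```

Each record's first token after "Name:" becomes the key, and an
"Alias: None" line is treated as "no aliases".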

I'd suggest renaming the existing codecs according to their IANA
names, then adding the current names to the aliases list.
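
The alias half of that suggestion can be sketched with the existing alias
table.  This only shows the mechanism, not the renaming itself.  It
assumes that encodings.aliases.aliases is consulted at lookup time and
that its keys are lowercase with hyphens turned into underscores;
add_alias and the name "x-example-arabic" are made up for illustration:

```python
import codecs
import encodings.aliases

def add_alias(alias, codec_name):
    """Register an extra name for an existing codec.

    Assumes alias keys are lowercase with '-' replaced by '_'.
    Must run before the alias is first looked up, since failed
    lookups may be cached by the encodings package.
    """
    key = alias.lower().replace("-", "_")
    encodings.aliases.aliases[key] = codec_name

# "x-example-arabic" is a hypothetical name, used only as a demonstration.
add_alias("x-example-arabic", "cp1256")
info = codecs.lookup("x-example-arabic")
```

After the call, the made-up name resolves to the same codec as cp1256.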

Bill