[I18n-sig] IANA names for character set encodings?
Bill Janssen
janssen@parc.xerox.com
Fri, 8 Feb 2002 15:05:34 PST
Folks,
I've been playing with the charset support in Python 2.x, and I want
to congratulate you on a great addition to the language. It should
really be more widely advertised! I think it makes Python the premier
language for string processing.
One thing that puzzles me, though, is the lack of support for the
standard IANA-registered names for the various charsets, as given in
http://www.iana.org/assignments/character-sets. I notice that the file
encodings/aliases.py (in Python 2.2) does contain a few of these, but
other charsets, like windows-1256, cannot be referred to by their
standard names -- windows-1256 is cp1256 in Python. This is highly
counter-intuitive when parsing HTML, for instance, with
"text/plain; charset=windows-1256" as the media type.
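
A minimal sketch of the lookup in question, assuming a current Python 3
interpreter rather than the 2.2 release discussed here. The helper names
charset_from_media_type and find_codec are made up for illustration;
only codecs.lookup and email.message.Message come from the standard
library. Whether a given IANA name resolves depends on the alias table
shipped with the interpreter, so the lookup is wrapped to report a miss
instead of raising.

```python
import codecs
from email.message import Message


def charset_from_media_type(media_type):
    """Extract the charset parameter from a media type string (hypothetical helper)."""
    msg = Message()
    msg["Content-Type"] = media_type
    # get_content_charset() returns the charset parameter lowercased, or None.
    return msg.get_content_charset()


def find_codec(charset):
    """Return the CodecInfo registered under this name, or None if unknown."""
    try:
        return codecs.lookup(charset)
    except LookupError:
        return None


charset = charset_from_media_type("text/plain; charset=windows-1256")
info = find_codec(charset)
print(charset, "->", info.name if info else "no codec under this name")
```

On an interpreter whose alias table lacks the IANA name, find_codec
returns None and the caller has to translate the name by hand.
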
The IANA charset table is fairly easy to parse automatically; see the
tail end of
http://cvs.plkr.org/index.cgi/parser/python/PyPlucker/helper/CharsetMapping.py?rev=HEAD&content-type=text/vnd.viewcvs-markup
for code which does so.
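
This is not the PyPlucker code linked above, just a rough sketch of what
such a parser can look like. It assumes the registry is laid out as
plain-text records with "Name:" and "Alias:" fields, where "Alias: None"
marks a charset without aliases; the SAMPLE text is an abridged,
illustrative excerpt and parse_charsets is a hypothetical name.

```python
import re

# Abridged, illustrative excerpt in the assumed "Name:/Alias:" record layout.
SAMPLE = """\
Name: ANSI_X3.4-1968                                   [RFC1345,KXS2]
MIBenum: 3
Alias: iso-ir-6
Alias: US-ASCII (preferred MIME name)

Name: windows-1256
Alias: None
"""


def parse_charsets(text):
    """Map each registered name to the list of aliases that follow it."""
    table = {}
    current = None
    for line in text.splitlines():
        # Only the first token after the field label is the charset name;
        # trailing references and remarks are ignored.
        m = re.match(r"(Name|Alias):\s*(\S+)", line)
        if not m:
            continue
        field, value = m.groups()
        if field == "Name":
            current = value
            table[current] = []
        elif current is not None and value != "None":
            table[current].append(value)
    return table


for name, aliases in parse_charsets(SAMPLE).items():
    print(name, aliases)
```
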
I'd suggest renaming the existing codecs according to their IANA
names, then adding the current names to the aliases list.
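
Short of renaming anything in the distribution, extra names can also be
attached from user code. The sketch below assumes a current Python 3 and
uses codecs.register with a small search function; the alias
"x-example-arabic" and the EXTRA_ALIASES table are invented for
illustration and are not IANA registrations.

```python
import codecs

# Hypothetical extra names, keyed in lowercase with underscores.
EXTRA_ALIASES = {"x_example_arabic": "cp1256"}


def search_extra_aliases(name):
    """Codec search function: resolve our extra names to existing codecs."""
    # Depending on the Python version, the name may arrive with hyphens
    # or with underscores, so normalise before the table lookup.
    target = EXTRA_ALIASES.get(name.lower().replace("-", "_"))
    return codecs.lookup(target) if target else None


codecs.register(search_extra_aliases)

print(codecs.lookup("x-example-arabic").name)
```

A search function is only consulted after the built-in one fails to
find the name, so existing codec names keep working unchanged.
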
Bill