[I18n-sig] IANA names for character set encodings?

M.-A. Lemburg mal@lemburg.com
Sat, 09 Feb 2002 00:33:32 +0100

Bill Janssen wrote:
> Folks,
> I've been playing with the charset support in Python 2.x, and I want
> to congratulate you on a great addition to the language.  It should
> really be more widely advertised!  I think it makes Python the premier
> language for string processing.
> One thing that puzzles me, though, is the lack of support for the
> standard IANA-registered names for the various charsets, as given in
> http://www.iana.org/assignments/character-sets.  I notice that the file
> encodings/aliases.py (in Python 2.2) does contain a few of these, but
> other charsets such as windows-1256 cannot be referred to by their
> standard names -- windows-1256 is cp1256 in Python.  This is highly
> counter-intuitive when
> parsing HTML for instance, with "text/plain; charset=windows-1256" as
> the media type.
> The IANA charset table is fairly easy to parse automatically; see the
> tail end of
> http://cvs.plkr.org/index.cgi/parser/python/PyPlucker/helper/CharsetMapping.py?rev=HEAD&content-type=text/vnd.viewcvs-markup
> for code which does so.
> I'd suggest renaming the existing codecs according to their IANA
> names, then adding the current names to the aliases list.

That won't work, since the codecs can be imported as normal modules
under their current names, and renaming them would break that.
However, we could add more aliases for them if needed.
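
As a rough sketch of what adding an alias involves: the alias table is
a plain dictionary in encodings/aliases.py that maps a normalized name
to the codec module name. The snippet below assumes a current Python 3
interpreter, and the alias name used is hypothetical, picked only so
that it cannot clash with an existing entry.

```python
import codecs
import encodings.aliases

# Hypothetical alias, used only for illustration. Keys in the table are
# written in normalized form (lowercase, underscores instead of hyphens).
ALIAS_KEY = "x_example_arabic"

# The value is the name of the codec module inside the encodings package.
# The entry must be added before the first lookup of that name.
encodings.aliases.aliases[ALIAS_KEY] = "cp1256"

# The hyphenated spelling is normalized to the key above during lookup.
info = codecs.lookup("x-example-arabic")
print(info.name)

# Text encoded through the alias matches the codec's own output.
sample = "\u0633\u0644\u0627\u0645"
via_alias = sample.encode("x-example-arabic")
via_module_name = sample.encode("cp1256")
```

The codec module itself keeps its name, so importing encodings.cp1256
directly continues to work.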

Adding all of them seems like overkill though... and cumbersome, e.g.
nobody uses names like ANSI_X3.4-1968 -- us-ascii is the
common name.
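
For the media-type case from the quoted message, a small hand-picked
table may be enough in practice. The following is only a sketch under
assumptions: it targets a current Python 3 interpreter, and both the
helper name codec_for_content_type and the EXTRA_ALIASES table are
hypothetical, not part of the standard library.

```python
import codecs
from email.message import Message

# Hypothetical table holding only the names actually seen in practice,
# rather than the full IANA registry.
EXTRA_ALIASES = {
    "windows-1256": "cp1256",
    "us-ascii": "ascii",
}


def codec_for_content_type(content_type, default="ascii"):
    """Return the Python codec name for a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() returns the charset parameter in lowercase,
    # or the supplied default when the parameter is absent.
    charset = msg.get_content_charset(default)
    try:
        # Names the interpreter already knows need no extra table.
        return codecs.lookup(charset).name
    except LookupError:
        # Fall back to the hand-picked table, then to the default.
        return codecs.lookup(EXTRA_ALIASES.get(charset, default)).name


print(codec_for_content_type("text/plain; charset=windows-1256"))
print(codec_for_content_type("text/plain"))
```

Interpreters that already know a given name never reach the table, so
it only has to cover the gaps.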

Marc-Andre Lemburg
CEO eGenix.com Software GmbH
Company & Consulting:                           http://www.egenix.com/
Python Software:                   http://www.egenix.com/files/python/