As you may have noticed, the latest Unicode snapshot contains a large number of new codecs. Most of them are based on a generic mapping codec which makes adding new codecs a very simple (even automated) task.
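To illustrate why a mapping-based codec is cheap to add, here is a minimal sketch in present-day Python 3 (not the code in the snapshot). It assumes the codec is described by a single byte-to-code-point dictionary; the names `decoding_map`, `encoding_map`, `decode` and `encode` are hypothetical, and the 0x80 entry is made up purely for the example.

```python
import codecs

# Hypothetical decoding map: byte value -> Unicode code point.
# ASCII maps to itself; the 0x80 entry is invented for illustration.
decoding_map = {byte: byte for byte in range(128)}
decoding_map[0x80] = 0x20AC  # EURO SIGN

# The encoding map is simply the inverse of the decoding map.
encoding_map = {code_point: byte for byte, code_point in decoding_map.items()}


def decode(data, errors="strict"):
    """Decode bytes to text by looking each byte up in decoding_map."""
    text, _consumed = codecs.charmap_decode(data, errors, decoding_map)
    return text


def encode(text, errors="strict"):
    """Encode text to bytes by looking each character up in encoding_map."""
    data, _consumed = codecs.charmap_encode(text, errors, encoding_map)
    return data
```

With this shape, a new single-byte codec would differ only in its mapping table, which is what makes generating the modules from the published mapping files feasible.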
I've gotten some feedback on the compatibility of the JPython Unicode implementation (actually the underlying Java one) and the new CPython code. Finn Bock mentioned that Java uses a slightly different naming scheme and also has some differences in the code-page-to-Unicode mappings.
* Could someone provide a list of all default code pages and other encodings that Java supports? Ideally, CPython would provide the same set, IMHO.
So far I've got these encodings:
    ascii.py      cp856.py        iso_8859_6.py
    charmap.py    cp857.py        iso_8859_7.py
    cp037.py      cp860.py        iso_8859_8.py
    cp1006.py     cp861.py        iso_8859_9.py
    cp1250.py     cp862.py        koi8_r.py
    cp1251.py     cp863.py        latin_1.py
    cp1252.py     cp864.py        mac_cyrillic.py
    cp1253.py     cp865.py        mac_greek.py
    cp1254.py     cp866.py        mac_iceland.py
    cp1255.py     cp869.py        mac_latin2.py
    cp1256.py     cp874.py        mac_roman.py
    cp1257.py     iso_8859_10.py  mac_turkish.py
    cp1258.py     iso_8859_13.py  raw_unicode_escape.py
    cp424.py      iso_8859_14.py  unicode_escape.py
    cp437.py      iso_8859_15.py  unicode_internal.py
    cp737.py      iso_8859_2.py   utf_16.py
    cp775.py      iso_8859_3.py   utf_16_be.py
    cp850.py      iso_8859_4.py   utf_16_le.py
    cp852.py      iso_8859_5.py   utf_8.py
    cp855.py
Encoding names map to these module names in the following way:
1. convert all hyphens to underscores
2. convert all chars to lowercase
3. apply an alias dictionary to the resulting name
Thus u"abc".encode('KOI8-R') and u"abc".encode('koi8_r') will result in the same codec being used.
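The three steps can be sketched as follows. This is only an illustration of the rule as described, not the actual lookup code; the function name `module_name` and the contents of `ALIASES` are assumptions made for the example.

```python
# Hypothetical alias table; the real alias dictionary may differ.
ALIASES = {"latin1": "latin_1", "utf8": "utf_8"}


def module_name(encoding, aliases=ALIASES):
    """Map an encoding name to a codec module name."""
    name = encoding.replace("-", "_")  # 1. hyphens -> underscores
    name = name.lower()                # 2. lowercase
    return aliases.get(name, name)     # 3. apply aliases, if any
```

Both 'KOI8-R' and 'koi8_r' normalize to the same module name here, which is why they select the same codec.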
* There's also another issue: code pages with names cpXXXX come from two sources: IBM and MS. Unfortunately, some of these pages don't match even though they carry the same name.
Could someone verify whether the included maps work as intended on Windows, DOS and Mac platforms? (Finn reported some divergence between the Java view of things and the maps I created from the ones on the ftp.unicode.org site.)
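One way to check for such divergence is to decode every byte value with two codecs and list where they disagree. The sketch below is written in present-day Python 3 and assumes both names refer to single-byte codecs that the interpreter can look up; `diff_codecs` is a hypothetical helper, not part of the snapshot.

```python
def diff_codecs(name_a, name_b):
    """Return (byte, char_a, char_b) for each byte the two codecs decode differently.

    Only meaningful for single-byte encodings. Bytes a codec leaves
    undefined decode to U+FFFD because of errors="replace".
    """
    differences = []
    for value in range(256):
        raw = bytes([value])
        char_a = raw.decode(name_a, errors="replace")
        char_b = raw.decode(name_b, errors="replace")
        if char_a != char_b:
            differences.append((value, char_a, char_b))
    return differences


if __name__ == "__main__":
    # Example: compare two of the code pages listed above.
    for value, char_a, char_b in diff_codecs("cp437", "cp850"):
        print("0x%02X: %r vs %r" % (value, char_a, char_b))
```

The same loop could be pointed at a platform's native conversion routine instead of a second codec, which is the comparison that actually matters for the IBM versus MS question.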