Open questions (a sketch of the current codecs.open behavior follows this list):
- If an encoding is specified, should file.read() then always return Unicode objects?
- If an encoding is specified, should file.write() only accept Unicode objects and not bytestrings?
- Is the encoding attribute mutable? (I would prefer not, but then how to apply an encoding to sys.stdout?)
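As a minimal sketch of how the existing codecs.open wrapper already behaves on the first two points (assuming a current CPython; "notes.txt" is just a throwaway example file): with an encoding given, read() hands back Unicode and write() expects it.

    import codecs

    # With an encoding specified, the wrapped file deals only in Unicode text.
    with codecs.open("notes.txt", "w", encoding="utf-8") as f:
        f.write(u"gr\u00fc\u00dfe\n")   # a Unicode string is accepted
        # f.write(b"raw bytes")         # a bytestring raises TypeError here

    with codecs.open("notes.txt", "r", encoding="utf-8") as f:
        data = f.read()
        print(type(data))               # Unicode text comes back, not raw bytes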
Right now, codecs.open returns an instance of codecs.StreamReaderWriter, not a native file object. It has methods that look like the ones on a file, but they typically accept or return Unicode strings instead of binary ones. This feels right to me and is what Java does; if you want to switch encoding on sys.stdout, you are not really doing anything to the file object, just switching the wrapper you use. There is much discussion on the i18n sig about 'unifying' binary and Unicode strings at the moment.
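For the sys.stdout case, here is a small illustration of "switch the wrapper, not the file", using the codecs.getwriter factory (written against a modern Python 3 interpreter, where the underlying byte stream is spelled sys.stdout.buffer); the stream object itself is never mutated, only the encoding layer in front of it changes:

    import codecs
    import sys

    # Two independent wrappers around the same byte stream.
    utf8_out = codecs.getwriter("utf-8")(sys.stdout.buffer)
    latin1_out = codecs.getwriter("latin-1")(sys.stdout.buffer)

    utf8_out.write(u"caf\u00e9\n")      # accented text emitted as UTF-8 bytes
    utf8_out.flush()
    latin1_out.write(u"caf\u00e9\n")    # same text, encoded as Latin-1 this time
    latin1_out.flush()

    # Neither wrapper is a native file object:
    print(isinstance(utf8_out, codecs.StreamWriter))   # True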
Side question: I noticed that the Lib/encodings directory supports quite a few code pages, including Greek and Russian, but there are no ISO-2022 CJK or JIS codecs. Is this just because no one felt like writing one, or is there a reason not to include one? It seems to me it might be nice to include some codecs for the most common CJK encodings -- that recent note on the popularity of Python in Korea comes to mind.
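On the side question, one way to see what a given interpreter actually ships is to probe codecs.lookup, which raises LookupError for anything the encodings package cannot resolve. (The CJK codecs named below were not part of the standard distribution when this was written; the snippet assumes a modern interpreter.)

    import codecs

    # Probe a few common CJK encoding names; LookupError means the codec
    # is not available in this interpreter's encodings package.
    for name in ("iso-2022-jp", "shift_jis", "euc-jp", "euc-kr", "gb2312", "big5"):
        try:
            codecs.lookup(name)
            print("%-12s available" % name)
        except LookupError:
            print("%-12s missing" % name)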
There have been 3 contributions of Asian codecs on the i18n sig in the last six months (pythoncodecs.sourceforge.net): one Chinese, two Japanese and one Korean. However, some authors are uncomfortable with Python-style licenses, and the codecs need tying together into one integrated package with a test suite. After a 5-month-long project which tied me up, I have finally started looking at this. The general feeling was that the Asian codecs package should be an optional download, but if we can get them fully tested and do some compression magic, it would be nice to get them in the box one day. - Andy Robinson