[I18n-sig] Random thoughts on Unicode and Python

M.-A. Lemburg mal@lemburg.com
Sun, 11 Feb 2001 14:22:53 +0100

Tom Emerson wrote:
> Andy Robinson writes:
> > (1) user defined characters:  the big three Japanese encodings
> > use the Kuten space of 94x94 characters. There are lots of slight
> > vendor variations on the basic JIS0208 character set, as well
> > as people adding new Gaiji in their office workgroups. Generic
> > conversion routines from, say, EUC to Shift-JIS still work
> > perfectly whether you use Shift-JIS, cp932, or cp932 plus
> > ten extra in-house characters.  Conversions to Unicode involve
> > selecting new codecs, or even making new ones, for all these
> > situations.
> There is no reason that we couldn't provide a set of unified codecs
> for EUC-JP, Shift JIS, ISO-2022-JP, and CP932 that provide appropriate
> mappings between the EUDC sections in the legacy character sets and
> the PUA of Unicode, such that these conversions work.
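
As a rough illustration of what such a mapping could look like, here is a
minimal sketch for the Shift-JIS side only. It assumes the layout used by
the Microsoft CP932 table, as far as I know it: lead bytes 0xF0-0xF9 with
188 valid trail bytes each, laid out linearly from U+E000. The function
names are made up for this example and are not part of any existing codec.

```python
PUA_BASE = 0xE000        # first code point of the Unicode Private Use Area
TRAILS_PER_LEAD = 188    # valid trail bytes per lead byte: 0x40-0x7E, 0x80-0xFC
UDC_LEAD_MIN, UDC_LEAD_MAX = 0xF0, 0xF9   # assumed user-defined lead-byte range


def sjis_udc_to_pua(lead: int, trail: int) -> str:
    """Map one user-defined Shift-JIS byte pair to a PUA character."""
    if not UDC_LEAD_MIN <= lead <= UDC_LEAD_MAX:
        raise ValueError("lead byte outside the user-defined range")
    if 0x40 <= trail <= 0x7E:
        offset = trail - 0x40
    elif 0x80 <= trail <= 0xFC:
        offset = trail - 0x41          # 0x7F is not a valid trail byte
    else:
        raise ValueError("invalid trail byte")
    return chr(PUA_BASE + TRAILS_PER_LEAD * (lead - UDC_LEAD_MIN) + offset)


def pua_to_sjis_udc(ch: str) -> bytes:
    """Inverse mapping: PUA character back to the Shift-JIS byte pair."""
    index = ord(ch) - PUA_BASE
    lead_count = UDC_LEAD_MAX - UDC_LEAD_MIN + 1
    if not 0 <= index < lead_count * TRAILS_PER_LEAD:
        raise ValueError("character outside the mapped PUA range")
    row, offset = divmod(index, TRAILS_PER_LEAD)
    trail = offset + 0x40 if offset < 63 else offset + 0x41
    return bytes([UDC_LEAD_MIN + row, trail])
```

Because the mapping is a plain bijection, in-house characters survive a
round trip through Unicode without the codec knowing what they look like.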

> > (2) slightly corrupt data: Let's say you are dealing with files
> > or database fields containing some truncated kanji.  If you
> > use 8-bit-clean strings and no conversion, the data will not
> > be corrupted or changed; if you try to magically convert
> > it to Unicode you will get error messages or possibly even
> > more corruption.  Maybe you're writing an app whose job is
> > to get text from machine A to machine B without changing it;
> > suddenly it will stop working.  I know people who spent
> > weeks debugging a VB print spooler which was cutting up
> > Postscript files containing kanji.
> Yes, this is a problem that I cannot suggest a good answer to: reality
> raises its ugly head.
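
To make the failure mode concrete, here is a small sketch (Python 3 syntax
assumed, using the standard "shift_jis" codec). The sample text and the
variable names are only for illustration.

```python
# Two kanji encoded as Shift-JIS, with the last byte cut off.
original = "\u6f22\u5b57".encode("shift_jis")   # 4 bytes, 2 per character
truncated = original[:-1]                       # ends in a dangling lead byte

# 8-bit-clean handling: the bytes are copied through untouched.
forwarded = bytes(truncated)

# Strict conversion to Unicode refuses the incomplete sequence.
try:
    truncated.decode("shift_jis")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# Lenient conversion succeeds, but the round trip no longer yields
# the bytes that came in.
lossy = truncated.decode("shift_jis", errors="replace")
round_tripped = lossy.encode("shift_jis", errors="replace")
```

The pass-through copy is byte-for-byte identical, while the decode/encode
round trip either fails or silently changes the data.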

We won't be introducing new magic...

> > Suddenly upgrading to a new version of Python where all
> > your data undergoes invisible transformations to Unicode
> > and back is going to cause trouble for quite a few people.
> Absolutely.

...and the move will be a slow one for sure :-)

I think that a lot of small steps are required to finally get
there and I don't want to rush anything. Still, I believe that
talking about all this now is not such a bad idea, even though
it may cause some concern about the future direction of Python.

Python's history has shown that the developers have always tried 
to maintain backward compatibility wherever possible and feasible.
This won't change, since it is one of the most important factors 
in Python's success story and there are enough people on python-dev
who care about this a lot.

Marc-Andre Lemburg
Company:                                        http://www.egenix.com/
Consulting:                                    http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/