[I18n-sig] Random thoughts on Unicode and Python

Tom Emerson tree@basistech.com
Sat, 10 Feb 2001 23:06:01 -0500


Andy Robinson writes:
> (1) user defined characters:  the big three Japanese encodings
> use the Kuten space of 94x94 characters. There are lots of slight
> vendor variations on the basic JIS0208 character set, as well
> as people adding new Gaiji in their office workgroups. Generic
> conversion routines from, say, EUC to Shift-JIS still work
> perfectly whether you use Shift-JIS, cp932, or cp932 plus
> ten extra in-house characters.  Conversions to Unicode involve
> selecting new codecs, or even making new ones, for all these
> situations.
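
For reference, the generic conversion described here is pure arithmetic
on the 94x94 grid, which is why it does not care whether a cell holds a
standard character, a vendor extension, or an in-house gaiji. The
following is a minimal sketch in present-day Python, not code from any
existing codec; the function names are hypothetical, and half-width
katakana (lead byte 0x8E) and JIS X 0212 (lead byte 0x8F) are
deliberately left out.

```python
def euc_to_sjis_pair(e1, e2):
    """Convert one two-byte EUC-JP code to Shift-JIS by arithmetic alone."""
    if not (0xA1 <= e1 <= 0xFE and 0xA1 <= e2 <= 0xFE):
        raise ValueError("not a two-byte EUC-JP code in the 94x94 space")
    # Strip the high bit to get the 7-bit JIS bytes (0x21..0x7E each).
    j1, j2 = e1 - 0x80, e2 - 0x80
    # Two JIS rows share one Shift-JIS lead byte.
    s1 = ((j1 - 0x21) >> 1) + 0x81
    if s1 > 0x9F:
        s1 += 0x40          # skip the single-byte katakana range
    if j1 & 1:              # odd row: first half of the trail-byte range
        s2 = j2 + 0x1F
        if s2 >= 0x7F:
            s2 += 1         # 0x7F is never used as a trail byte
    else:                   # even row: second half
        s2 = j2 + 0x7E
    return s1, s2


def euc_to_sjis(data):
    """Convert EUC-JP bytes (ASCII plus two-byte codes) to Shift-JIS."""
    out = bytearray()
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:
            out.append(b)
            i += 1
        elif 0xA1 <= b <= 0xFE and i + 1 < len(data):
            out.extend(euc_to_sjis_pair(b, data[i + 1]))
            i += 2
        else:
            raise ValueError("unsupported or truncated byte at offset %d" % i)
    return bytes(out)
```

No mapping table is consulted, so a code point in a row that JIS X 0208
leaves unassigned (row 13, say) converts just as readily as a standard
kanji.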

There is no reason we couldn't provide a set of unified codecs
for EUC-JP, Shift-JIS, ISO-2022-JP, and CP932 that map the EUDC
sections of the legacy character sets to and from the PUA of Unicode,
so that user-defined characters survive these conversions.
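
As a rough illustration of the kind of mapping I mean, here is a sketch
for the Shift-JIS side only. It assumes the layout used by CP932, where
the user-defined area occupies lead bytes 0xF0-0xF9 and is laid out
linearly onto U+E000-U+E757; the function names are made up for this
example, and a real unified codec would need the matching rule for the
EUC-JP and ISO-2022-JP user-defined rows as well.

```python
PUA_BASE = 0xE000          # start of the Unicode Private Use Area
CELLS_PER_LEAD = 188       # trail bytes 0x40-0x7E and 0x80-0xFC
EUDC_LEADS = 10            # lead bytes 0xF0-0xF9


def sjis_eudc_to_pua(lead, trail):
    """Map a Shift-JIS user-defined code to a PUA code point."""
    if not 0xF0 <= lead <= 0xF9:
        raise ValueError("lead byte outside the user-defined area")
    if not (0x40 <= trail <= 0x7E or 0x80 <= trail <= 0xFC):
        raise ValueError("invalid trail byte")
    offset = trail - 0x40
    if trail >= 0x80:
        offset -= 1        # 0x7F is skipped in the trail-byte range
    return PUA_BASE + (lead - 0xF0) * CELLS_PER_LEAD + offset


def pua_to_sjis_eudc(code_point):
    """Inverse mapping: PUA code point back to (lead, trail) bytes."""
    index = code_point - PUA_BASE
    if not 0 <= index < EUDC_LEADS * CELLS_PER_LEAD:
        raise ValueError("code point outside the mapped PUA range")
    lead_offset, offset = divmod(index, CELLS_PER_LEAD)
    trail = offset + 0x40
    if trail >= 0x7F:
        trail += 1         # step over the unused 0x7F
    return 0xF0 + lead_offset, trail
```

Because the mapping is a bijection over the whole user-defined area,
gaiji go out to Unicode and come back to the same bytes, whatever glyph
a given office has assigned to them.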

> (2) slightly corrupt data: Let's say you are dealing with files
> or database fields containing some truncated kanji.  If you
> use 8-bit-clean strings and no conversion, the data will not
> be corrupted or changed; if you try to magically convert
> it to Unicode you will get error messages or possibly even
> more corruption.  Maybe you're writing an app whose job is
> to get text from machine A to machine B without changing it;
> suddenly it will stop working.  I know people who spent
> weeks debugging a VB print spooler which was cutting up
> Postscript files containing kanji.

Yes, this is a problem that I cannot suggest a good answer to: reality
raises its ugly head.
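
To make the failure concrete, here is a small sketch in present-day
Python (the helper name is hypothetical). It uses the standard
shift_jis codec and a string whose last kanji has lost its trail byte.
Copying the bytes untouched preserves them, while a round trip through
Unicode either raises an error or, with a lenient error handler, hands
back different data.

```python
# Two kanji in Shift-JIS, with the final trail byte cut off.
truncated = "漢字".encode("shift_jis")[:-1]


def roundtrip_via_unicode(data, encoding="shift_jis", errors="strict"):
    """Decode to Unicode and encode back, as an implicit conversion would."""
    return data.decode(encoding, errors).encode(encoding, errors)


# 8-bit-clean handling: the damaged data passes through unchanged.
passed_through = bytes(truncated)

# Strict conversion refuses the dangling lead byte.
try:
    roundtrip_via_unicode(truncated)
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# Lenient conversion succeeds, but the bytes that come back are not
# the bytes that went in.
lossy = roundtrip_via_unicode(truncated, errors="replace")
```

Neither outcome is acceptable for a program whose only job is to move
the bytes from machine A to machine B.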

> Suddenly upgrading to a new version of Python where all
> your data undergoes invisible transformations to Unicode
> and back is going to cause trouble for quite a few people.

Absolutely.

    -tree

-- 
Tom Emerson                                          Basis Technology Corp.
Stringologist                                      http://www.basistech.com
  "Beware the lollipop of mediocrity: lick it once and you suck forever"