[I18n-sig] Random thoughts on Unicode and Python
Tom Emerson
tree@basistech.com
Sat, 10 Feb 2001 23:06:01 -0500
Andy Robinson writes:
> (1) user defined characters: the big three Japanese encodings
> use the Kuten space of 94x94 characters. There are lots of slight
> vendor variations on the basic JIS0208 character set, as well
> as people adding new Gaiji in their office workgroups. Generic
> conversion routines from, say, EUC to Shift-JIS still work
> perfectly whether you use Shift-JIS, cp932, or cp932 plus
> ten extra in-house characters. Conversions to Unicode involve
> selecting new codecs, or even making new ones, for all these
> situations.
There is no reason we couldn't provide a set of unified codecs for
EUC-JP, Shift JIS, ISO-2022-JP, and CP932 that map the EUDC
(end-user-defined character) sections of the legacy character sets to
the Private Use Area (PUA) of Unicode, so that these conversions keep
working.
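
To make the idea concrete, here is a rough sketch of the arithmetic for
one case only: the Shift JIS user-defined rows (lead bytes 0xF0-0xF9)
mapped onto U+E000 upward, following the convention Microsoft uses for
CP932. The function names are hypothetical and this is not an existing
codec; it is written for present-day Python 3 and ignores EUC-JP and
ISO-2022-JP, which would need their own tables.

```python
# Sketch only: follows the CP932 convention of mapping the Shift JIS
# user-defined rows (lead bytes 0xF0..0xF9) onto U+E000..U+E757.
# Function names are hypothetical, not part of any real codec.
PUA_BASE = 0xE000
CELLS_PER_ROW = 188  # trail bytes 0x40-0x7E (63) plus 0x80-0xFC (125)
EUDC_ROWS = 10       # lead bytes 0xF0 through 0xF9


def sjis_eudc_to_pua(lead, trail):
    """Map a Shift JIS user-defined byte pair to a PUA code point."""
    if not 0xF0 <= lead <= 0xF9:
        raise ValueError("lead byte outside the user-defined range")
    if 0x40 <= trail <= 0x7E:
        cell = trail - 0x40
    elif 0x80 <= trail <= 0xFC:
        cell = trail - 0x41  # 0x7F is skipped in Shift JIS trail bytes
    else:
        raise ValueError("invalid trail byte")
    return PUA_BASE + CELLS_PER_ROW * (lead - 0xF0) + cell


def pua_to_sjis_eudc(code_point):
    """Inverse mapping: PUA code point back to a (lead, trail) pair."""
    offset = code_point - PUA_BASE
    if not 0 <= offset < EUDC_ROWS * CELLS_PER_ROW:
        raise ValueError("code point outside the mapped PUA range")
    row, cell = divmod(offset, CELLS_PER_ROW)
    trail = cell + 0x40 if cell < 63 else cell + 0x41
    return 0xF0 + row, trail


print(hex(sjis_eudc_to_pua(0xF0, 0x40)))  # 0xe000
print(hex(sjis_eudc_to_pua(0xF9, 0xFC)))  # 0xe757
```

Because the mapping is pure arithmetic and invertible, in-house gaiji
in that range would survive a trip through Unicode and back without any
per-site table.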
> (2) slightly corrupt data: Let's say you are dealing with files
> or database fields containing some truncated kanji. If you
> use 8-bit-clean strings and no conversion, the data will not
> be corrupted or changed; if you try to magically convert
> it to Unicode you will get error messages or possibly even
> more corruption. Maybe you're writing an app whose job is
> to get text from machine A to machine B without changing it;
> suddenly it will stop working. I know people who spent
> weeks debugging a VB print spooler which was cutting up
> Postscript files containing kanji.
Yes, this is a problem that I cannot suggest a good answer to: reality
raises its ugly head.
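
For what it's worth, the failure Andy describes is easy to reproduce.
The sketch below (present-day Python 3, made-up sample text,
hypothetical function names) cuts the last kanji of a Shift JIS string
in half: copying the bytes untouched works, while a round trip through
Unicode raises an error.

```python
# Sketch only: sample data and function names are made up.
original = "日本語".encode("shift_jis")  # three kanji, two bytes each
truncated = original[:-1]                # last kanji cut in half


def pass_through(data):
    """Move bytes from A to B without interpreting them."""
    return bytes(data)


def via_unicode(data, encoding="shift_jis"):
    """Decode to Unicode and re-encode, as an implicit conversion would."""
    return data.decode(encoding).encode(encoding)


# The 8-bit-clean path leaves the damaged data exactly as it was.
assert pass_through(truncated) == truncated

# The Unicode round trip cannot decode the dangling lead byte.
try:
    via_unicode(truncated)
except UnicodeDecodeError as exc:
    print("conversion failed:", exc.reason)
```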
> Suddenly upgrading to a new version of Python where all
> your data undergoes invisible transformations to Unicode
> and back is going to cause trouble for quite a few people.
Absolutely.
-tree
--
Tom Emerson Basis Technology Corp.
Stringologist http://www.basistech.com
"Beware the lollipop of mediocrity: lick it once and you suck forever"