[I18n-sig] Changing case
Thu, 13 Apr 2000 02:30:50 +0900
* M.-A. Lemburg:
| > > To make all this work without too many hassles we'd need
| > > (at least the most commonly used) CJKV codecs in the core
| > > distribution. How big would these be ? Would someone contribute
| > > them... Tamito ?
* Andy Robinson:
| > He may be at home by now, but he indicated to me that he was
| > happy for them to be used in any way. The nice things about
| > his codecs are
| > (a) one could extract the mapping tables for other codecs
| > from data at www.unicode.org and use a very similar
| > approach.
In fact, I generated the mappings in my Japanese codecs using
simple Python scripts based on the mapping table provided by
The version I used is 0.9 (8 March 1994). The accuracy of
the mappings is entirely due to the authors of the original
mapping table, not me ;)
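
To illustrate the kind of script meant here, the following is a
minimal sketch in present-day Python, not the actual generator.
It assumes a table whose data lines hold two hexadecimal
columns (legacy code, then Unicode code point) followed by a
"#" comment; the names SAMPLE_TABLE and parse_mapping are
hypothetical.

```python
# Hypothetical sketch: build decode/encode dictionaries from a
# mapping table. The two-column layout below is an assumption.
SAMPLE_TABLE = """\
# legacy code <TAB> Unicode code point <TAB> # comment
0x2121\t0x3000\t# IDEOGRAPHIC SPACE
0x2122\t0x3001\t# IDEOGRAPHIC COMMA
0x3021\t0x4E9C\t# <CJK>
"""


def parse_mapping(text):
    """Return (decoding, encoding) dicts parsed from table text."""
    decoding = {}
    for line in text.splitlines():
        # Drop the trailing comment and skip blank/comment-only lines.
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        fields = line.split()
        legacy, ucs = int(fields[0], 16), int(fields[1], 16)
        decoding[legacy] = ucs
    # The reverse table is derived, so the two cannot disagree.
    encoding = {ucs: legacy for legacy, ucs in decoding.items()}
    return decoding, encoding


decoding_map, encoding_map = parse_mapping(SAMPLE_TABLE)
```

Deriving the encoding dictionary from the decoding one keeps
the round trip consistent as long as the table is one-to-one.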
| > (b) the mappings may be 168k, but they at least zip nicely.
| > I'm guessing at 5-6 such codecs in the distribution
| > initially.
Thanks for considering the size. I personally consider the
size issue less important than the speed issue, though.
| > (c) the algorithmic bit can be accelerated later in C or our
| > vaporware state machine, and nobody needs to change
| > any interfaces.
| > (d) if we slightly parameterise his codecs so that one could
| > substitute a different mapping table if needed, then
| > all the corporate variations just need to create a
| > new dictionary with the deltas - Microsoft Code Page
| > 932 would not be another 168k, but just a few k and
| > could build its mapping on the fly.
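
Point (d) could look roughly like the sketch below, written in
present-day Python. It is only an illustration of the idea:
BASE_DECODING is a tiny stand-in for the full table,
CP932_DELTAS holds a single illustrative difference (the wave
dash position), and make_variant is a hypothetical helper, not
part of any existing codec.

```python
# Tiny stand-in for the full base table (legacy code -> Unicode).
BASE_DECODING = {
    0x2141: 0x301C,  # WAVE DASH in the base mapping
    0x3021: 0x4E9C,  # a CJK ideograph, identical in both variants
}

# Illustrative delta only: a vendor variant that maps the same
# position to FULLWIDTH TILDE instead of WAVE DASH.
CP932_DELTAS = {
    0x2141: 0xFF5E,
}


def make_variant(base, deltas):
    """Build a variant table on the fly without touching the base."""
    variant = dict(base)    # shallow copy of the shared base table
    variant.update(deltas)  # apply only the entries that differ
    return variant


cp932_decoding = make_variant(BASE_DECODING, CP932_DELTAS)
```

Only the delta dictionary has to be shipped per variant; the
cost is one dictionary copy when the variant is first needed.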
| > However, I suspect putting it in the core for June 1st may
| > be too aggressive; if the compiler is going to use them on
| > every source file for a Japanese user, we really want to
| > move from byte-level loops in Python to something much faster.
| Speed is not an issue now: what we need is a good concept
| and some proof-of-concept code to go with it.
I think my pure Python implementation of the Japanese codecs
is a "proof of concept" at most. I ran a simple benchmark on
my codecs: it took about 7 minutes to convert a 7MB Japanese
text file from EUC-JP to EUC-JP via UTF-8. It seems that my
codecs are too slow for most applications. I believe the
char-by-char iteration over strings in EUC-JP and Shift_JIS
needs to be implemented in C.
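
For reference, the char-by-char iteration in question can be
sketched as follows for EUC-JP, in present-day Python. This is
not the code from my codecs; iter_euc_jp_chars is a
hypothetical name, and the sketch only splits the byte string
into characters by lead byte without looking anything up.

```python
def iter_euc_jp_chars(data):
    """Yield each EUC-JP character in `data` as a bytes object."""
    i, n = 0, len(data)
    while i < n:
        lead = data[i]
        if lead < 0x80:
            width = 1   # ASCII
        elif lead == 0x8E:
            width = 2   # SS2: half-width katakana
        elif lead == 0x8F:
            width = 3   # SS3: JIS X 0212
        elif 0xA1 <= lead <= 0xFE:
            width = 2   # JIS X 0208
        else:
            raise ValueError("invalid lead byte 0x%02X at %d" % (lead, i))
        if i + width > n:
            raise ValueError("truncated character at %d" % i)
        yield data[i:i + width]
        i += width


sample = "A\u65e5\u672c".encode("euc_jp")  # "A" plus two kanji
chars = list(iter_euc_jp_chars(sample))
```

A loop of this shape runs once per character in the
interpreter, which is the part that would benefit most from
being moved to C.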
KAJIYAMA, Tamito <email@example.com>