[I18n-sig] Changing case

Tamito KAJIYAMA kajiyama@grad.sccs.chukyo-u.ac.jp
Thu, 13 Apr 2000 02:30:50 +0900


* M.-A. Lemburg:
|
| > > To make all this work without too many hassles we'd need
| > > (at least the most commonly used) CJKV codecs in the core
| > > distribution. How big would these be ? Would someone contribute
| > > them... Tamito ?

* Andy Robinson:
|
| > He may be at home by now, but he indicated to me that he was
| > happy for them to be used in any way.  The nice things about
| > his codecs are
| > (a) one could extract the mapping tables for other codecs
| >     from data at www.unicode.org and use a very similar
| >     approach.

In fact, I generated the mappings in my Japanese codecs using
simple Python scripts based on the mapping table provided by
Unicode Inc.:

ftp://ftp.unicode.org/Public/MAPPINGS/EASTASIA/JIS/JIS0208.TXT

The version I used is 0.9 (8 March 1994).  The correctness of
the mappings is entirely due to the authors of the original
mapping table, not me ;)
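
For what it's worth, the conversion scripts were little more than
a loop over that file.  A minimal sketch (assuming the usual three
hex columns, Shift_JIS / JIS X 0208 / Unicode, with '#' starting a
comment) would look like this:

def build_mapping(path):
    # parse JIS0208.TXT into a JIS X 0208 -> Unicode dictionary
    jis_to_unicode = {}
    for line in open(path).readlines():
        line = line.split('#', 1)[0].strip()    # drop trailing comments
        if not line:
            continue
        sjis, jis, ucs = line.split()[:3]       # e.g. 0x8EA1 0x3021 0x4E9C
        jis_to_unicode[int(jis, 16)] = int(ucs, 16)
    return jis_to_unicode

The encoding direction is just the inverse dictionary built from
the same pass over the file.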

| > (b) the mappings may be 168k, but they at least zip nicely.
| >     I'm guessing at 5-6 such codecs in the distribution
| >     initially.

Thanks for the considerations on size.  Personally, though, I
consider the size issue less important than the speed issue.

| > (c) the algorithmic bit can be accelerated later in C or our
| >     vaporware state machine, and nobody needs to change
| >     any interfaces.
| > (d) if we slightly parameterise his codecs so that one could
| >     substitute a different mapping table if needed, then
| >     all the corporate variations just need to create a
| >     new dictionary with the deltas - Microsoft Code Page
| >     932 would not be another 168k, but just a few k and
| >     could build its mapping on the fly.

Good ideas.
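
To make (d) a bit more concrete, a variant codec could be built
along these lines.  This is only a sketch; the table names and the
sample entries are illustrative, not the real CP932 data.

# base JIS X 0208 -> Unicode table (in reality thousands of entries)
jisx0208_decoding_map = {
    0x2121: 0x3000,   # IDEOGRAPHIC SPACE
    0x2122: 0x3001,   # IDEOGRAPHIC COMMA
}

# only the entries where the vendor variant differs from or extends
# the base table need to be shipped
cp932_deltas = {
    0x2D21: 0x2460,   # CIRCLED DIGIT ONE (NEC extension row, illustrative)
}

def build_decoding_map(base, deltas):
    # overlay the small delta table on the shared base mapping
    mapping = base.copy()
    mapping.update(deltas)
    return mapping

cp932_decoding_map = build_decoding_map(jisx0208_decoding_map,
                                        cp932_deltas)

That way a corporate variant costs only its delta dictionary plus
one call at import time.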

| > However, I suspect putting it in the core for June 1st may
| > be too aggressive; if the compiler is going to use them on
| > every source file for a Japanese user, we really want to
| > move from byte-level loops in Python to something much faster.
| 
| Speed is not an issue now: what we need is a good concept
| and some proof-of-concept code to go with it.

I think my pure Python implementation of Japanese codecs is a
kind of "proof of concept" at most.  I run a simple benchmark
test on my codecs; it took about 7 minutes to convert a 7MB
Japanese text file from EUC-JP to EUC-JP via UTF-8.  It seems
that my codecs are too slow to use for most applications.  I
believe the char-by-char iteration on strings in EUC-JP and
Shift_JIS needs to be implemented in C.
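
To illustrate where the time goes, the decoder ends up doing
something like the following for every character.  This is a
stripped-down sketch in 1.6/2.0-style Python: it assumes a
decoding map keyed on the raw two-byte EUC value and ignores the
half-width katakana (0x8E) and JIS X 0212 (0x8F) cases that a
real EUC-JP codec must handle.

def decode_eucjp(data, decoding_map):
    result = []
    i = 0
    while i < len(data):
        c = ord(data[i])
        if c < 0x80:
            # plain ASCII byte
            result.append(unichr(c))
            i = i + 1
        else:
            # lead byte of a two-byte JIS X 0208 sequence
            key = (c << 8) | ord(data[i+1])
            result.append(unichr(decoding_map[key]))
            i = i + 2
    return u''.join(result)

Interpreting that loop one byte at a time in Python is what makes
the 7MB round trip take minutes; the same loop in C should be
dominated by I/O instead.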

Best regards,

-- 
KAJIYAMA, Tamito <kajiyama@grad.sccs.chukyo-u.ac.jp>