[I18n-sig] JapaneseCodecs 1.4 released

Tamito KAJIYAMA kajiyama@grad.sccs.chukyo-u.ac.jp
Wed, 26 Sep 2001 00:38:13 +0900


Hi all,

I released JapaneseCodecs version 1.4.  The source tarball is
available at the following location:

  http://pseudo.grad.sccs.chukyo-u.ac.jp/~kajiyama/python/

The major enhancement in this release is a set of new codecs
written in C.  The performance gains in both speed and storage
size are impressive, as described below.  Please check it out!

Here are the results of a simple benchmark test that encodes a
Unicode string and then decodes it back.  The new codecs written
in C are much faster than the old codecs written in Python
(times are in seconds; a sketch of the benchmark is shown after
the tables).

  a Unicode string of 10,000 chars
                          in Python   in C
  japanese.euc-jp         1.074       0.003859
  japanese.shift_jis      1.059       0.003981
  japanese.iso-2022-jp    0.842       0.007737
  
  a Unicode string of 100,000 chars
                          in Python   in C
  japanese.euc-jp         11.54       0.02978
  japanese.shift_jis      11.55       0.03047
  japanese.iso-2022-jp    8.345       0.06522

  a Unicode string of 1,000,000 chars
                          in Python   in C
  japanese.euc-jp         126.7       0.2259
  japanese.shift_jis      125.9       0.2276
  japanese.iso-2022-jp    82.87       0.5892
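
For reference, here is a minimal sketch of such a benchmark in
the Python 2 syntax of the day.  The sample string is arbitrary,
and the script assumes that JapaneseCodecs is installed so that
the codec names resolve:

  import time

  def bench(encoding, size):
      # Build a Unicode string of Japanese characters
      # (an arbitrary hiragana sample, repeated).
      u = u"\u3042\u3044\u3046\u3048\u304a" * (size / 5)
      t0 = time.time()
      s = u.encode(encoding)    # Unicode -> bytes
      unicode(s, encoding)      # bytes -> Unicode
      return time.time() - t0

  for enc in ("japanese.euc-jp", "japanese.shift_jis",
              "japanese.iso-2022-jp"):
      for size in (10000, 100000, 1000000):
          print enc, size, bench(enc, size)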

The runtime memory footprint is also reduced drastically.  On a
Linux box of mine, the old codecs written in Python require
3,364K bytes of runtime memory, while the new codecs in C occupy
only 124K bytes.  In addition, the start-up time of the Python
interpreter is much shorter if one of the Japanese codecs is
used as the system default encoding.
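
In case it helps, here is a sketch of the standard Python 2
mechanism for setting the system default encoding (how the codec
name is resolved at start-up depends on how JapaneseCodecs is
registered on your installation):

  # sitecustomize.py -- put this module somewhere on sys.path.
  # site.py imports it at interpreter start-up and then removes
  # sys.setdefaultencoding, so this is the only chance to call it.
  import sys
  sys.setdefaultencoding("japanese.euc-jp")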

I adopted a hashing technique in order to achieve the high
performance in both speed and storage size.  Thanks, Marc-Andre,
for your advice (given in a couple of private messages a long
time ago ;-).

Part of the program in src/_japanese_codecs.c is based on
ms932codec.c written by Atsuo ISHIMOTO.  Some helper functions
are used as-is.  I appreciate his invaluable work.

For developers of possible derived packages: Character mapping
tables in the form of hash tables are in src/_japanese_codecs.h.
This is an auto-generated file; you may want to look at the hash
table generator src/hgen.py and the hash table look-up functions
in src/_japanese_codecs.c (lookup_jis_map() and lookup_ucs_map()).
If you are familiar with writing Python extension modules, you
will be able to apply the code to other character encodings such
as EUC-KR and BIG-5 without trouble.  The hashing function f()
is (charcode % 523); I chose the divisor heuristically (a prime
number greater than 256).  I believe the value 523 works well in
many cases.  In general, the larger the divisor, the faster the
look-up functions run and the bigger the hash tables become (and
vice versa).  Try other prime numbers if the resulting look-up
performance or hash table sizes are not desirable.
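
To illustrate the idea, here is a simplified sketch in Python
(not the actual code in hgen.py or _japanese_codecs.c; the names
build_buckets() and lookup() are hypothetical):

  DIVISOR = 523  # the prime used as the hash divisor

  def build_buckets(mapping):
      # mapping: dict of charcode -> mapped charcode.  Each entry
      # lands in the bucket f(charcode) = charcode % DIVISOR.
      buckets = [[] for i in range(DIVISOR)]
      for code in mapping.keys():
          buckets[code % DIVISOR].append((code, mapping[code]))
      return buckets

  def lookup(buckets, code):
      # Scan the short collision list in the selected bucket.
      for c, mapped in buckets[code % DIVISOR]:
          if c == code:
              return mapped
      return None

A larger divisor spreads the entries over more buckets, so the
collision lists get shorter (faster look-ups) at the cost of a
bigger bucket table, which is the trade-off described above.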

The new codecs in C are very young, and probably have a number
of bugs.  Any kind of feedback is very much appreciated.

Thank you,

-- 
KAJIYAMA, Tamito <kajiyama@grad.sccs.chukyo-u.ac.jp>