[Python-Dev] Adding Japanese Codecs to the distro

Martin v. Löwis martin@v.loewis.de
22 Jan 2003 16:50:56 +0100


"M.-A. Lemburg" <mal@lemburg.com> writes:

> > Right. And we are trying to tell you that this is irrelevant when
> > talking about the size increase to be expected when JapaneseCodecs is
> > incorporated into Python.
> 
> Why is it irrelevant?

Because the size increase you have reported won't be the size increase
observed if JapaneseCodecs is incorporated into Python.

> It's just a hint: mapping tables are all about fast lookup vs. memory
> consumption and that's what Fredrik's approach of decomposition does
> rather well (Tamito already uses such an approach). cdb would provide
> an alternative approach, but there are licensing problems...

The trie approach in unicodedata requires that many indices have equal
entries, and that, when the entries are grouped into fixed-size blocks,
many duplicate blocks appear which can then be shared.
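
As a minimal sketch of that kind of two-level table compression (the
function names are invented for this illustration; the real table
generator in CPython differs in detail):

    def split_blocks(table, shift):
        # Split a flat lookup table into blocks of 2**shift entries and
        # keep only one copy of each distinct block. The first-level
        # "index" array then maps a block number to its shared copy.
        size = 1 << shift
        index, blocks, seen = [], [], {}
        for start in range(0, len(table), size):
            block = tuple(table[start:start + size])
            if block not in seen:
                seen[block] = len(blocks)
                blocks.append(block)
            index.append(seen[block])
        return index, blocks

    def lookup(index, blocks, shift, codepoint):
        # Two array accesses: the first level selects the block, the
        # second level selects the entry within it.
        return blocks[index[codepoint >> shift]][codepoint & ((1 << shift) - 1)]

The scheme only saves space if many of the blocks turn out to be
identical.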

This is not the case for CJK mappings, as there is no inherent
correlation between a code point in some CJK encoding and the
equivalent Unicode code point. In Unicode, the characters have gone
through Han Unification and are ordered according to its principles;
in the other encodings, other ordering principles have been applied
(JIS X 0208, for instance, orders its Level 1 kanji by pronunciation),
and no unification has taken place.
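
This is easy to see by decoding a few consecutive double-byte codes,
assuming a Python with an euc-jp codec available (the JapaneseCodecs
package provides one, and later Pythons ship it):

    # Decode five consecutive EUC-JP codes (row 16, cells 1-5) and show
    # the Unicode code points they map to.
    for cell in range(0xA1, 0xA6):
        ch = bytes([0xB0, cell]).decode("euc-jp")
        print("EUC-JP 0xB0%02X -> U+%04X %s" % (cell, ord(ch), ch))

The five consecutive inputs land at U+4E9C, U+5516, U+5A03, U+963F and
U+54C0, nowhere near each other, so neighbouring table entries share
nothing that a block-based scheme could exploit.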

Insofar as chunks of the encoding are more systematic, the JapaneseCodecs
package already employs algorithmic mappings; see _japanese_codecs.c,
e.g. for the mapping of ASCII, or of the JIS X 0201 halfwidth characters.
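
As a sketch of what such an algorithmic mapping looks like (an
illustration only, not the actual code in _japanese_codecs.c):

    def jisx0201_to_unicode(byte):
        # ASCII range; strict JIS X 0201 differs in two places (0x5C is
        # YEN SIGN, 0x7E is OVERLINE), which is glossed over here.
        if 0x00 <= byte <= 0x7F:
            return byte
        # Halfwidth katakana: a fixed offset maps 0xA1-0xDF onto
        # U+FF61-U+FF9F, so no lookup table is needed at all.
        if 0xA1 <= byte <= 0xDF:
            return byte - 0xA1 + 0xFF61
        raise ValueError("0x%02X is not a JIS X 0201 code" % byte)

No such formula exists for the big JIS X 0208 and 0212 tables, which is
why those have to remain data.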

Regards,
Martin