[I18n-sig] thinking of CJK codec, some questions

M.-A. Lemburg mal@lemburg.com
Thu, 16 Mar 2000 11:21:29 +0100

Brian Takashi Hooper wrote:
> On Wed, 15 Mar 2000 15:36:34 +0100
> "M.-A. Lemburg" <mal@lemburg.com> wrote:
> > Just a few comments about the design (don't have any knowledge
> > about Asian encodings):
> >
> > 1. Keep large mapping tables in single automatically generated C
> >    modules that export a lookup object (ones that define __getitem__).
> >    These could also be generated using some perfect hash table
> >    generator, BTW, to reduce memory consumption.
> After researching perfect hash tables a little and thinking about it a
> little more, a question:  I think this could work well for the decoding
> maps, but for encoding (from Unicode to a legacy encoding), wouldn't I
> have to be able to detect misses in my hash lookup? 

I'd suggest using the same technique as Python: lookup the
hash(key) value and then compare the found entry (key,value) with
the looked up key. Since we are lucky, you can use the identity
function as hash function... keys still are Unicode ordinals,
but now you also store them in the mapping result (some redundance,
but better than putting them together with the keys).

> For example, if I
> had a string in Unicode that I was trying to convert to EUC-JP, and I
> looked up a Unicode character that has no mapping to EUC-JP, with a
> regular hash I my lookup will still succeed and I'll get back an EUC
> character anyway, but the wrong one...  The only way I could think of to
> avoid this would be to store the key as part of the value (or
> alternately some kind of unique checksum), and then after lookup compare
> the original key to the key that was looked up in the table; if they are
> the same, then I've got a valid mapping, and if they are different than
> my lookup failed, and I should return some kind of sentinel value
> (0xFFFF or something?).  Since the Unicode keys are all two bytes
> apiece, and for some of the largest CJK encoding standards the values
> are a max of 4 bytes long (e.g. EUC-TW), I then need to define my
> mapping table as containing values 8 bytes each in length, right?

See above.

> (assuming that I should keep the values of the array aligned along
> machine words)  Is this complication worth the space savings, I wonder?
> I think a table built this way might be a little smaller than an
> unhashed plain old table, since in a three- or four-byte encoding there
> are generally always a lot fewer mapped values than there are spaces in
> the available plane... maybe I should go with mappings as simple array
> first and then figure out how to make them smaller if it seems to matter
> a lot to people?  This is a pretty much a speed vs. space issue.

This is probably the way to go. You can always exchange the
mapping table against some other technique as long as the
interface stays the same.
I think what more important now, is focussing on the encoders
and decoders...

> Opinions?  Does anyone have a cleverer way to detect the validity of a
> hash lookup?

Marc-Andre Lemburg
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/