[I18n-sig] thinking of CJK codec, some questions

Brian Takashi Hooper brian@garage.co.jp
Wed, 15 Mar 2000 17:51:49 +0900


Hi,

> Andy mentioned that it should be possible to write codecs
> which do a couple of smaller switches and implement the other
> mappings using some more intelligent logic.
> 
> The example I gave above has to be seen in the light of using the
> generic mapping codec -- which probably is not very much use in a
> multi-byte encoding world since it currently only supports
> 1-1 mappings.
> 
> I'd suggest going Andy's way for the CJK codecs... Andy ?

I like the idea of an encoding/decoding state machine, and have started
thinking about how this would break down for the CJKV codecs - what
I've got is something like this:

The top-level class interfaces, and the StreamReader/Writer classes as
well, will be in Python. I think we can group the encodings generally
into modal and non-modal schemes (ISO-2022-JP being an example of the
first, and EUC an example of the second), the difference between the
two being largely a matter of how streams are handled.

(Note: Andy, please pipe in if I'm misrepresenting your idea, or even if
I'm not, I'd like to know what you think about all this!)

For the encoders/decoders I like Andy's idea of generalizing out a kind
of 'mini-language' a la mxTextTools for specifying the encoding/decoding
logic separately, and then having a generic engine that can handle
multi-byte mapping tasks.  The main task, then, is to come up with a
generalization that encompasses all of the manipulations necessary to
specify the behavior of the mapping machine:

1. one thing it should definitely be able to do is specify a byte offset
for data in a static table.  So, for example, if I have something like:

static Py_UNICODE euc2unicode[] = { 0x3000, 0x3001, ... };

I should know to start indexing from 0xa1a1 (the first JIS 0208
character, 0x2121, plus 0x8080); that is, EUC 0xa1a2 should be converted
by looking up euc2unicode[1] => 0x3001 in Unicode.
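
In code, the lookup for this first case might be as simple as the
following (a minimal sketch only - the flat, placeholder-padded table
layout and the euc_lookup name are assumptions for illustration):

#include "Python.h"                /* for Py_UNICODE */

extern Py_UNICODE euc2unicode[];   /* laid out linearly from EUC 0xa1a1,
                                      with placeholders for unassigned
                                      positions */

/* Look up the Unicode value for a two-byte EUC-JP code set 1 sequence. */
static Py_UNICODE
euc_lookup(unsigned char b1, unsigned char b2)
{
    unsigned int code = (b1 << 8) | b2;    /* e.g. 0xa1, 0xa2 -> 0xa1a2 */
    return euc2unicode[code - 0xa1a1];     /* euc2unicode[1] == 0x3001 */
}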

2. another thing that it would be good to be able to do, I think, is to
specify which map to look in, so that a character set can be stored in
multiple, non-contiguous static arrays.  Again using the example of EUC,
the code set 2 zone (sequences beginning with 0x8e) should refer to a
different mapping table than the code set 1 zone (the regular JIS 0208
zone for EUC-JP).  The decoder would then be able to say: OK, for a
character in this range, look up the value at this offset into this
mapping table.  For EUC-JP, this would look like:

first byte          look in table             at offset
0x21-0x7e           JIS-Roman -> Unicode      - 0x21
0xa1-0xfe           JIS 0208 -> Unicode       - 0x8080
0x8e                HW Katakana -> Unicode    - 0x8e00
                     (from JIS-Roman)
0x8f                JIS 0212 -> Unicode       - 0x8080 (lookup w/
                                                  second & third bytes)

Actually, looking at this a little more, there should probably be a way
of calculating the map index given some info about the dimensions of
the map, i.e. it should be possible to set more than one offset (one
per byte).  That way, instead of having a table with a lot of extra
placeholding space in it, if we have a 94x94 matrix (pretty common in
the Japanese encodings, as you know), we can store all the data in an
8836-element array and just index it according to our chosen offsets.
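
As a rough illustration of the dispatch plus the packed 94x94 indexing
(a sketch only: the table names are made up, the 0x8f/JIS 0212 case is
omitted, and there's no validation):

#include "Python.h"

extern Py_UNICODE jisroman2unicode[];   /* indexed from 0x21           */
extern Py_UNICODE jis0208_2unicode[];   /* 94*94 = 8836 packed entries */
extern Py_UNICODE hwkata2unicode[];     /* indexed by raw second byte  */

/* Decode one EUC-JP sequence starting at s; stores the Unicode value
   in *u and returns the number of bytes consumed. */
static int
eucjp_decode_char(const unsigned char *s, Py_UNICODE *u)
{
    if (s[0] >= 0x21 && s[0] <= 0x7e) {          /* JIS-Roman       */
        *u = jisroman2unicode[s[0] - 0x21];
        return 1;
    }
    if (s[0] == 0x8e) {                          /* HW Katakana     */
        *u = hwkata2unicode[s[1]];               /* offset -0x8e00  */
        return 2;
    }
    /* JIS 0208 zone, both bytes 0xa1-0xfe: one offset per byte
       indexes the packed 8836-element array */
    *u = jis0208_2unicode[(s[0] - 0xa1) * 94 + (s[1] - 0xa1)];
    return 2;
}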

3. coming back from Unicode

I'm wondering a little about this, since when we're coming back from
Unicode we basically have no choice (that I can think of) but to have a
table of 2^16 * (max number of bytes in the target encoding) bytes,
with placeholders where there is no mapping.  So, for something like
EUC-TW, which has a maximum of 4 bytes per character, we need an
encoding map 256K in size... is there a better way, that doesn't waste
so much space?  'Course, I would hope that the Taiwanese would put
enough memory in their machines (since memory's pretty cheap there).
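
For concreteness, the flat table would be used like this (a sketch; the
unicode2euctw name and the zero-means-unmapped convention are
assumptions):

#include "Python.h"

extern unsigned long unicode2euctw[65536];   /* 2^16 entries of 4 bytes
                                                = 256K; 0 marks an
                                                unmapped code point */

/* Encode one Unicode character into out[]; returns the number of
   bytes written (1-4), or -1 if there is no mapping. */
static int
euctw_encode_char(Py_UNICODE u, unsigned char *out)
{
    unsigned long v = unicode2euctw[u];
    int shift, n = 0;

    if (v == 0)
        return -1;                               /* no mapping */
    for (shift = 24; shift >= 0; shift -= 8) {
        unsigned char b = (unsigned char)(v >> shift);
        if (b != 0 || n > 0)                     /* skip leading zeros */
            out[n++] = b;
    }
    return n;
}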

I guess the encoder/decoder should also know how to handle modal
encodings - this is easier, though, if we can assume we have the whole
string, or some convenient chunk of it, to work on.  Or maybe modal and
non-modal encoders/decoders should be implemented separately (possibly
sharing utility functions)?  I still have to look at more examples of
Asian encodings, especially the ISO-2022-style ones and the vendor
encodings, to get a better idea of what manipulations they require.
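
For the modal case, the core extra piece is just mode tracking; for
ISO-2022-JP the standard escape sequences are ESC ( B (ASCII),
ESC ( J (JIS-Roman), and ESC $ @ / ESC $ B (the two JIS 0208
revisions).  A sketch of how a decoder might track them (the function
and enum names are made up):

enum jpmode { ASCII_MODE, JISROMAN_MODE, JIS0208_MODE };

/* If s points at a recognized ISO-2022-JP escape sequence, switch
   *mode and return the number of bytes consumed; otherwise return 0. */
static int
iso2022jp_escape(const unsigned char *s, enum jpmode *mode)
{
    if (s[0] != 0x1b)
        return 0;
    if (s[1] == '(' && s[2] == 'B') { *mode = ASCII_MODE;    return 3; }
    if (s[1] == '(' && s[2] == 'J') { *mode = JISROMAN_MODE; return 3; }
    if (s[1] == '$' && (s[2] == '@' || s[2] == 'B')) {
        *mode = JIS0208_MODE;
        return 3;
    }
    return 0;
}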

I was also thinking that the maps, to keep them separate from the
encoders/decoders themselves, would be degenerate Python modules that
expose void * pointers to the mapping tables via PyCObjects... this
seemed to me to be a good way to do maps that will primarily be
accessed by other C modules, rather than from Python... does this seem
like an OK thing?
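
Such a degenerate map module might look something like this (a sketch;
the module and table names are made up, and a consuming C codec would
get the pointer back with PyCObject_AsVoidPtr()):

#include "Python.h"

static Py_UNICODE euc2unicode[] = { 0x3000, 0x3001 /* ... */ };

static PyMethodDef map_methods[] = {
    {NULL, NULL, 0, NULL}       /* no functions - data only */
};

void
init_eucjpmap(void)
{
    PyObject *m, *d, *table;

    m = Py_InitModule("_eucjpmap", map_methods);
    d = PyModule_GetDict(m);
    table = PyCObject_FromVoidPtr((void *)euc2unicode, NULL);
    PyDict_SetItemString(d, "euc2unicode", table);
    Py_DECREF(table);
}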

Awaiting further enlightenment,

--Brian