[I18n-sig] thinking of CJK codec, some questions

M.-A. Lemburg mal@lemburg.com
Tue, 14 Mar 2000 10:55:24 +0100

Brian Takashi Hooper wrote:
> Should encoding support be an option to ./configure, when you are first
> building Python?  General question to everyone out there - should it be
> possible to intentionally build Python without Unicode support?

How would you do this using configure?

As for the exclusion of Unicode: this is currently not planned.
Doing this would cause the code to become very inelegant due
to the many #ifdefs this introduces (the problem here being that
Unicode support is tightly integrated into the interpreter in
many places).

> [Tools for creating codecs from mappings]
> > Note that the module would only have to provide a simple
> > __getitem__ interface compatible object which then fetches
> > the data from the static C data. The rest can then be done
> > in Python in the same way as the other mapping codecs do their
> > job.
> Am I right in thinking that 'static C data' means something like
> static Py_UNICODE mapping[] = { ... };
>
> ?  Also, from a design standpoint do you (and anyone else on i18n) think
> it would be better to emphasize speed and / or memory efficiency by
> making specialized codecs for the different CJK encodings (for example,
> if a table such as the above is used, then in the case of a particular
> encoding, for example EUC, it may be possible to reduce the size of the
> table by introducing some EUC-specific casing into the encoder/decoder),
> or would it be better to try for a generalized implementation? 

How about a lib of common functions needed for CJK and then
a few small extra modules for each of the specific codecs?
Fast encoders/decoders should be done in C, the whole class
business in Python.
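
A rough sketch of how such a split could look, written in Python
only for illustration (the fast path would really live in C). It
assumes a single shared Unicode -> JIS X 0208 table plus small
per-encoding packing functions; the names jis_to_euc, jis_to_sjis,
UNICODE_TO_JIS and encode are made up for this sketch, and the
three-entry table is a stand-in for real mapping data. ASCII is
passed through unchanged, which is a simplification.

```python
# Shared helpers: turn a JIS X 0208 byte pair (each byte in 0x21..0x7E)
# into the byte pair used by a specific encoding.

def jis_to_euc(j1, j2):
    """EUC-JP simply sets the high bit on both JIS bytes."""
    return bytes((j1 | 0x80, j2 | 0x80))

def jis_to_sjis(j1, j2):
    """Shift_JIS packs two JIS rows into one lead byte."""
    s1 = ((j1 - 0x21) >> 1) + 0x81
    if s1 > 0x9F:
        s1 += 0x40          # skip 0xA0..0xDF (half-width katakana)
    if j1 & 1:              # odd JIS row: lower half of the trail range
        s2 = j2 + 0x1F
        if s2 >= 0x7F:
            s2 += 1         # 0x7F is never used as a trail byte
    else:                   # even JIS row: upper half of the trail range
        s2 = j2 + 0x7E
    return bytes((s1, s2))

# Stand-in for the shared static table: code point -> JIS X 0208 pair.
UNICODE_TO_JIS = {
    0x3042: (0x24, 0x22),   # HIRAGANA LETTER A
    0x4E9C: (0x30, 0x21),   # CJK ideograph U+4E9C
    0x30E0: (0x25, 0x60),   # KATAKANA LETTER MU
}

def encode(text, pack):
    """Encode text with the shared table and one per-encoding packer."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp < 0x80:
            out.append(cp)                      # ASCII passes through
        else:
            out += pack(*UNICODE_TO_JIS[cp])    # KeyError if unmapped
    return bytes(out)
```

With this layout, encode(text, jis_to_euc) and encode(text, jis_to_sjis)
share one table, and each specific codec only contributes its own
small packing function.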

> We need
> something like codecs.charset_encode and codecs.charset_decode for CJK
> char sets - I was thinking that this might be best handled by a few
> separate C modules (for Japanese, one for SJIS, one for EUC, and one for
> JIS) that would in turn use similarly defined mapping modules,
> containing only one or more static conversion maps as arrays - in this
> sense I am leaning towards making tuned codecs for each encoding set.

Andy mentioned that it should be possible to write codecs
which do a couple of smaller switches and implement the other
mappings using some more intelligent logic.

The example I gave above has to be seen in the light of using the
generic mapping codec -- which is probably not of much use in a
multi-byte encoding world, since it currently only supports
1-1 mappings.
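
To make the 1-1 limitation concrete, here is a minimal sketch of a
__getitem__-only object driving the generic mapping codec. It
assumes the charmap_decode function as found in today's codecs
module; StaticTable and _TABLE are hypothetical names, and the dict
merely stands in for static C data exposed by an extension module.

```python
import codecs

# Stand-in for static C data: one input byte -> one Unicode code point.
_TABLE = {0x41: 0x0041, 0xA4: 0x20AC}

class StaticTable:
    """Exposes the table through __getitem__ only."""
    def __getitem__(self, byte):
        # A KeyError (a LookupError) tells the codec the byte is undefined.
        return _TABLE[byte]

def decode(data, errors="strict"):
    # The codec looks up each input byte on its own, so a two-byte
    # CJK sequence can never be handed to the table as a single unit.
    return codecs.charmap_decode(data, errors, StaticTable())
```

Here decode(b"A\xa4") maps each byte separately, which is fine for
single-byte charsets but cannot express a lead byte whose meaning
depends on the byte that follows it.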

I'd suggest going Andy's way for the CJK codecs... Andy ?

> I want to try to make something that many people can use - does this
> sound like a reasonable approach, or am I on the wrong track here?

No, I don't think you're on the wrong track :-)

Marc-Andre Lemburg
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/