[I18n-sig] CJK codecs etc

Christian Wittern chris@ccbs.ntu.edu.tw
Thu, 16 Mar 2000 12:14:08 +0800


Hi everybody,

I have some comments about CJK codecs, which come more from a user's than a
programmer's perspective.

1.) Please provide a (configurable?) fallback for failed conversions. This
is of course especially needed for conversions out of Unicode. What I have
in mind is, for example, to provide the Unicode codepoint as an entity
(&U-4e00;) or a Java escape or some such, depending on the user's choice.
Don't just give a '?', which is what M$'s braindead conversion routines do
and which regularly drives me nuts.
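
A rough sketch of how such a configurable fallback could look in Python
follows. It assumes a Python 3 interpreter with codecs.register_error; the
handler name "u_entity" and the function entity_fallback are made up for
this illustration and are not part of any existing codec package.

```python
import codecs

def entity_fallback(exc):
    """Replace unencodable characters with &U-xxxx; entities."""
    # Only encoding failures are handled here; anything else is re-raised.
    if not isinstance(exc, UnicodeEncodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]
    replacement = "".join("&U-%04x;" % ord(ch) for ch in bad)
    # Return the replacement text and the position at which to resume.
    return replacement, exc.end

# "u_entity" is a hypothetical handler name chosen for this sketch.
codecs.register_error("u_entity", entity_fallback)

text = "A\u4e00B"
print(text.encode("ascii", errors="u_entity"))          # b'A&U-4e00;B'
print(text.encode("ascii", errors="backslashreplace"))  # b'A\\u4e00B'
```

The user's choice then comes down to which errors= name is passed to the
conversion, so a '?' is only one option among several.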

2.) On the same topic, there are some fairly frequent codepoints that map
to different codepoints in the Japanese and Taiwanese encodings, although
in most cases this is not what one expects. These codepoints should have
been eliminated by Unicode's unification rules, but crept in via the
source-encoding separation rule -- not a very good decision in my opinion.
I have a list of some such characters at
http://www.chibs.edu.tw/~chris/smart/cjkconv.htm. Ideally, there should be
a way for the user to influence the conversion by providing a list of his
choice (with his modifications) to the codec, to overlay the predefined
values.
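
One possible shape for such an overlay, sketched in Python: the user's list
is applied to the text first, and the stock codec runs afterwards. The names
USER_OVERLAY and encode_with_overlay are hypothetical, and the single table
entry is only an illustrative pick (it is not taken from the list at the URL
above); a real overlay would be filled in by the user.

```python
# Hypothetical user-supplied overlay: Unicode code point -> preferred code point.
# The one entry below is illustrative only.
USER_OVERLAY = {
    0xFF5E: 0x301C,  # FULLWIDTH TILDE -> WAVE DASH
}

def encode_with_overlay(text, encoding, overlay=None):
    """Apply the user's overrides, then run the predefined conversion."""
    if overlay:
        text = text.translate(overlay)
    return text.encode(encoding)

# With the overlay, U+FF5E is encoded exactly as U+301C would be.
data = encode_with_overlay("\uff5e", "shift_jis", USER_OVERLAY)
print(data == "\u301c".encode("shift_jis"))
```

Keeping the overlay as a plain table separate from the codec means the
predefined values stay untouched and each user can carry his own list.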

3.) The nasty problem of user-defined characters. I think there should be a
default mapping of the user-defined area in DBCS encodings to the Unicode
code range for user characters. Microsoft uses fixed sequential tables and I
think that is a good idea, since it is pretty straightforward. In Big5, for
example, the area of user-defined characters starts at FA40, FA41, ..., which
gets mapped to Unicode E000, E001, ... There should also be an option to use
some kind of entity reference instead.
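
The sequential mapping can be sketched as follows. This assumes the usual
Big5 layout of 157 trail bytes per lead byte (0x40-0x7E and 0xA1-0xFE) and
covers only the lead bytes 0xFA-0xFE; the function name big5_uda_to_pua is
hypothetical, and the exact extent of the user-defined area should be checked
against the vendor's tables.

```python
def big5_uda_to_pua(code):
    """Map a two-byte Big5 user-defined code sequentially into U+E000..."""
    lead, trail = code >> 8, code & 0xFF
    if not 0xFA <= lead <= 0xFE:
        raise ValueError("lead byte outside the assumed user-defined area")
    if 0x40 <= trail <= 0x7E:
        column = trail - 0x40            # first 63 trail bytes
    elif 0xA1 <= trail <= 0xFE:
        column = trail - 0xA1 + 63       # remaining 94 trail bytes
    else:
        raise ValueError("invalid Big5 trail byte")
    return 0xE000 + (lead - 0xFA) * 157 + column

print(hex(big5_uda_to_pua(0xFA40)))  # 0xe000
print(hex(big5_uda_to_pua(0xFA41)))  # 0xe001
```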

4.) Years ago I developed the habit of using entity references for any
characters not representable in the character set used by the system. I
have seen this becoming more widespread in the user communities I work with.
It would be very useful for us if the Unicode conversion routines in Python
could be told to treat some arbitrary entity references as characters (we
use things like &M24501; for the characters assigned by the Mojikyo Font
Institute (see www.mojikyo.gr.jp) and &C4-4e21; for characters in the
Taiwanese CNS encoding). I realize that this is a rather specialised usage,
but it would be great and very helpful to have some hook in the system to
treat this stuff just like any other character.
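
As an illustration of such a hook, here is a small Python sketch that turns
entity references into single characters before further processing. The
table contents are hypothetical: the private-use codepoints assigned to
&M24501; and &C4-4e21; are placeholders, and expand_entities is a made-up
helper, not an existing API.

```python
import re

# Matches references such as &M24501; &C4-4e21; or &U-4e00;
ENTITY = re.compile(r"&([A-Za-z][A-Za-z0-9-]*);")

# Hypothetical user table: entity name -> character (private-use placeholders).
USER_ENTITIES = {
    "M24501": "\ue000",
    "C4-4e21": "\ue001",
}

def expand_entities(text, table):
    """Replace known entity references; leave unknown ones untouched."""
    def replace(match):
        name = match.group(1)
        if name in table:
            return table[name]
        if name.startswith("U-"):
            try:
                # &U-xxxx; carries the Unicode codepoint itself.
                return chr(int(name[2:], 16))
            except (ValueError, OverflowError):
                pass
        return match.group(0)
    return ENTITY.sub(replace, text)

print(len(expand_entities("a&M24501;b&U-4e00;c", USER_ENTITIES)))  # 5
```

Once expanded, each reference occupies exactly one character position, so
searching, sorting and slicing treat it like any other character.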


Any comments?

All the best,

Christian



Dr. Christian Wittern
Chung-Hwa Institute of Buddhist Studies
276, Kuang Ming Road, Peitou 112
Taipei, TAIWAN
Tel. +886-2-2892-6111#65, Email chris@ccbs.ntu.edu.tw