[I18n-sig] CJK codecs etc
Thu, 16 Mar 2000 11:35:04 +0100
Christian Wittern wrote:
> Hi everybody,
> I have some comments about CJK codecs, which are more from a user's than a
> programmer's perspective.
> 1.) Please provide a (configurable?) fallback for failed conversions. This
> is of course especially needed for conversions out of Unicode. What I have
> in mind is, for example, provide the Unicode codepoint as entity (&U-4e00;)
> or Java escape or some such, depending on the user's choice. Don't just give
> a '?', which is what M$'s braindead conversion routines do and thus regularly
> drive me nuts.
Please read the Misc/unicode.txt file. There are different error
handling techniques available... 'strict' (raise an error),
'ignore' (ignore the failed mapping), 'replace' (replace the
failed mapping by some codec specific replacement char, e.g. '?').
The error argument is codec specific -- the above values must
be supported by every codec, but codecs are free to implement
additional error handling schemes on top of them.
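As an illustration of such an additional scheme, here is a sketch of the
&U-XXXX; fallback Christian asked for, written against the modern
codecs.register_error API (which did not exist at the time of this post);
the handler name 'entityreplace' is made up for the example:

```python
import codecs

# Custom error handler: replace each unencodable character with an
# &U-XXXX; entity carrying its Unicode codepoint, instead of a '?'.
def entity_replace(error):
    # error is a UnicodeEncodeError describing the failed span
    chars = error.object[error.start:error.end]
    replacement = ''.join('&U-%04x;' % ord(c) for c in chars)
    # return the replacement text and the position to resume at
    return (replacement, error.end)

codecs.register_error('entityreplace', entity_replace)

print('\u4e00 one'.encode('ascii', 'entityreplace'))  # b'&U-4e00; one'
```

The handler is picked per call via the errors argument, so the fallback
really is user-configurable rather than baked into the codec.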
> 2.) On the same topic, there are some fairly frequently used codepoints that
> map to different codepoints in the Japanese and Taiwanese encodings, although
> this is in most cases not expected. These codepoints should have been
> eliminated by Unicode's unification rules, but crept in via the
> source-encoding separation rule -- not a very good decision in my opinion. I
> have a list of some such characters at
> http://www.chibs.edu.tw/~chris/smart/cjkconv.htm. Ideally,
> there should be a way for the user to influence the conversion by providing
> a list of his choice (with his modifications) to the codec, to overlay the
> predefined values.
Everybody can write their own codecs... so no comment on this one ;-)
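That said, a user-supplied overlay doesn't even need a full codec: remapping
the disputed codepoints before handing the string to the encoder gets the
same effect. A minimal sketch (the particular mapping below is a made-up
example, not one of the codepoints from Christian's list):

```python
# Overlay table: codepoint -> preferred replacement codepoint,
# applied with str.translate() before the real codec runs.
# 0xFF0D -> 0x2212 is a hypothetical user choice for illustration.
overlay = {0xFF0D: 0x2212}  # FULLWIDTH HYPHEN-MINUS -> MINUS SIGN

def encode_with_overlay(text, encoding, errors='strict'):
    # translate() substitutes the overlaid codepoints, then the
    # standard codec handles everything else as usual.
    return text.translate(overlay).encode(encoding, errors)

print(encode_with_overlay('a\uff0db', 'utf-8'))
```

This keeps the predefined codec tables untouched while still letting the
user's list win for the codepoints it covers.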
> 3.) The nasty problem of user defined characters. I think there should be a
> default mapping of the user defined area in DBCS encodings to the Unicode
> code range for user characters. Microsoft uses fixed sequential tables and I
think that is a good idea, since it is pretty straightforward. In Big5, for
example, the area of user-defined characters starts at FA40, FA41, ..., which
get mapped to Unicode E000, E001, ... There should also be an option to use
> some kind of entity reference instead.
The core Python Unicode implementation doesn't touch these
private code areas at all. This issue is left to the codecs.
Since they are probably of some importance to the Asian world
due to the many corporate char sets, I guess the Asian codecs
should provide some kind of logic to handle these areas as
special cases... perhaps by passing an extra mapping table
to the codec.
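The sequential scheme described above is easy to express as code. This sketch
assumes the common Big5 layout (lead bytes 0xFA-0xFE, trail bytes 0x40-0x7E
and 0xA1-0xFE); actual vendor tables may carve up the user area differently:

```python
# Map a Big5 user-defined code (lead, trail byte pair) sequentially
# onto the Unicode private use area starting at U+E000.
def big5_user_to_unicode(lead, trail):
    # 63 low trail cells (0x40-0x7E) + 94 high trail cells (0xA1-0xFE)
    trail_count = (0x7E - 0x40 + 1) + (0xFE - 0xA1 + 1)  # 157 per row
    if trail <= 0x7E:
        col = trail - 0x40
    else:
        col = (0x7E - 0x40 + 1) + (trail - 0xA1)
    row = lead - 0xFA
    return chr(0xE000 + row * trail_count + col)

print(hex(ord(big5_user_to_unicode(0xFA, 0x40))))  # 0xe000
print(hex(ord(big5_user_to_unicode(0xFA, 0x41))))  # 0xe001
```

A codec could fall back to a function like this whenever it hits a code
outside its regular mapping table, with an extra user-supplied table
overriding it where needed.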
> 4.) I developed years ago the habit of using entity references for any
> characters not representable in the given character set used by the system.
> I have seen this becoming more widespread in the user communities I work
> with. It would be very useful for us if the Unicode conversion routines in
> Python could be told to treat some arbitrary entity references as characters
> (we use things like &M24501; for the characters assigned by the Mojikyo Font
> Institute (see www.mojikyo.gr.jp) and &C4-4e21; for characters in the
> Taiwanese CNS
> encoding). I realize that this is a rather specialised usage, but it would
> be great and very helpful to have some hook in the system to treat this
> stuff just like any other character.
Hmm, sounds like some kind of SGML entity codec could solve this problem.
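The decode direction of such an entity codec is straightforward for the
&U-XXXX; form, where the codepoint is spelled out in the entity itself;
a sketch (names like &M24501; would need their own lookup table and are
deliberately left untouched here):

```python
import re

# Expand &U-XXXX; entities back into the corresponding Unicode
# characters; other entity forms pass through unchanged.
def expand_u_entities(text):
    return re.sub(r'&U-([0-9a-fA-F]{4,6});',
                  lambda m: chr(int(m.group(1), 16)),
                  text)

print(expand_u_entities('&U-4e00; and &M24501;'))
```

Paired with an encode-side error handler that emits the same entities, this
would give the round trip the post is asking for, at least for codepoints
that exist in Unicode.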
Python Pages: http://www.lemburg.com/python/