[I18n-sig] Asian Encodings

Andy Robinson andy@reportlab.com
Tue, 21 Mar 2000 17:12:21 -0000

I've been on vacation since The Big Patch - sorry about the lousy timing.  I
hope to get up a friendly tutorial on using the Unicode features shortly.

In the meantime, some thoughts on the codecs and recent conversations:

>1. Should the CJK ideograms also be included in the unicodehelpers
>numeric converters?  From my perspective, I'd really like to see them go
>in, and think that it would make sense, too - any opinions?

>2. Same as above with double-width alphanumeric characters - I assume
>these should probably also be included in the lowercase / uppercase
>helpers?  Or will there be a way to add to these lists through the codec
>API (for those worried about data from unused codecs clogging up their
>character type helpers, maybe this would be a good option to have; I
>would by contrast like to be able to exclude all the extra Latin 1 stuff
>that I don't need, hmm.)

>3. Same thing for whitespace - I think there are a number of
>double-width whitespace characters around also.

We have to be really careful about what goes in the Python core, and what is
implemented as helper layers on top, with a preference for the latter where
possible.  If we have access to the character properties database, we could
write some helper libraries which give the full range of isKatakana,
isNumeric etc. in some dynamic way, without needing them hardcoded into the
core; what we are really asking is 'does a character have a property'.  I
haven't checked the API for this yet, but if it is not there then we need
to add it.
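
To show what I mean, helpers like these could sit on top of the character
properties database (via the new unicodedata module) instead of being
hardcoded into the core.  A sketch only; the predicate names are mine, not a
proposed API:

```python
import unicodedata

def is_katakana(ch):
    # Katakana characters all have Unicode names beginning with "KATAKANA".
    return unicodedata.name(ch, "").startswith("KATAKANA")

def numeric_value(ch):
    # Covers the CJK ideographic numerals too, e.g. U+4E09 (three).
    # Returns None for characters with no numeric property.
    return unicodedata.numeric(ch, None)
```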

>Don't know what the standard installation method is... this
>hasn't been sorted out yet.

I'm keen to sort this out, so we can start playing with codecs.

Here's a bunch of ideas I'd like to float.  From now on, please assume I am
discussing some kind of CJK add-on package and not the Python core; it may
benefit from some helper functions in the core, but is not for everybody.

Character Sets and Encodings
Ken Lunde suggests that we should explicitly model Character Sets as
distinct from Encodings; for example, Shift-JIS is an encoding which
includes three character sets (ASCII, the JIS0208 kanji and half-width
katakana).  I tried to do this last year, but was not exactly sure of the
point; AFAIK it is only useful if you want to reason about whether certain
texts can survive certain round trips.  Can anyone see a need to do this
kind of thing?
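
If we did model it, a minimal version might look like the sketch below
(class and function names are illustrative, not a proposed API); the
round-trip question then reduces to a repertoire check:

```python
class CharacterSet:
    def __init__(self, name, codepoints):
        self.name = name
        self.codepoints = frozenset(codepoints)

class Encoding:
    def __init__(self, name, charsets):
        self.name = name
        self.charsets = charsets

    def repertoire(self):
        # Union of the code points of all member character sets.
        chars = set()
        for cs in self.charsets:
            chars |= cs.codepoints
        return chars

def survives_round_trip(text, target):
    # Text survives a trip into the target encoding and back only if
    # every character is in the target encoding's repertoire.
    return set(text) <= target.repertoire()
```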

Bypassing Unicode
At some level, it should be possible to write and 'register' a codec which
goes straight from, say, EUC to Shift_JIS without Unicode in the middle,
using our codec machine.  We need to figure out how this will be accessed;
what is the clean way for a user to request the codec, without complicating
or affecting anything in the present implementation.  The present
conventions of StreamWriters, StreamRecoders etc. are really useful, with or
without Unicode.

Can we overload to do codecs.lookup(sourceEncoding, destEncoding)?  Or
should it be something totally separate?
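
One way to keep this out of the core would be a small registry layered on
top, falling back to a trip through Unicode when no direct codec is
registered.  A sketch only; register_transcoder and recode are hypothetical
names, not part of the present codecs API:

```python
_direct = {}  # (source, dest) -> function taking and returning byte strings

def register_transcoder(source, dest, func):
    _direct[(source.lower(), dest.lower())] = func

def recode(data, source, dest):
    # Use a registered direct transcoder if one exists;
    # otherwise round-trip through Unicode.
    func = _direct.get((source.lower(), dest.lower()))
    if func is not None:
        return func(data)
    return data.decode(source).encode(dest)
```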

Codecs State Machine
As you know I suggested an mxTextTools-inspired mini-language for doing
stream transformations.  I've never written this kind of thing before, but
think it could be quite useful - I bet it could do data compression and
image manipulation too.  However, I have no experience designing languages.
It seems to me that we should be able to convert data faster than we can
read/write to disk, but beyond that we need flexibility more than speed.  Now
what actions does it need?

Should we steam straight in, or prototype it in Python?
- what types?  it cannot be as flexible as Python, or it will be no faster.
Presumably most of the functions are statically typed, and we only need
bytes/character, integers and booleans
- what events on initialization?  construct mapping tables?
- read n bytes from input into a string buffer
- write n bytes from a string buffer to output
- look up 1/2/n bytes in a mapping
- a full set of math and bit operator routines
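
To make the idea concrete, here is a toy Python prototype of such a machine;
the opcode set is invented purely for illustration, not a proposal:

```python
def run(program, data, tables):
    # program is a list of opcodes: ("read", n) pulls n bytes into the
    # buffer, ("map", name) looks the buffer up in a named mapping table
    # (passing it through unchanged if absent), ("write",) appends the
    # buffer to the output.
    out = []
    pos = 0
    buf = b""
    while pos < len(data):
        for op in program:
            if op[0] == "read":
                buf = data[pos:pos + op[1]]
                pos += op[1]
            elif op[0] == "map":
                buf = tables[op[1]].get(buf, buf)
            elif op[0] == "write":
                out.append(buf)
    return b"".join(out)
```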

One good suggestion I had from Aaron Watters was that by treating it as a
language, one could have a code-generation option as well as a runtime; we
might be able to create C code for specific encodings on demand.

Mapping tables:
For CJKV stuff I strongly favour mapping tables which are built at run time.
Mapping tables would be some of the possible inputs to our mini-language; we
would be able to write routines saying 'until byte pattern x encountered do
(read 2 bytes, look it up in a table, write the values found)', but with
user-supplied mapping tables.

These are currently implemented as dictionaries, but there are many
contiguous ranges and a compact representation is possible.  I did this last
year for a client and it worked pretty well.  Even the big CJKV ones come
down to about 80 contiguous ranges.  Conceptually, let's imagine that bytes
1 to 5 in the source encoding map to 100-104 in the destination; 6-10 map to
200-204; and 11-15 map to 300-304.  Then we can create a 'compact map'
structure like this...
  [(1, 5, 100),
  (6, 10, 200),
  (11, 15, 300)]
...and a routine which can expand it into a dictionary {1: 100, 2: 101, ...}.
One can also write routines to invert maps, check if they represent a round
trip and so on.  The attraction is that the definitions can be in literal
python modules, and look quite like the standards documents that create
them.  Furthermore, a lot of Japanese corporate encodings go like "Start
with strict JIS-0208, and add these extra 17 characters..." - so one module
could define all the variants for Japanese very cleanly and readably.  I
think this is a good way to tackle user-defined characters - tell them what
to hack to add their 50 new characters and create an encoding with a new
name.  If this sounds sensible, I'll try to start on it.
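
A sketch of the expansion, inversion and round-trip routines described above
(function names are mine):

```python
def expand(compact):
    # Turn [(src_lo, src_hi, dest_lo), ...] ranges into a full mapping.
    mapping = {}
    for src_lo, src_hi, dest_lo in compact:
        for offset in range(src_hi - src_lo + 1):
            mapping[src_lo + offset] = dest_lo + offset
    return mapping

def invert(mapping):
    # Swap keys and values; fails if two sources share a destination.
    inverted = {}
    for src, dest in mapping.items():
        if dest in inverted:
            raise ValueError("not invertible: %r mapped twice" % dest)
        inverted[dest] = src
    return inverted

def is_round_trip(mapping):
    # True if encode-then-decode recovers every source value.
    try:
        inverted = invert(mapping)
    except ValueError:
        return False
    return all(inverted[mapping[src]] == src for src in mapping)
```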

Test Harness
A digression here, but perhaps we should build a web interface to convert
arbitrary files and output as HTML, so everyone can test the output of the
codecs as we write them.  Is this useful?

That's enough rambling for one day...