[I18n-sig] Asian Encodings

Brian Takashi Hooper brian@garage.co.jp
Wed, 22 Mar 2000 10:53:48 +0900


Hi Andy, welcome back,

On Tue, 21 Mar 2000 17:12:21 -0000
"Andy Robinson" <andy@reportlab.com> wrote:

[snip]

> Character Sets and Encodings
> ----------------------------
> Ken Lunde suggests that we should explicitly model Character Sets as
> distinct from Encodings; for example, Shift-JIS is an encoding which
> includes three character sets (ASCII, JIS0208 Kanji and half-width
> katakana).  I tried to do this last year, but was not exactly sure of the
> point; AFAIK it is only useful if you want to reason about whether certain
> texts can survive certain round trips.  Can anyone see a need to do this
> kind of thing?
One complication that arises from this: if you've had a look at the
mappings available on Unicode.org, some of them are encoding maps and
some of them are character set maps.  That by itself is not such a huge
chore, but it makes automatically generating maps somewhat less trivial
than if you ignore the distinction.
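
Just to make the distinction concrete, here's a rough sketch in Python
of how a Shift-JIS byte stream splits into the three character sets
Andy mentions (the lead-byte ranges are from memory and worth
double-checking against Lunde):

  def sjis_charsets(data):
      """Yield (character set, bytes consumed) for each character.
      'data' is a sequence of integer byte values."""
      i = 0
      while i < len(data):
          b = data[i]
          if b <= 0x7F:
              # single-byte range - strictly JIS X 0201 Roman, but ASCII
              # for most practical purposes
              yield ('ASCII', 1)
              i = i + 1
          elif 0xA1 <= b <= 0xDF:
              # single-byte half-width katakana (JIS X 0201 kana)
              yield ('JIS X 0201 katakana', 1)
              i = i + 1
          elif (0x81 <= b <= 0x9F) or (0xE0 <= b <= 0xEF):
              # lead byte of a two-byte JIS X 0208 character
              # (trail byte should be 0x40-0xFC, excluding 0x7F)
              yield ('JIS X 0208', 2)
              i = i + 2
          else:
              yield ('unassigned', 1)
              i = i + 1

A character set map only has to cover one of those pieces; an encoding
map has to cover all three at once.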

[snip]

> Mapping tables:
> ---------------
> For CJKV stuff I strongly favour mapping tables which are built at run time.
> Mapping tables would be some of the possible inputs to our mini-language; we
> would be able to write routines saying 'until byte pattern x encountered do
> (read 2 bytes, look it up in a table, write the values found)', but with
> user-supplied mapping tables.
> 
> These are currently implemented as dictionaries, but there are many
> contiguous ranges and a compact representation is possible.  I did this last
> year for a client and it worked pretty well.  Even the big CJKV ones come
> down to about 80 contiguous ranges.  Conceptually, let's imagine that bytes
> 1 to 5 in source encoding map to 100-104 in destination; 6-10 map to
> 200-204; and 11-15 map to 300-304.  Then we can create a 'compact map'
> structure like this...
>   [(1, 5, 100),
>   (6, 10, 200),
>   (11, 15, 300)]
> ...and a routine which can expand it to a dictionary {1:100, 2:101, ...,
> 15:304}.
This is similar to the way a number of the codecs for glibc's iconv
work - there is an index mapping table consisting of start and end
ranges plus an index for each range, which lets a lookup function index
properly into a big static array.

iconv, as I posted earlier, is one place it might be good to get ideas
from, both about what kinds of operations the codec machine should be
able to do and about how to store the data.
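
For what it's worth, the compact representation and the expansion
routine Andy describes might look something like this in Python (just a
sketch, all names invented; assumes the ranges are sorted and don't
overlap):

  # (first source code, last source code, first destination code)
  COMPACT = [
      (1, 5, 100),
      (6, 10, 200),
      (11, 15, 300),
  ]

  def expand(compact):
      """Expand the range triples into a plain dictionary."""
      mapping = {}
      for first, last, dest in compact:
          for offset in range(last - first + 1):
              mapping[first + offset] = dest + offset
      return mapping

  # expand(COMPACT) == {1: 100, 2: 101, ..., 15: 304}

  import bisect

  def lookup(compact, code):
      """Binary-search the ranges directly, iconv-style, without ever
      building the full dictionary."""
      firsts = [first for (first, last, dest) in compact]
      i = bisect.bisect_right(firsts, code) - 1
      if i >= 0:
          first, last, dest = compact[i]
          if code <= last:
              return dest + (code - first)
      raise KeyError(code)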

How about making the interface to mappings simply __getitem__, as
suggested earlier on this list by Marc-Andre?  I think that might be the
best way to ensure that we have lots of different options for what we
can use as mappings.
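
Something like this sketch (again, just an illustration) would let the
same conversion code take either a real dictionary or a compact table:

  class CompactMap:
      """Present (first, last, dest) range triples through the ordinary
      mapping interface a codec would expect."""

      def __init__(self, ranges):
          self.ranges = list(ranges)

      def __getitem__(self, code):
          for first, last, dest in self.ranges:
              if first <= code <= last:
                  return dest + (code - first)
          raise KeyError(code)

      def get(self, code, default=None):
          try:
              return self[code]
          except KeyError:
              return default

The conversion loop only ever does table[code], so it doesn't need to
know or care which kind of object it was handed.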

The Java i18n classes are also worth a look - they do everything as an
inheritance hierarchy, with the logic for doing the conversion kind of
bundled together with the maps themselves - everything inherits from
either ByteToCharConverter or CharToByteConverter, and then defines a
convert routine to do conversion.  The inheritance relationships are
kind of weird, I think - like, ByteToCharEUC_JP inherits from
ByteToCharJIS0208, and contains ByteToCharJIS0201 and ByteToCharJIS0212
instances as class members.  I like how the codecs return their
maximum character width - this can sometimes be more than two bytes for
some Asian languages, and knowing it helps when calculating memory
allocation for going from Unicode back to a legacy encoding, for
example.  (If anyone's interested, I have decompiled copies of i18n.jar
which I can put up someplace for people to look at).
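
The Unicode-to-legacy direction could advertise the same thing in
Python; something along these lines (the attribute name is my
invention, not anything taken from the Java classes):

  class LegacyEncoder:
      """Sketch of an encoder that advertises its worst-case output
      size per character."""

      # Worst case bytes per character for this encoding; EUC-TW, for
      # instance, can need up to four bytes per character.
      max_bytes_per_char = 4

      def encode(self, text):
          # real conversion omitted - the point is that a caller can
          # preallocate len(text) * self.max_bytes_per_char bytes
          raise NotImplementedError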

> One can also write routines to invert maps, check if they represent a round
> trip and so on.  The attraction is that the definitions can be in literal
> python modules, and look quite like the standards documents that create
> them.  Furthermore, a lot of Japanese corporate encodings go like "Start
> with strict JIS-0208, and add these extra 17 characters..." - so one module
> could define all the variants for Japanese very cleanly and readably.  I
> think this is a good way to tackle user-defined characters - tell them what
> to hack to add their 50 new characters and create an encoding with a new
> name.  If this sounds sensible, I'll try to start on it.
> 
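
In terms of the plain-dictionary representation, the inversion and
round-trip checks and the "base standard plus extra characters"
variants might be as simple as something like this (sketch only, the
values are placeholders):

  def invert(mapping):
      """Invert a code map; refuse if two source codes collide on the
      same target, since such a map cannot survive a round trip."""
      inverse = {}
      for src, dest in mapping.items():
          if dest in inverse:
              raise ValueError('not invertible: %r and %r both map to %r'
                               % (inverse[dest], src, dest))
          inverse[dest] = src
      return inverse

  def round_trips(mapping):
      """A map round-trips exactly when no two source codes share a
      destination code."""
      return len(set(mapping.values())) == len(mapping)

  # A corporate variant: start from the strict JIS0208 table and lay
  # the company's extra characters on top of it.
  JIS0208_MAP = {}                  # generated from the standard table
  EXTRA_CHARS = {0xF040: 0xE000}    # placeholder user-defined mapping
  CORPORATE_MAP = dict(JIS0208_MAP)
  CORPORATE_MAP.update(EXTRA_CHARS)
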
> 
> Test Harness
> ------------
> A digression here, but perhaps we should build a web interface to convert
> arbitrary files and output as HTML, so everyone can test the output of the
> codecs as we write them.  Is this useful?
> 
> That's enough rambling for one day...
> 
> Thanks,
> 
> Andy
> 