[I18n-sig] SIG charter and goals

M.-A. Lemburg mal@lemburg.com
Wed, 09 Feb 2000 10:34:44 +0100

Andy Robinson wrote:
> On Tue, 08 Feb 2000 15:31:43 +0100, you wrote:
> >> 2. Encodings API and library:
> >> --------------------------------
> >>
> >> We must deliver an encodings library which surpasses the features of
> >> that in Java.  It should allow conversion between many common
> >> encodings; access to Unicode character properties; and anything else
> >> which makes encoding conversion more pleasant.  This should be
> >> initially based on MAL's draft specification, although the spec may
> >> be changed if we find good reason to.
> >
> >Note that Python will have a builtin codec support. The details
> >are described in the proposal paper (not the C API though --
> >that still lives in the .h files of the Unicode implementation).
> >
> >Note that I have had good experience with the existing
> >spec: it is very flexible, extensible and versatile. It also
> >greatly reduces coding effort by providing working base classes.
> >
> I can't wait to try the code, and cannot foresee any problems at the
> moment based on the spec.  However, it was only discussed on the
> Python-dev list, and Marc-Andre was not at IPC8, so I should try to
> explain some background for everyone (and what my agenda as SIG
> moderator is, too!)
> 1. HP joined the Python consortium and pushed for Unicode support last
> year.  There was a detailed discussion on the Python-dev list (to
> which I was invited because my day-job included some very messy
> double-byte work in Python for a year).   Marc-Andre's proposal went
> through about eight iterations, and he started to code it up under
> contract to CNRI.  This is official work, and there is no question of
> anybody else's Unicode modules being used - sorry!  Fredrik Lundh's
> work on the Unicode regex engine is also under contract and
> progressing rapidly.
> 2. MAL's document defines the API for 'codecs' - conversion filters -
> but his task does not include delivering a package with all the
> world's common encodings in it.  That is a necessity in the long run,
> and both I (through ReportLab) and Digital Garage need to make at
> least the Japanese encodings work quite soon.
> (Marc-Andre, can you update us on what codecs you are providing, and
> how they are implemented? C or Python? )

These codecs are currently included:

    ascii.py
    latin_1.py
    raw_unicode_escape.py
    unicode_escape.py
    unicode_internal.py
    utf_16.py
    utf_16_be.py
    utf_16_le.py
    utf_8.py
If time permits there will also be a generic mapping codec
API which knows what to do with Python mapping tables. I'm
not sure how this will be done though... perhaps via a
subpackage of encodings which holds any number of tablename.py
modules which a special search function then finds and loads.

You'd then write something like

u = unicode(rawdata, 'mapping-pc850')

and the search function would then scan the encodings.mapping
package for a module pc850 and use its mapping table for
the conversion.
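To make the idea concrete, here is a minimal, self-contained sketch of such a table-driven decoder. Everything in it is hypothetical: a real version would locate the table module dynamically inside an encodings.mapping subpackage, whereas this one uses an in-memory registry, and the three-entry pc850 sample table is obviously incomplete.

```python
# Hypothetical sketch of a mapping-table decoder. A real implementation
# would import the table from an encodings.mapping subpackage; here an
# in-memory registry keeps the example self-contained.

# Tiny stand-in for a full 256-entry pc850 table (byte -> character):
PC850_SAMPLE = {0x41: "A", 0x80: "\u00c7", 0x81: "\u00fc"}  # A, Ç, ü

TABLES = {"mapping-pc850": PC850_SAMPLE}

def mapping_decode(rawdata, encoding):
    """Decode a byte string using the named mapping table."""
    table = TABLES[encoding]  # the 'search function' step, much simplified
    return "".join(table[byte] for byte in rawdata)

print(mapping_decode(b"\x41\x80\x81", "mapping-pc850"))  # -> AÇü
```

An unmapped byte would raise KeyError here; a real codec would route that case through an errors argument ('strict', 'replace', ...).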

> 3. At IPC8 we discussed (among other things) the delivery of the codec
> package - both in the i18n forum and in the corridors as usual!  To do
> what Java does, we eventually need codecs for 50+ common encodings,
> all available and tested.  These will almost certainly not be in the
> standard distribution, but there should eventually be a single,
> certified, tested source for them, as this stuff has to be 100% right.
> Quite a few of us urgently need good Japanese support.
> The current spec does not say whether codecs should be in C or Python.

It is designed to make both possible. I currently code the
converters in C and the rest in Python, which works very well
and reduces coding effort to a minimum (the codec base classes
are designed to provide everything needed to get the most
out of a simple setup).
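As an illustration of that division of labour (a toy example, not the actual implementation), a pure-Python codec can subclass the stdlib `codecs.Codec` base class and supply only the core encode/decode pair:

```python
import codecs

# Toy codec for illustration only: rotate ASCII letters by 13 places.
# Only the two core methods are written; stream and stateful wrappers
# can be derived from them, which is where the base classes save work.
class Rot13Codec(codecs.Codec):
    def encode(self, input, errors="strict"):
        out = "".join(self._rot(c) for c in input)
        return out.encode("ascii"), len(input)

    def decode(self, input, errors="strict"):
        text = bytes(input).decode("ascii")
        out = "".join(self._rot(c) for c in text)
        return out, len(input)

    @staticmethod
    def _rot(c):
        if "a" <= c <= "z":
            return chr((ord(c) - ord("a") + 13) % 26 + ord("a"))
        if "A" <= c <= "Z":
            return chr((ord(c) - ord("A") + 13) % 26 + ord("A"))
        return c

codec = Rot13Codec()
print(codec.encode("Hello")[0])  # -> b'Uryyb'
```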

> Guido expressed the hope that a few carefully chosen C routines could
> allow us to write new filters in Python, but get most of the speed of
> C - an idea I'd been drip-feeding to him for some time :-)   I think
> that is a proper task for this group, and one I hope to put a lot of
> work in to.  I'm personally hoping that we can do a sort of
> mini-mxTextTools state machine which has actions for lookups in
> single-byte mapping tables, double-byte mapping tables and other
> things, so that new encodings can be written and added easily, yet
> still run fast.  For example, all single-byte encodings can be dealt
> with by a streaming version of something like string.translate(), so
> adding a new one just becomes a matter of adding a 256-element list to
> a file somewhere.  I believe most of the double-byte ones can be
> reduced to a few kb with the right functions as well.  I'll be ready
> to talk more about this shortly.

There will be a mapping-based translate function or method
in the final release which you should be able to build upon.
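The stdlib `codecs` module does expose a charmap-style decode that works this way; whether it is exactly the function meant here is an assumption, but it shows the shape of a table-driven conversion:

```python
import codecs

# Decode bytes through a Python mapping table (byte value -> character).
# The three-entry table is illustrative only, not a real encoding.
table = {0x61: "a", 0x65: "e", 0xE9: "\u00e9"}
text, consumed = codecs.charmap_decode(b"\x61\xe9", "strict", table)
print(text, consumed)  # -> aé 2
```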

> Guido also made it clear that while MAL's proposal is considered
> pretty good, it is not set in stone yet. In particular, if the
> double-byte specialists find that some minor tweaks would make their
> lives better, he would consider it; we need a real-world test-drive
> before 1.6, and this group is the place to do it.

Right :-)

> Now for my own opinions on how things should be run henceforth.  Feel
> free to differ!
> I should point out that the inner circle of Python developers are NOT
> experts in multi-byte data.  I feel strongly that we should seek out
> the best expertise in the world, starting now.  This discussion will
> not focus on Unicode string implementation in the core, but on what
> our encoding library lets you do at the application level.   Ken
> Lunde, author of "CJKV Information Processing", is the acknowledged

(what does the V stand for ?)

> world leader in this field, and agreed to take part in a discussion
> and review our proposals - I'll try to bring him in shortly.  It would
> also be good to collar some people involved in the Java i18n libraries
> and ask what they would do differently next time around, and to talk
> to people who have worked with commercial tools like Unilib and
> Rosette.  Then, we won't just hope that Python has the best i18n
> support, we'll know it.  Naturally this review needs to happen fairly
> promptly in March/April - maybe best to wait until we can run the
> code.
> I hope this helps a little.  If people have serious issues about where
> things are heading, let's hear them now.
> Best Regards,
> Andy Robinson
> p.s. one thing I would be very interested to hear is what people's
> angles are - relevant experience, willingness to help out, needs for
> solutions etc!

Marc-Andre Lemburg
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/