[I18n-sig] SIG charter and goals

M.-A. Lemburg mal@lemburg.com
Wed, 09 Feb 2000 10:34:44 +0100

Andy Robinson wrote:
> On Tue, 08 Feb 2000 15:31:43 +0100, you wrote:
> >> 2. Encodings API and library:
> >> --------------------------------
> >>
> >> We must deliver an encodings library which surpasses the features of
> >> that in Java.  It should allow conversion between many common
> >> encodings; access to Unicode character properties; and anything else
> >> which makes encoding conversion more pleasant.  This should be
> >> initially based on MAL's draft specification, although the spec may
> >> be changed if we find good reason to.
> >
> >Note that Python will have a builtin codec support. The details
> >are described in the proposal paper (not the C API though --
> >that still lives in the .h files of the Unicode implementation).
> >
> >Note that I have had good experience with the existing
> >spec: it is very flexible, extensible and versatile. It also
> >greatly reduces coding effort by providing working base classes.
> >
> I can't wait to try the code, and cannot foresee any problems at the
> moment based on the spec.  However, it was only discussed on the
> Python-dev list, and Marc-Andre was not at IPC8, so I should try to
> explain some background for everyone (and what my agenda as SIG
> moderator is, too!)
> 1. HP joined the Python consortium and pushed for Unicode support last
> year.  There was a detailed discussion on the Python-dev list (to
> which I was invited because my day-job included some very messy
> double-byte work in Python for a year).   Marc-Andre's proposal went
> through about eight iterations, and he started to code it up under
> contract to CNRI.  This is official work, and there is no question of
> anybody else's Unicode modules being used - sorry!  Fredrik Lundh's
> work on the Unicode regex engine is also under contract and
> progressing rapidly.
> 2. MAL's document defines the API for 'codecs' - conversion filters -
> but his task does not include delivering a package with all the
> world's common encodings in it.  That is a necessity in the long run,
> and both I (through ReportLab) and Digital Garage need to make at
> least the Japanese encodings work quite soon.
> (Marc-Andre, can you update us on what codecs you are providing, and
> how they are implemented? C or Python? )

These codecs are currently included:

    ascii.py
    latin_1.py
    raw_unicode_escape.py
    unicode_escape.py
    unicode_internal.py
    utf_16.py
    utf_16_be.py
    utf_16_le.py
    utf_8.py
If time permits there will also be a generic mapping codec
API which knows what to do with Python mapping tables. I'm
not sure how this will be done though... perhaps via a
subpackage of encodings which holds any number of tablename.py
modules which a special search function then finds and loads.

You'd then write something like

u = unicode(rawdata, 'mapping-pc850')

and the search function would then scan the encodings.mapping
package for a module pc850 and use its mapping table for
the conversion.
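To make the idea concrete, here is a minimal, self-contained sketch of such a table-driven decoder. Everything in it is hypothetical: a real version would locate the table module dynamically inside an encodings.mapping subpackage, whereas this one uses an in-memory registry, and the three-entry pc850 sample table is obviously incomplete.

```python
# Hypothetical sketch of a mapping-table decoder. A real implementation
# would import the table from an encodings.mapping subpackage; here an
# in-memory registry keeps the example self-contained.

# Tiny stand-in for a full 256-entry pc850 table (byte -> character):
PC850_SAMPLE = {0x41: "A", 0x80: "\u00c7", 0x81: "\u00fc"}  # A, Ç, ü

TABLES = {"mapping-pc850": PC850_SAMPLE}

def mapping_decode(rawdata, encoding):
    """Decode a byte string using the named mapping table."""
    table = TABLES[encoding]  # the 'search function' step, much simplified
    return "".join(table[byte] for byte in rawdata)

print(mapping_decode(b"\x41\x80\x81", "mapping-pc850"))  # -> AÇü
```

An unmapped byte would raise KeyError here; a real codec would route that case through an errors argument ('strict', 'replace', ...).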

> 3. At IPC8 we discussed (among other things) the delivery of the codec
> package - both in the i18n forum and in the corridors as usual!  To do
> what Java does, we eventually need codecs for 50+ common encodings,
> all available and tested.  These will almost certainly not be in the
> standard distribution, but there should eventually be a single,
> certified, tested source for them, as this stuff has to be 100% right.
> Quite a few of us urgently need good Japanese support.
> The current spec does not say whether codecs should be in C or Python.

It is designed to make both possible. I currently code the
converters in C and the rest in Python, which works very well
and reduces coding effort to a minimum (the codec base classes
are designed to provide everything needed to get the most
out of a simple setup).
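As an illustration of that division of labour (a toy example, not the actual implementation), a pure-Python codec can subclass the stdlib `codecs.Codec` base class and supply only the core encode/decode pair:

```python
import codecs

# Toy codec for illustration only: rotate ASCII letters by 13 places.
# Only the two core methods are written; stream and stateful wrappers
# can be derived from them, which is where the base classes save work.
class Rot13Codec(codecs.Codec):
    def encode(self, input, errors="strict"):
        out = "".join(self._rot(c) for c in input)
        return out.encode("ascii"), len(input)

    def decode(self, input, errors="strict"):
        text = bytes(input).decode("ascii")
        out = "".join(self._rot(c) for c in text)
        return out, len(input)

    @staticmethod
    def _rot(c):
        if "a" <= c <= "z":
            return chr((ord(c) - ord("a") + 13) % 26 + ord("a"))
        if "A" <= c <= "Z":
            return chr((ord(c) - ord("A") + 13) % 26 + ord("A"))
        return c

codec = Rot13Codec()
print(codec.encode("Hello")[0])  # -> b'Uryyb'
```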

> Guido expressed the hope that a few carefully chosen C routines could
> allow us to write new filters in Python, but get most of the speed of
> C - an idea I'd been drip-feeding to him for some time :-)   I think
> that is a proper task for this group, and one I hope to put a lot of
> work in to.  I'm personally hoping that we can do a sort of
> mini-mxTextTools state machine which has actions for lookups in
> single-byte mapping tables, double-byte mapping tables and other
> things, so that new encodings can be written and added easily, yet
> still run fast.  For example, all single-byte encodings can be dealt
> with by a streaming version of something like string.translate(), so
> adding a new one just becomes a matter of adding a 256-element list to
> a file somewhere.  I believe most of the double-byte ones can be
> reduced to a few kb with the right functions as well.  I'll be ready
> to talk more about this shortly.

There will be a mapping-based translate function or method
in the final release which you should be able to build upon.
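The stdlib `codecs` module does expose a charmap-style decode that works this way; whether it is exactly the function meant here is an assumption, but it shows the shape of a table-driven conversion:

```python
import codecs

# Decode bytes through a Python mapping table (byte value -> character).
# The three-entry table is illustrative only, not a real encoding.
table = {0x61: "a", 0x65: "e", 0xE9: "\u00e9"}
text, consumed = codecs.charmap_decode(b"\x61\xe9", "strict", table)
print(text, consumed)  # -> aé 2
```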

> Guido also made it clear that while MAL's proposal is considered
> pretty good, it is not set in stone yet. In particular, if the
> double-byte specialists find that some minor tweaks would make their
> lives better, he would consider it; we need a real-world test-drive
> before 1.6, and this group is the place to do it.

Right :-)

> Now for my own opinions on how things should be run henceforth.  Feel
> free to differ!
> I should point out that the inner circle of Python developers are NOT
> experts in multi-byte data.  I feel strongly that we should seek out
> the best expertise in the world, starting now.  This discussion will
> not focus on Unicode string implementation in the core, but on what
> our encoding library lets you do at the application level.   Ken
> Lunde, author of "CJKV Information Processing", is the acknowledged

(what does the V stand for ?)

> world leader in this field, and agreed to take part in a discussion
> and review our proposals - I'll try to bring him in shortly.  It would
> also be good to collar some people involved in the Java i18n libraries
> and ask what they would do differently next time around, and to talk
> to people who have worked with commercial tools like Unilib and
> Rosette.  Then, we won't just hope that Python has the best i18n
> support, we'll know it.  Naturally this review needs to happen fairly
> promptly in March/April - maybe best to wait until we can run the
> code.
> I hope this helps a little.  If people have serious issues about where
> things are heading, let's hear them now.
> Best Regards,
> Andy Robinson
> p.s. one thing I would be very interested to hear is what people's
> angles are - relevant experience, willingness to help out, needs for
> solutions etc!

Marc-Andre Lemburg
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/