
In general, I like this proposal a lot, but I think it only covers half the story. How we actually build the encoder/decoder for each encoding is a very big issue. Thoughts on this below. First, a little nit.
On to the important stuff:
unicodec.register(<encname>,<encoder>,<decoder> [,<stream_encoder>, <stream_decoder>])
I would MUCH prefer a single 'Encoding' class or type to wrap up these things, rather than up to four disconnected objects/functions. Essentially it would be an interface standard and would offer methods to do the four things above. There are several reasons for this.

(1) There are quite a lot of things you might want to do with an encoding object, and we could easily extend the interface in future. As a minimum, give it the four methods implied by the above, two of which can be defaults. But I'd like an encoding to be able to tell me the set of characters to which it applies; validate a string; and maybe tell me if it is a subset or superset of another.

(2) Especially with double-byte encodings, they will need to load up some kind of database on startup and use this for both encoding and decoding - much better to share it and encapsulate it inside one object.

(3) For some languages, there are extra functions wanted. For Japanese, you need two or three functions to expand half-width to full-width katakana, and to convert double-byte English to single-byte and vice versa. A Japanese encoding object would be a handy place to put this knowledge.

(4) In the real world you get many encodings which are subtle variations of the same thing, plus or minus a few characters. One bit of code might be able to share the work of several encodings by setting a few flags. This is certainly true of Japanese.

(5) Encoding/decoding algorithms can be program or data or (very often) a bit of both. We have not yet discussed where to keep all the mapping tables, but if data is involved it should be hidden in an object.

(6) See my comments on a state machine for doing the encodings. If this is done well, we might have two different standard objects which conform to the Encoding interface (a really light one for single-byte encodings, and a bigger one for multi-byte), and everything else could be data driven.
(7) Easy to grow - encodings can be prototyped and proven in Python, and ported to C if needed or when ready.

In summary, firm up the concept of an Encoding object and give it room to grow - that's the key to real-world usefulness. If people feel the same way, I'll have a go at an interface for that, and try to show how it would have simplified specific problems I have faced.

We also need to think about where encoding info will live. You cannot avoid mapping tables, although you can hide them inside code modules or pickled objects if you want. Should there be a standard "..\Python\Enc" directory? And we're going to need some kind of testing and certification procedure when adding new encodings. This stuff has to be right.

Guido asked about TypedString. This can probably be done on top of the built-in stuff - it is just a convenience which would clarify intent, reduce lines of code and prevent people shooting themselves in the foot when juggling a lot of strings in different (non-Unicode) encodings. I can do a Python module to implement that on top of whatever is built.

Regards,

Andy
=====
Andy Robinson
Robinson Analytics Ltd.
------------------
My opinions are the official policy of Robinson Analytics Ltd. They just vary from day to day.
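A rough sketch, in Python, of the kind of Encoding interface described above. All class and method names here are illustrative assumptions, not a settled design; the subset test and validation default are just one way the points above could be realized:

```python
class Encoding:
    """Sketch of a unified Encoding interface (names illustrative)."""

    name = None

    def encode(self, u):
        """Return the Unicode string u as an encoded byte string."""
        raise NotImplementedError

    def decode(self, s):
        """Return the byte string s decoded to a Unicode string."""
        raise NotImplementedError

    def validate(self, s):
        """Return true if s is a legal byte sequence in this encoding."""
        try:
            self.decode(s)
            return True
        except ValueError:
            return False

    def characters(self):
        """Return the set of Unicode characters this encoding covers."""
        raise NotImplementedError

    def is_subset_of(self, other):
        """True if every character we cover is also covered by other."""
        return self.characters() <= other.characters()


class Latin1Encoding(Encoding):
    """Light single-byte example built on the built-in codec machinery."""

    name = 'latin-1'

    def encode(self, u):
        return u.encode('latin-1')

    def decode(self, s):
        return s.decode('latin-1')

    def characters(self):
        # Latin-1 maps bytes 0-255 directly to the first 256 code points.
        return set(map(chr, range(256)))
```

Extra per-language helpers (e.g. the katakana-width conversions mentioned for Japanese) would simply be additional methods on the relevant subclass, which is the encapsulation argument in point (3).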

Andy Robinson wrote:
u = unicode('...I am UTF8...','utf-8') will do just that. I've moved to Tim's proposal with the \uXXXX encoding for u'', BTW.
Ok, you have a point there. Here's a proposal (note that this only defines an interface, not a class structure):

Codec Interface Definition:
---------------------------

The following base class should be defined in the module unicodec.

class Codec:

    def encode(self,u):
        """ Return the Unicode object u encoded as Python string. """
        ...

    def decode(self,s):
        """ Return an equivalent Unicode object for the encoded
            Python string s.
        """
        ...

    def dump(self,u,stream,slice=None):
        """ Writes the Unicode object's contents encoded to the stream.

            stream must be a file-like object open for writing
            binary data.

            If slice is given (as slice object), only the sliced part
            of the Unicode object is written.
        """
        ... the base class should provide a default implementation
            of this method using self.encode ...

    def load(self,stream,length=None):
        """ Reads an encoded string (up to <length> bytes) from the
            stream and returns an equivalent Unicode object.

            stream must be a file-like object open for reading
            binary data.

            If length is given, only length bytes are read. Note that
            this can cause the decoding algorithm to fail due to
            truncations in the encoding.
        """
        ... the base class should provide a default implementation
            of this method using self.decode ...

Codecs should raise a UnicodeError in case the conversion is not possible.

It is not required by the unicodec.register() API to provide a subclass of this base class; only the 4 given methods must be present. This allows writing Codecs as extension types.

XXX Still to be discussed:

· support for line breaks (see http://www.unicode.org/unicode/reports/tr13/ )

· support for case conversion: Problems: string lengths can change due to multiple characters being mapped to a single new one; capital letters starting a word can be different than ones occurring in the middle; there are locale dependent deviations from the standard mappings.

· support for numbers, digits, whitespace, etc.

· support (or no support) for private code point areas
Mapping tables should be incorporated into the codec modules preferably as static C data. That way multiple processes can share the same data.
I will have to rely on your cooperation for the test data. Roundtrip testing is easy to implement, but I will also have to verify the output against prechecked data which is probably only creatable using visual tools to which I don't have access (e.g. a Japanese Windows installation).
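The two kinds of test mentioned - roundtripping and comparison against prechecked data - are each only a few lines. This is a sketch, with UTF-16-LE chosen for the golden-data example simply because its expected bytes are easy to state by hand:

```python
def roundtrip_ok(encoding, text):
    """Encode then decode and check we get the original text back."""
    return text.encode(encoding).decode(encoding) == text


def matches_golden(encoding, text, expected_bytes):
    """Verify encoder output against independently prechecked byte data."""
    return text.encode(encoding) == expected_bytes
```

Roundtripping alone cannot catch an encoder and decoder that are consistently wrong in the same way, which is why the prechecked golden data is still needed.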
Ok. -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 51 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/

M.-A. Lemburg writes:
def encode(self,u):
""" Return the Unicode object u encoded as Python string.
This should accept an optional slice parameter, and use it in the same way as .dump().
def dump(self,u,stream,slice=None):
...
def load(self,stream,length=None):
Why not have something like .wrapFile(f) that returns a file-like object with all the file methods implemented, and doing the "right thing" regarding encoding/decoding? That way, the new file-like object can be used directly with code that works with files and doesn't care whether it uses 8-bit or unicode strings.
Codecs should raise an UnicodeError in case the conversion is not possible.
I think that should be ValueError, or UnicodeError should be a subclass of ValueError. (Can the -X interpreter option be removed yet?) -Fred -- Fred L. Drake, Jr. <fdrake@acm.org> Corporation for National Research Initiatives
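The point of the subclassing is backward compatibility: existing handlers that catch ValueError keep working when the more specific exception is raised. Modern Python does in fact define UnicodeError as a ValueError subclass; the local stand-in class below is purely illustrative:

```python
class MyUnicodeError(ValueError):
    """Illustrative stand-in for the proposed UnicodeError."""
    pass


def convert(raise_it):
    """Simulate a conversion that may fail with the new exception."""
    try:
        if raise_it:
            raise MyUnicodeError("conversion not possible")
        return "ok"
    except ValueError as exc:   # pre-existing handler still catches it
        return type(exc).__name__
```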

"Fred L. Drake, Jr." wrote:
Ok.
See File Output of the latest version:

File/Stream Output:
-------------------

Since file.write(object) and most other stream writers use the 's#' argument parsing marker, the buffer interface implementation determines the encoding to use (see Buffer Interface).

For explicit handling of Unicode using files, the unicodec module could provide stream wrappers which provide transparent encoding/decoding for any open stream (file-like object):

import unicodec
file = open('mytext.txt','rb')
ufile = unicodec.stream(file,'utf-16')
u = ufile.read()
...
ufile.close()

XXX unicodec.file(<filename>,<mode>,<encname>) could be provided as short-hand for unicodec.file(open(<filename>,<mode>),<encname>) which also assures that <mode> contains the 'b' character when needed.
Ok.
(Can the -X interpreter option be removed yet?)
Doesn't Python convert class exceptions to strings when -X is used? I would guess that many scripts already rely on the class-based mechanism (much of my stuff does for sure), so by the time 1.6 is out, I think -X should be considered an option to run pre-1.5 code rather than using it for performance reasons.

-- Marc-Andre Lemburg

M.-A. Lemburg writes:
Sounds good to me! I guess I just missed it; there's been so much going on lately.
Actually, I'd call it unicodec.open(). I asked:
(Can the -X interpreter option be removed yet?)
You commented:
Gosh, I never thought of it as a performance issue! What I'd like to do is avoid code like this:

try:
    class UnicodeError(ValueError):
        # well, something would probably go here...
        pass
except TypeError:
    class UnicodeError:
        # something slightly different for this one...
        pass

Trying to use class exceptions can be really tedious, and often I'd like to pick up the stuff from Exception.

-Fred

"M" == M <mal@lemburg.com> writes:
M> Doesn't Python convert class exceptions to strings when -X is
M> used ? I would guess that many scripts already rely on the
M> class based mechanism (much of my stuff does for sure), so by
M> the time 1.6 is out, I think -X should be considered an option
M> to run pre 1.5 code rather than using it for performance
M> reasons.

This is a little off-topic so I'll be brief. When using -X Python never even creates the class exceptions, so it isn't really a conversion. It just uses string exceptions and tries to craft tuples for what would be the superclasses in the class-based exception hierarchy. Yes, class-based exceptions are a bit of a performance hit when you are catching exceptions in Python (because they need to be instantiated), but they're just so darn *useful*. I wouldn't mind seeing the -X option go away for 1.6.

-Barry

[MAL]
Codecs should raise an UnicodeError in case the conversion is not possible.
[Fred L. Drake, Jr.]
[MAL]
-X is a red herring. That is, do what seems best without regard for -X. I already added one subclass exception to the CVS tree (UnboundLocalError as a subclass of NameError), and in doing that had to figure out how to make it do the right thing under -X too. It's a bit clumsy to arrange, but not a problem.
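The UnboundLocalError case mentioned behaves the same way under the subclassing scheme: code that catches NameError also catches the new, more specific exception. A small demonstration (current Python retains exactly this hierarchy):

```python
def read_before_assign():
    """Trigger UnboundLocalError: x is local (assigned below) but unbound."""
    try:
        x = x + 1              # raises UnboundLocalError at the read of x
    except NameError as exc:   # a NameError handler catches the subclass too
        return type(exc).__name__
```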

participants (5)
-
Andy Robinson
-
Barry A. Warsaw
-
Fred L. Drake, Jr.
-
M.-A. Lemburg
-
Tim Peters