[Python-Dev] Internationalization Toolkit

M.-A. Lemburg mal@lemburg.com
Wed, 10 Nov 1999 12:42:00 +0100


Andy Robinson wrote:
> 
> In general, I like this proposal a lot, but I think it
> only covers half the story.  How we actually build the
> encoder/decoder for each encoding is a very big issue.
>  Thoughts on this below.
> 
> First, a little nit
> >  u = u'<utf-8 encoded Python string>'
> I don't like using funny prime characters - why not an
> explicit function like "utf8()"

u = unicode('...I am UTF8...','utf-8')

will do just that. I've moved to Tim's proposal with the
\uXXXX encoding for u'', BTW.
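
For comparison, a minimal sketch of what this call does, written in modern Python 3 terms (where `bytes.decode()` plays the role of the proposed `unicode()` constructor -- this is not the API proposed here):

```python
# Sketch in modern Python 3 terms: bytes.decode() plays the role of
# the proposed unicode(<string>, <encoding>) constructor.
data = b'...I am UTF8... \xc3\xa9'      # UTF-8 encoded byte string
u = data.decode('utf-8')                # Unicode object

# The \uXXXX escape proposed for u'' literals:
assert u[-1] == '\u00e9'
```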
 
> On to the important stuff:
> >  unicodec.register(<encname>,<encoder>,<decoder>
> >  [,<stream_encoder>, <stream_decoder>])
> 
> > This registers the codecs under the given encoding
> > name in the module global dictionary
> > unicodec.codecs. Stream codecs are optional:
> > the unicodec module will provide appropriate
> > wrappers around <encoder> and
> > <decoder> if not given.
> 
> I would MUCH prefer a single 'Encoding' class or type
> to wrap up these things, rather than up to four
> disconnected objects/functions.  Essentially it would
> be an interface standard and would offer methods to do
> the four things above.
> 
> There are several reasons for this.
>
> ...
>
> In summary, firm up the concept of an Encoding object
> and give it room to grow - that's the key to
> real-world usefulness.   If people feel the same way
> I'll have a go at an interface for that, and try to show
> how it would have simplified specific problems I have
> faced.

Ok, you have a point there.

Here's a proposal (note that this only defines an interface,
not a class structure):

Codec Interface Definition:
---------------------------

The following base class should be defined in the module unicodec.

class Codec:

    def encode(self, u):

        """ Return the Unicode object u encoded as a Python string.

        """
        ...

    def decode(self, s):

        """ Return an equivalent Unicode object for the encoded Python
            string s.

        """
        ...

    def dump(self, u, stream, slice=None):

        """ Write the Unicode object's contents to the stream in
            encoded form.

            stream must be a file-like object open for writing binary
            data.

            If slice is given (as a slice object), only the sliced part
            of the Unicode object is written.

        """
        ... the base class should provide a default implementation
            of this method using self.encode ...

    def load(self, stream, length=None):

        """ Read an encoded string (up to <length> bytes) from the
            stream and return an equivalent Unicode object.

            stream must be a file-like object open for reading binary
            data.

            If length is given, only length bytes are read. Note that
            this can cause the decoding algorithm to fail due to
            truncations in the encoding.

        """
        ... the base class should provide a default implementation
            of this method using self.decode ...

Codecs should raise a UnicodeError if the conversion is
not possible.

The unicodec.register() API does not require a subclass of this
base class; only the four given methods must be present. This
allows codecs to be written as extension types.

XXX Still to be discussed: 

    · support for line breaks (see
      http://www.unicode.org/unicode/reports/tr13/ )

    · support for case conversion: 

      Problems: string lengths can change due to multiple
      characters being mapped to a single new one, capital letters
      starting a word can be different than ones occurring in the
      middle, there are locale dependent deviations from the standard
      mappings.

    · support for numbers, digits, whitespace, etc.

    · support (or no support) for private code point areas
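
The length-change problem under case conversion is easy to demonstrate
with the standard Unicode mappings (shown here in modern Python 3):

```python
# German sharp s: uppercasing maps one character to two,
# so the string grows from 6 to 7 characters.
s = 'stra\N{LATIN SMALL LETTER SHARP S}e'   # 'straße'
assert s.upper() == 'STRASSE'
assert len(s) == 6 and len(s.upper()) == 7
```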


> We also need to think about where encoding info will
> live.  You cannot avoid mapping tables, although you
> can hide them inside code modules or pickled objects
> if you want.  Should there be a standard
> "..\Python\Enc" directory?

Mapping tables should be incorporated into the codec
modules preferably as static C data. That way multiple
processes can share the same data.
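
As a sketch of the idea in Python terms, a table-driven single-byte
codec might look like this (the two-entry table is purely
hypothetical; a real codec would cover all 256 byte values and keep
the table as static C data):

```python
# Hypothetical decoding table for a toy single-byte encoding.
DECODING_TABLE = {
    0x41: '\u0391',   # byte 'A' -> GREEK CAPITAL LETTER ALPHA
    0x42: '\u0392',   # byte 'B' -> GREEK CAPITAL LETTER BETA
}
# The reverse table for encoding, derived from the same data.
ENCODING_TABLE = {c: b for b, c in DECODING_TABLE.items()}

def toy_decode(data):
    # Iterating over a byte string yields integer byte values.
    return ''.join(DECODING_TABLE[b] for b in data)

def toy_encode(text):
    return bytes(ENCODING_TABLE[c] for c in text)
```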

> And we're going to need some kind of testing and
> certification procedure when adding new encodings.
> This stuff has to be right.

I will have to rely on your cooperation for the test data.
Roundtrip testing is easy to implement, but I will also have
to verify the output against prechecked data, which can probably
only be created using visual tools I don't have access to
(e.g. a Japanese Windows installation).
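
Roundtrip testing of this kind can be sketched as follows, here using
the encodings shipped with modern Python (the function name is
illustrative):

```python
import codecs

def roundtrips(encoding, samples):
    """True if every sample string survives encode/decode unchanged.

    Note that a round trip alone cannot catch a codec that maps
    characters consistently but wrongly -- hence the need for
    prechecked reference data.
    """
    return all(
        codecs.decode(codecs.encode(u, encoding), encoding) == u
        for u in samples
    )
```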
 
> Guido asked about TypedString.  This can probably be
> done on top of the built-in stuff - it is just a
> convenience which would clarify intent, reduce lines
> of code and prevent people shooting themselves in the
> foot when juggling a lot of strings in different
> (non-Unicode) encodings.  I can do a Python module to
> implement that on top of whatever is built.

Ok.

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    51 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/