[Python-Dev] Internationalization Toolkit

M.-A. Lemburg mal@lemburg.com
Wed, 10 Nov 1999 10:04:39 +0100


Mark Hammond wrote:
> 
> > I think his proposal will go a long way towards your toolkit.  I
> > hope to hear soon from anybody who disagrees with Marc-Andre's proposal,
> 
> No disagreement as such, but a small hole:
> 
> From the proposal:
> 
> Internal Argument Parsing:
> --------------------------
> ...
> 's':    For Unicode objects: auto convert them to the <default encoding>
>         and return a pointer to the object's <defencbuf> buffer.
> 
> --
> Excellent - if someone passes a Unicode object, it can be
> auto-converted to a string.  This will allow "open()" to accept
> Unicode strings.

Well, almost... it depends on the current value of <default encoding>.
If it's UTF8 and you only use plain ASCII characters, the above is
indeed true, but UTF8 goes far beyond ASCII and can use up to 3 bytes
per character (for UCS2 code points; even more for UCS4). With
<default encoding> set to some other, more exotic encoding, this is
likely to fail, though.
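Just to make the size issue concrete, here's a quick sketch in plain
C of how many bytes UTF8 needs per code point (following the original
UTF8 definition, which covers full 31-bit UCS4 values; the function
name is made up):

    #include <stdio.h>

    /* Bytes needed to encode one code point in UTF8: ASCII stays at
       one byte, the rest of UCS2 takes up to three, UCS4 up to six. */
    static int utf8_length(unsigned long code)
    {
        if (code < 0x80)      return 1;  /* plain ASCII */
        if (code < 0x800)     return 2;
        if (code < 0x10000)   return 3;  /* rest of UCS2 */
        if (code < 0x200000)  return 4;
        if (code < 0x4000000) return 5;
        return 6;                        /* up to 0x7FFFFFFF */
    }

    int main(void)
    {
        printf("'A'     -> %d byte(s)\n", utf8_length(0x41));
        printf("U+20AC  -> %d byte(s)\n", utf8_length(0x20AC));
        printf("U+10300 -> %d byte(s)\n", utf8_length(0x10300UL));
        return 0;
    }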
 
> However, there doesn't appear to be a reverse.  E.g., if my extension
> module interfaces to a library that uses Unicode natively, how can I
> get a Unicode object when the user passes a string?  If I had to
> explicitly check for a string, then check for a Unicode on failure, it
> would get messy pretty quickly...  Is it not possible to have "U" also
> do a conversion?

"U" is meant to simplify checks for Unicode objects, much like "S".
It returns a reference to the object. Auto-conversions are not possible
due to this, because they would create new objects which don't get
properly garbage collected later on.
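The explicit route you describe doesn't have that problem, because the
caller owns the new reference and can release it. A sketch (the
PyUnicode_* names are hypothetical here, just what the proposal
suggests they might end up looking like):

    #include "Python.h"

    /* Accept either a Unicode object or a string and always return
       a NEW reference to a Unicode object which the caller must
       Py_DECREF.  PyArg_ParseTuple cannot do this itself: its output
       slots are borrowed pointers with no release hook, so an object
       created during parsing would simply leak. */
    static PyObject *get_unicode_arg(PyObject *arg)
    {
        if (PyUnicode_Check(arg)) {
            Py_INCREF(arg);             /* caller owns this reference */
            return arg;
        }
        if (PyString_Check(arg))
            /* explicit conversion: creates a new object, but the
               caller knows it has to release it */
            return PyUnicode_FromObject(arg);
        PyErr_SetString(PyExc_TypeError, "expected string or unicode");
        return NULL;
    }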

Another problem is that Unicode types differ between platforms
(MS VCLIB uses a 16-bit wchar_t, while glibc2 uses a 32-bit wchar_t).
Depending on the internal format of Unicode objects, this could mean
calling different conversion APIs.

BTW, I'm still not too sure about the underlying internal format.
The problem here is that Unicode started out as a 2-byte fixed-length
representation (UCS2) but then shifted towards a 4-byte fixed-length
representation known as UCS4. Since having 4 bytes per character is a
hard sell to customers, UTF16 was created to stuff the UCS4 code
points (this is what character entities are called in Unicode) into
2 bytes... with a variable-length encoding.
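The stuffing itself is simple enough; here's a sketch of the standard
surrogate pair trick (plain C, names made up): subtract 0x10000 and
split the remaining 20 bits across two reserved 10-bit ranges.

    #include <assert.h>
    #include <stdio.h>

    typedef unsigned short utf16_t;
    typedef unsigned long  ucs4_t;

    /* Encode a UCS4 code point beyond UCS2 as a UTF16 surrogate pair. */
    static void ucs4_to_utf16(ucs4_t code, utf16_t pair[2])
    {
        assert(code >= 0x10000 && code <= 0x10FFFF);
        code -= 0x10000;                              /* 20 bits left */
        pair[0] = (utf16_t)(0xD800 + (code >> 10));   /* high surrogate */
        pair[1] = (utf16_t)(0xDC00 + (code & 0x3FF)); /* low surrogate */
    }

    int main(void)
    {
        utf16_t pair[2];
        ucs4_to_utf16(0x10300UL, pair);   /* a code point beyond UCS2 */
        printf("U+10300 -> %04X %04X\n", pair[0], pair[1]);
        return 0;
    }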

Some platforms that got into the Unicode business early, such as the
MS ones, use UCS2 as wchar_t, while more recent ones (e.g. glibc2 on
Linux) use UCS4 for wchar_t. I haven't yet checked in what ways the
two are compatible (I would suspect that UCS2 codes map to UCS4 values
with the top two bytes set to 0), but would like to hear whether it
wouldn't be a better idea to use UTF16 as the internal format. The
latter handles most characters in 2 bytes, and conversion to UCS2 or
UCS4 should be fast. Still, conversion to UCS2 could fail for code
points outside the 2-byte range.
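If that suspicion is right (and I believe it is: UCS2 is just the
subset of UCS4 with the top two bytes zero), the two directions look
like this -- widening is a plain zero-extension, narrowing has a
failure case:

    #include <stdio.h>

    typedef unsigned short ucs2_t;
    typedef unsigned long  ucs4_t;

    /* Widening is trivial: the same number, top bytes zero. */
    static ucs4_t ucs2_to_ucs4(ucs2_t code)
    {
        return (ucs4_t)code;
    }

    /* Narrowing can fail: anything above 0xFFFF has no UCS2 form.
       Returns 0 on success, -1 on overflow. */
    static int ucs4_to_ucs2(ucs4_t code, ucs2_t *out)
    {
        if (code > 0xFFFF)
            return -1;
        *out = (ucs2_t)code;
        return 0;
    }

    int main(void)
    {
        ucs2_t narrow;
        printf("widen  0x20AC  -> 0x%04lX\n", ucs2_to_ucs4(0x20AC));
        printf("narrow 0x10300 -> %s\n",
               ucs4_to_ucs2(0x10300UL, &narrow) == 0 ? "ok" : "fails");
        return 0;
    }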

The downside of using UTF16: it is a variable-length format, so
iterating over it will be slower than iterating over UCS4.
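E.g. even just counting characters needs a test per 2-byte unit to
skip low surrogates, whereas with UCS4 the count is simply the number
of array elements (again just a sketch):

    #include <stdio.h>
    #include <stddef.h>

    typedef unsigned short utf16_t;

    /* Count characters in a UTF16 buffer: high surrogates mark
       two-unit characters, so every unit has to be inspected. */
    static size_t utf16_char_count(const utf16_t *buf, size_t units)
    {
        size_t i, count = 0;
        for (i = 0; i < units; i++) {
            if (buf[i] >= 0xD800 && buf[i] <= 0xDBFF)
                i++;                     /* skip the low surrogate */
            count++;
        }
        return count;
    }

    int main(void)
    {
        /* "A" plus one non-UCS2 character as a surrogate pair:
           3 units, but only 2 characters */
        utf16_t buf[] = { 0x0041, 0xD800, 0xDF00 };
        printf("%lu characters in 3 units\n",
               (unsigned long)utf16_char_count(buf, 3));
        return 0;
    }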

Simply sticking to UCS2 is probably out of the question, since
Unicode 3.0 requires UCS4 and we are targeting Unicode 3.0.

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    51 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/