[I18n-sig] How does Python Unicode treat surrogates?

Guido van Rossum guido@digicool.com
Mon, 25 Jun 2001 12:20:23 -0400


> > Shouldn't there be a conversion routine between wchar_t[] and
> > Py_UNICODE[] instead of assuming they have the same format?  This will
> > come up more often, and Linux has sizeof(wchar_t) == 4 I believe.
> > (Which suggests that others disagree on the waste of space.)
> 
> There are conversion routines which map between Py_UNICODE
> and wchar_t in Python and these make use of the fact that
> e.g. on Windows Py_UNICODE can use wchar_t as basis which makes
> the conversion very fast.
> 
> On Linux (which uses 4 bytes per wchar_t) the routine inserts
> tons of zeros to make Tux happy :-)

Maybe this code should be restructured so that it widens or narrows
the characters depending on the size difference between Py_UNICODE
and wchar_t, rather than relying on platform assumptions.
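Something like this, perhaps (just a sketch with made-up names, not
the actual routine in the Unicode implementation):

    /* Sketch only: copy wchar_t data into a Py_UNICODE buffer element
     * by element, so the same code works whether wchar_t is 2 or 4
     * bytes wide.  When the sizes happen to match, a straight memcpy
     * is enough; otherwise the loop widens (or narrows) each value. */
    #include <string.h>
    #include <wchar.h>

    typedef unsigned short Py_UNICODE;   /* 2 bytes in this sketch */

    static void
    copy_wchar_to_unicode(Py_UNICODE *dst, const wchar_t *src, size_t n)
    {
        size_t i;
        if (sizeof(wchar_t) == sizeof(Py_UNICODE)) {
            memcpy(dst, src, n * sizeof(Py_UNICODE));
            return;
        }
        for (i = 0; i < n; i++)
            dst[i] = (Py_UNICODE)src[i];  /* widen or truncate per element */
    }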

If this is the only thing that keeps us from having a configuration
OPTION to make Py_UNICODE 32-bit wide, I'd say let's fix it.

> > > > > BTW, Python's Unicode implementation is bound to the standard
> > > > > defined at www.unicode.org; moving over to ISO 10646 is not an
> > > > > option.
> > > >
> > > > Can you elaborate? How can you rule out that option that easily?
> > >
> > > It is not an option because we chose Unicode as our basis for
> > > i18n work and not the ISO 10646 Uniform Character Set. I'd rather
> > > have those two camps fight over the details of the Unicode standard
> > > than try to fix anything related to the differences between the two
> > > in Python by mixing them.
> > 
> > Agreed.  But be prepared that at some point in the future the Unicode
> > world might end up agreeing on 4 bytes too...
> 
> No problem... we can change to 4 byte values too if the world
> agrees on 4 bytes per character. However, 2 bytes or 4 bytes
> is an implementation detail and not part of the Unicode standard
> itself.

But UTF-16 vs. UCS-4 is not an implementation detail!

If we store 4 bytes per character, we should treat surrogates
differently.  I don't know where those would be converted -- probably
in the UTF-16 to UCS-4 codec.
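For what it's worth, the surrogate arithmetic such a codec would have
to do is simple enough (illustrative only, not an existing function in
the codec machinery):

    /* A high surrogate (0xD800-0xDBFF) followed by a low surrogate
     * (0xDC00-0xDFFF) encodes a single character in the range
     * 0x10000-0x10FFFF.  The codec would combine the pair like this: */
    unsigned long
    combine_surrogate_pair(unsigned int hi, unsigned int lo)
    {
        return 0x10000UL + (((unsigned long)hi - 0xD800UL) << 10)
                         + ((unsigned long)lo - 0xDC00UL);
    }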

I'd be happy to make the configuration choice between UTF-16 and
UCS-4, if that's doable.
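The choice could presumably be as simple as a compile-time switch
along these lines (the macro name here is invented for illustration):

    /* Hypothetical configure-time switch.  Building with
     * -DUSE_UCS4_UNICODE would give a 4-byte Py_UNICODE; the default
     * would stay at 2-byte UTF-16 code units. */
    #ifdef USE_UCS4_UNICODE
    typedef unsigned int   Py_UNICODE;   /* UCS-4 */
    #else
    typedef unsigned short Py_UNICODE;   /* UTF-16 */
    #endif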

> 4 bytes per character makes things at the C level much easier
> and this is probably why the GNU C lib team chose 4 bytes. Other
> programming languages like Java and platforms like Windows
> chose 2-byte UTF-16 as internal format. I guess it's up to the
> user acceptance to choose between the two. 2 bytes means more
> work on the implementor, 4 bytes means more $$$ for Micron et al. ;-)

My 1-year-old laptop has a 10 GB hard drive and 128 MB RAM.  Current
machines have 2-4 times that.  How much of that space will be
wasted on extra Unicode?  For a typical user, most of it is MP3s
anyway. :-)

> > > > And why can't Python support the two standards simultaneously?
> > >
> > > Why would you want to support two standards for the same thing ?
> > 
> > Well, we support ASCII and Unicode. :-)
> > 
> > If ISO 10646 becomes important to our users, we'll have to support
> > it, if only by providing a codec.
> 
> This is different: ISO 10646 is a competing standard, not just a 
> different encoding.

Oh.  I didn't know.  How does it differ from Unicode?  What's the user
acceptance?

--Guido van Rossum (home page: http://www.python.org/~guido/)