[I18n-sig] How does Python Unicode treat surrogates?

Guido van Rossum guido@digicool.com
Mon, 25 Jun 2001 15:12:31 -0400


> That's because len(u) has nothing to do with the number of 
> characters in the string, it only counts the code units (Py_UNICODEs)
> which are used to represent characters. The same is true for normal
> strings, e.g. UTF-8 can use between 1-4 code units (bytes in this
> case) for a single code point and in Unicode you can create characters
> by combining code units 

Total agreement.
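
To make the code unit / code point distinction concrete, here is a
small sketch in present-day Python 3 (not code from this thread).  It
assumes Python 3.3 or later, where len() on a str counts code points;
the helper utf16_units() is a hypothetical name introduced only for
this illustration.

```python
import struct

def utf16_units(s):
    """Return the UTF-16 code units of s as a list of ints (hypothetical helper)."""
    data = s.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(data) // 2), data))

# 'a', U+00E9 (e with acute), U+10400 (outside the BMP, needs a surrogate pair)
s = "a\u00e9\U00010400"

code_points = len(s)                  # 3 code points
units_16 = len(utf16_units(s))        # 4 UTF-16 code units (1 + 1 + 2)
units_8 = len(s.encode("utf-8"))      # 7 UTF-8 code units (1 + 2 + 4 bytes)

# A base letter plus a combining accent: one visible character, two code points.
combined = len("e\u0301")             # 2

print(code_points, units_16, units_8, combined)
```

The same text yields a different "length" at each level, which is why
an index has to say which unit it counts.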

> As Mark Davis pointed out:
> 
> """In most people's experience, it is best to leave the low level interfaces
> with indices in terms of code units, then supply some utility routines that
> tell you information about code points. The most useful are:
> 
> - given a string and an index into that string, how many code points are
>   before it?
> - given a string and a number of code points, what is the lowest index that
>   contains them?

I understand the first and the third, but what is this one?  Is it a
search?

> - given a string and an index into that string, is the index on a code point
>   boundary?
> """
>  
> Python could use some more Unicode methods to answer these
> questions.

Agreed (see my other post responding to Tom Emerson).
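
The three utility routines in the quoted list could be sketched as
follows.  This is one possible reading, not an API that Python
provides: the function names are hypothetical, the string is modelled
as a list of 16-bit code units, and the second routine is interpreted
as "the smallest index i such that units[:i] holds n complete code
points", which is only a guess at what was meant.

```python
def is_high(u):
    """True if u is a high (leading) surrogate code unit."""
    return 0xD800 <= u <= 0xDBFF

def is_low(u):
    """True if u is a low (trailing) surrogate code unit."""
    return 0xDC00 <= u <= 0xDFFF

def is_boundary(units, i):
    """True unless index i falls between the two halves of a surrogate pair."""
    if i <= 0 or i >= len(units):
        return True
    return not (is_low(units[i]) and is_high(units[i - 1]))

def count_code_points(units, index):
    """Number of code points that start before the given code unit index."""
    stop = min(index, len(units))
    return sum(1 for j in range(stop) if is_boundary(units, j))

def index_after(units, n):
    """Lowest index i such that units[:i] contains n complete code points."""
    i = 0
    for _ in range(n):
        if i >= len(units):
            raise IndexError("string has fewer than n code points")
        paired = is_high(units[i]) and i + 1 < len(units) and is_low(units[i + 1])
        i += 2 if paired else 1
    return i

# 'a', U+10400 as the surrogate pair D801 DC00, 'b'
units = [0x61, 0xD801, 0xDC00, 0x62]
print(count_code_points(units, 3))   # 2
print(index_after(units, 2))         # 3
print(is_boundary(units, 2))         # False
```

All three leave plain indexing in terms of code units, as the quoted
advice recommends.  Unpaired surrogates are treated here as code
points in their own right, which is an assumption of the sketch.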

> > > Python currently only has minimal support for surrogates, so
> > > purists would say that we support UCS-2. However, we deliberately
> > > chose this path to be able to upgrade to UTF-16 at some later
> > > point in time and it seems that this time has now come.
> > 
> > How hard would it be to also change the party line about what the
> > encoding used is based on whether we use 2 or 4 bytes?  We could even
> > give three choices: UCS-2 (current situation, no surrogates), UTF-16
> > (16-bit items with some surrogate support) or UCS-4 (32-bit items)?
> 
> Ehm... what are you getting at here ?

Earlier on you said it would be hard to offer a config-time choice
between UTF-16 and UCS-4.  I'm still trying to figure out why.  Given
the additional stuff I've learned now about surrogates, it doesn't
make sense to choose between UCS-2 and UTF-16; the surrogate handling
can always be present.

So let me rephrase the question.  How hard would it be to offer the
config-time choice between UCS-4 and UTF-16?  If it's hard, why?
(I've heard you say that it's hard before, but I still don't
understand the problem.)

> > > > I'd be happy to make the configuration choice between UTF-16 and
> > > > UCS-4, if that's doable.
> > >
> > > Not easily, I'm afraid.
> > 
> > Can you explain why this is not easy?
> 
> Because choosing whether or not to support surrogates is a 
> fundamental choice which affects far more than just the way you
> access storage. Surrogates introduce variable width characters:
> some characters use two or more Py_UNICODE code units while (most)
> others only use one.
> 
> Remember when we discussed which internal format to use or
> which default encoding to apply ? We ruled out UTF-8 because
> it fails badly when it comes to slicing, concatenation, indexing,
> etc. 
> 
> UTF-16 is much less painful as most code points only take
> up a single code unit, but it still introduces a break in concept.

Hm, it sounds like you have the same problem that I had with Tom
Emerson's suggestion to support Unicode before he clarified it.

If we make a clean distinction between characters and storage units,
and if we stick to the rule that u[i] accesses a storage unit, what's the
conceptual difficulty?  There might be a separate method u.char(i)
which returns the *character* starting at u[i], or "" if u[i] is a
low-surrogate.  That could be all we need to support surrogates.  How
bad is that?  (These could even continue to be supported when the
storage uses UCS-4; there, u.char(i) would always be u[i], until
someone comes up with a 64-bit character set. ;-)
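
A minimal sketch of what such a u.char(i) might do, written as a
free function over a list of 16-bit storage units.  The function name
and the list representation are illustrative only; no such method
exists on Python strings.  Returning an unpaired high surrogate
unchanged is an assumption, since the proposal above does not cover
that case.

```python
def char(units, i):
    """Return the character starting at units[i], or "" on a low surrogate."""
    u = units[i]
    if 0xDC00 <= u <= 0xDFFF:
        # Low surrogate: this is the second half of a character, so no
        # character starts here.
        return ""
    if 0xD800 <= u <= 0xDBFF and i + 1 < len(units):
        low = units[i + 1]
        if 0xDC00 <= low <= 0xDFFF:
            # Combine the pair into one code point above U+FFFF.
            return chr(0x10000 + ((u - 0xD800) << 10) + (low - 0xDC00))
    # Ordinary BMP unit (or an unpaired high surrogate, returned as-is
    # by assumption).
    return chr(u)

# 'a', U+10400 as the surrogate pair D801 DC00, 'b'
units = [0x61, 0xD801, 0xDC00, 0x62]
print([char(units, i) for i in range(len(units))])
```

With 32-bit storage units there are no surrogate pairs, so the
function reduces to returning chr(units[i]), matching the remark that
u.char(i) would then always equal u[i].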

> > I buy that as an argument for supporting UTF-16, but not for cutting
> > off the road to supporting UCS-4 for those users who would like to opt
> > in.
> 
> That was not my point. I just wanted to point out how well UTF-16
> is being accepted out there and that we are in good company by
> moving from UCS-2 to UTF-16 as current internal format.

Good!  I agree.

> I don't want to cut off the road to UCS-4, I just want to make
> clear that UTF-16 is a good choice and one which will last at
> least some more years. We can then always decide to move on
> to UCS-4 for the internal storage format.

Agreed again.

--Guido van Rossum (home page: http://www.python.org/~guido/)