[I18n-sig] How does Python Unicode treat surrogates?

M.-A. Lemburg mal@lemburg.com
Mon, 25 Jun 2001 22:05:36 +0200


Guido van Rossum wrote:
> 
> > That's because len(u) has nothing to do with the number of
> > characters in the string; it only counts the code units (Py_UNICODEs)
> > which are used to represent characters. The same is true for normal
> > strings, e.g. UTF-8 can use between 1-4 code units (bytes in this
> > case) for a single code point, and in Unicode you can create
> > characters by combining code points
> 
> Total agreement.
> 
> > As Mark Davis pointed out:
> >
> > """In most people's experience, it is best to leave the low level interfaces
> > with indices in terms of code units, then supply some utility routines that
> > tell you information about code points. The most useful are:
> >
> > - given a string and an index into that string, how many code points are
> >   before it?
> > - given a string and a number of code points, what is the lowest index that
> >   contains them?
> 
> I understand the first and the third, but what is this one?  Is it a
> search?

Right. The difference from .find(s) is that it would return a
code point index (which can differ from the code unit index).
A sketch of these routines follows the quoted list below.
 
> > - given a string and an index into that string, is the index on a code point
> >   boundary?
> > """
> >
> > Python could use some more Unicode methods to answer these
> > questions.
> 
> Agreed (see my other post responding to Ton Emerson).
> 
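For concreteness, here is a rough sketch of what those three routines
could look like, written against a plain list of UTF-16 code units.
The helper names are made up purely for illustration; none of this is
an existing Unicode method:

def utf16_units(s):
    """Return the UTF-16 code units of s as a list of ints."""
    data = s.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big")
            for i in range(0, len(data), 2)]

def is_low_surrogate(unit):
    return 0xDC00 <= unit <= 0xDFFF

def is_boundary(units, i):
    """Is code unit index i on a code point boundary?"""
    # Only an index pointing at the second half of a surrogate pair
    # (a low surrogate) is *not* a boundary.
    return i >= len(units) or not is_low_surrogate(units[i])

def code_points_before(units, i):
    """How many code points start before code unit index i?"""
    return sum(1 for unit in units[:i] if not is_low_surrogate(unit))

def index_for_code_points(units, n):
    """Lowest code unit index whose prefix covers n complete code points."""
    count = 0
    for i, unit in enumerate(units):
        if count >= n:
            return i
        if not (0xD800 <= unit <= 0xDBFF):   # a code point ends at this unit
            count += 1
    return len(units)   # whole string if it holds fewer than n code points

# Example: "a" followed by U+10000 followed by "b"
units = utf16_units(u"a\U00010000b")   # [0x61, 0xD800, 0xDC00, 0x62]
code_points_before(units, 3)           # 2 ('a' plus the surrogate pair)
is_boundary(units, 2)                  # False (middle of the pair)
index_for_code_points(units, 2)        # 3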
> > > > Python currently only has minimal support for surrogates, so
> > > > purists would say that we support UCS-2. However, we deliberately
> > > > chose this path to be able to upgrade to UTF-16 at some later
> > > > point in time and it seems that this time has now come.
> > >
> > > How hard would it be to also change the party line about what
> > > encoding is used, based on whether we use 2 or 4 bytes?  We could even
> > > give three choices: UCS-2 (current situation, no surrogates), UTF-16
> > > (16-bit items with some surrogate support) or UCS-4 (32-bit items)?
> >
> > Ehm... what are you getting at here?
> 
> Earlier on you said it would be hard to offer a config-time choice
> between UTF-16 and UCS-4.  I'm still trying to figure out why. 

Here's an example of how this change affects semantics:

u = u"\U00010000"

# UTF-16: stored as a surrogate pair, len(u) == 2
u[0] -> u"\uD800"
u[1] -> u"\uDC00"

# UCS-4: stored as a single code unit, len(u) == 1
u[0] -> u"\U00010000"

> Given
> the additional stuff I've learned now about surrogates, it doesn't
> make sense to choose between UCS-2 and UTF-16; the surrogate handling
> can always be present.

Right.
 
> So let me rephrase the question.  How hard would it be to offer the
> config-time choice between UCS-4 and UTF-16? 

It would mean lots of #ifdefs and a change in semantics.

> If it's hard, why?

It's hard mostly because indexing, sizes and memory management
differ between the two (e.g. dynamic resizing vs. one-time
allocation).

Codecs will have to pay attention to the difference too, since UCS-4
does not need surrogates while UTF-16 requires them.
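To illustrate that last point, here is a minimal sketch of what an
encoder has to do with a code point above U+FFFF under each internal
format (hypothetical helper functions, not actual codec code):

def to_utf16_units(cp):
    """Split a code point into 16-bit storage units (surrogate pair above U+FFFF)."""
    if cp > 0xFFFF:
        cp -= 0x10000
        return [0xD800 + (cp >> 10), 0xDC00 + (cp & 0x3FF)]
    return [cp]

def to_ucs4_units(cp):
    """UCS-4 storage never needs surrogates: one 32-bit unit per code point."""
    return [cp]

print([hex(u) for u in to_utf16_units(0x10000)])   # ['0xd800', '0xdc00']
print([hex(u) for u in to_ucs4_units(0x10000)])    # ['0x10000']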

> (I've heard you say that it's hard before, but I still don't
> understand the problem.)
> 
> > > > > I'd be happy to make the configuration choice between UTF-16 and
> > > > > UCS-4, if that's doable.
> > > >
> > > > Not easily, I'm afraid.
> > >
> > > Can you explain why this is not easy?
> >
> > Because choosing whether or not to support surrogates is a
> > fundamental choice which affects far more than just the way you
> > access storage. Surrogates introduce variable width characters:
> > some characters use two or more Py_UNICODE code units while (most)
> > others only use one.
> >
> > Remember when we discussed which internal format to use or
> > which default encoding to apply? We ruled out UTF-8 because
> > it fails badly when it comes to slicing, concatenation, indexing,
> > etc.
> >
> > UTF-16 is much less painful as most code points only take
> > up a single code unit, but it still introduces a break in concept.
> 
> Hm, it sounds like you have the same problem that I had with Ton
> Emerson's suggestion to support Unicode before he clarified it.

No, I do understand what you mean. The "break in concept" refers
to the different ways you have to deal with variable- and fixed-width
representations internally (as I tried to explain briefly above).
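To recap the UTF-8 problem mentioned above: slicing by byte index can
cut a multi-byte sequence in half. A small sketch of the breakage:

# Code points above U+007F span several bytes in UTF-8, so a byte
# slice can split a character.
s = u"héllo".encode("utf-8")              # b'h\xc3\xa9llo': 6 bytes, 5 characters
print(len(s))                             # 6, not 5
print(s[:2])                              # b'h\xc3': ends in half of the 'é' sequence
print(s[:2].decode("utf-8", "replace"))   # 'h\ufffd': the split byte is garbage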
 
> If we make a clean distinction between characters and storage units,
> and if stick to the rule that u[i] accesses a storage unit, what's the
> conceptual difficulty?  There might be a separate method u.char(i)
> which returns the *character* starting u[i:], or "" if u[i] is a
> low-surrogate.  That could be all we need to support surrogates.  How
> bad is that?  (These could even continue to be supported when the
> storage uses UCS-4; there, u.char(i) would always be u[i], until
> someone comes up with a 64-bit character set. ;-)

Right... that should solve the "problem".
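A rough sketch of what such a u.char(i) could do, written here as a
plain function over a list of UTF-16 code units (only an illustration
of the idea, not an existing method):

def char(units, i):
    """Return the full code point starting at code unit index i,
    or '' if units[i] is the low half of a surrogate pair."""
    unit = units[i]
    if 0xDC00 <= unit <= 0xDFFF:      # low surrogate: no code point starts here
        return ""
    if 0xD800 <= unit <= 0xDBFF:      # high surrogate: combine with the next unit
        low = units[i + 1]
        return chr(0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00))
    return chr(unit)

units = [0x61, 0xD800, 0xDC00]        # "a" followed by U+10000
print(char(units, 1))                 # '\U00010000'
print(repr(char(units, 2)))           # '' -- index 2 is a low surrogate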
 
> > > I buy that as an argument for supporting UTF-16, but not for cutting
> > > off the road to supporting UCS-4 for those users who would like to opt
> > > in.
> >
> > That was not my point. I just wanted to point out how well UTF-16
> > is being accepted out there and that we are in good company by
> > moving from UCS-2 to UTF-16 as current internal format.
> 
> Good!  I agree.
> 
> > I don't want to cut off the road to UCS-4, I just want to make
> > clear that UTF-16 is a good choice and one which will last at
> > least some more years. We can then always decide to move on
> > to UCS-4 for the internal storage format.
> 
> Agreed again.

-- 
Marc-Andre Lemburg
CEO eGenix.com Software GmbH
______________________________________________________________________
Company & Consulting:                           http://www.egenix.com/
Python Software:                        http://www.lemburg.com/python/