[I18n-sig] How does Python Unicode treat surrogates?
Guido van Rossum
guido@digicool.com
Mon, 25 Jun 2001 15:12:31 -0400
> That's because len(u) has nothing to do with the number of
> characters in the string, it only counts the code units (Py_UNICODEs)
> which are used to represent characters. The same is true for normal
> strings, e.g. UTF-8 can use between 1-4 code units (bytes in this
> case) for a single code point, and in Unicode you can create characters
> by combining code units.
Total agreement.
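To make the code unit vs. character distinction concrete, here is a minimal sketch in present-day Python 3, where str is indexed by code point, so the UTF-16 code units have to be derived by encoding. The helper name utf16_units is hypothetical, not an existing API.

```python
def utf16_units(text):
    """Return the UTF-16 code units of text as a list of ints."""
    data = text.encode("utf-16-le")  # little-endian, no byte order mark
    return [int.from_bytes(data[i:i + 2], "little")
            for i in range(0, len(data), 2)]

# 'a' followed by U+10000, the first code point outside the BMP.
s = "a\U00010000"
units = utf16_units(s)           # 0x61, then a high/low surrogate pair
utf8_bytes = s.encode("utf-8")   # 1 byte for 'a', 4 bytes for U+10000
```

A len() that counts storage units would report len(units) for this string, which is one more than the number of characters in it.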
> As Mark Davis pointed out:
>
> """In most people's experience, it is best to leave the low level interfaces
> with indices in terms of code units, then supply some utility routines that
> tell you information about code points. The most useful are:
>
> - given a string and an index into that string, how many code points are
> before it?
> - given a string and a number of code points, what is the lowest index that
> contains them?
I understand the first and the third, but what is this one? Is it a
search?
> - given a string and an index into that string, is the index on a code point
> boundary?
> """
>
> Python could use some more Unicode methods to answer these
> questions.
Agreed (see my other post responding to Tom Emerson).
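For illustration, here is one possible reading of the three utility routines, sketched over a plain list of 16-bit code units. All function names are hypothetical, and treating the second routine as "walk forward n code points and return the index just past them" is an assumption about what Mark Davis meant, not something his text settles.

```python
def is_high(unit):
    """True if unit is a high (leading) surrogate."""
    return 0xD800 <= unit <= 0xDBFF

def is_low(unit):
    """True if unit is a low (trailing) surrogate."""
    return 0xDC00 <= unit <= 0xDFFF

def is_boundary(units, i):
    """True unless index i falls between a high and a low surrogate."""
    if i <= 0 or i >= len(units):
        return True
    return not (is_low(units[i]) and is_high(units[i - 1]))

def code_points_before(units, i):
    """Count the code points that start before index i."""
    return sum(1 for j in range(i) if is_boundary(units, j))

def index_after_code_points(units, n):
    """Return the lowest index i such that units[:i] holds n code points."""
    i = 0
    for _ in range(n):
        if i >= len(units):
            raise ValueError("string has fewer than n code points")
        paired = (is_high(units[i]) and i + 1 < len(units)
                  and is_low(units[i + 1]))
        i += 2 if paired else 1
    return i

# 'a', the surrogate pair for U+10000, then 'b'.
sample = [0x61, 0xD800, 0xDC00, 0x62]
```

In this sketch an unpaired surrogate is counted as a code point of its own; a stricter version might reject it instead.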
> > > Python currently only has minimal support for surrogates, so
> > > purists would say that we support UCS-2. However, we deliberately
> > > chose this path to be able to upgrade to UTF-16 at some later
> > > point in time and it seems that this time has now come.
> >
> > How hard would it be to also change the party line about what the
> > encoding used is based on whether we use 2 or 4 bytes? We could even
> > give three choices: UCS-2 (current situation, no surrogates), UTF-16
> > (16-bit items with some surrogate support) or UCS-4 (32-bit items)?
>
> Ehm... what are you getting at here ?
Earlier on you said it would be hard to offer a config-time choice
between UTF-16 and UCS-4. I'm still trying to figure out why. Given
the additional stuff I've learned now about surrogates, it doesn't
make sense to choose between UCS-2 and UTF-16; the surrogate handling
can always be present.
So let me rephrase the question. How hard would it be to offer the
config-time choice between UCS-4 and UTF-16? If it's hard, why?
(I've heard you say that it's hard before, but I still don't
understand the problem.)
> > > > I'd be happy to make the configuration choice between UTF-16 and
> > > > UCS-4, if that's doable.
> > >
> > > Not easily, I'm afraid.
> >
> > Can you explain why this is not easy?
>
> Because choosing whether or not to support surrogates is a
> fundamental choice which affects far more than just the way you
> access storage. Surrogates introduce variable width characters:
> some characters use two or more Py_UNICODE code units while (most)
> others only use one.
>
> Remember when we discussed which internal format to use or
> which default encoding to apply ? We ruled out UTF-8 because
> it fails badly when it comes to slicing, concatenation, indexing,
> etc.
>
> UTF-16 is much less painful as most code points only take
> up a single code unit, but it still introduces a break in concept.
Hm, it sounds like you have the same problem that I had with Tom
Emerson's suggestion to support Unicode before he clarified it.
If we make a clean distinction between characters and storage units,
and if we stick to the rule that u[i] accesses a storage unit, what's the
conceptual difficulty? There might be a separate method u.char(i)
which returns the *character* starting u[i:], or "" if u[i] is a
low-surrogate. That could be all we need to support surrogates. How
bad is that? (These could even continue to be supported when the
storage uses UCS-4; there, u.char(i) would always be u[i], until
someone comes up with a 64-bit character set. ;-)
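A minimal sketch of what such a method might do, written as a free function over a list of 16-bit storage units. The name char_at and the handling of an unpaired high surrogate are assumptions for illustration, not a proposed API.

```python
def char_at(units, i):
    """Return the character starting at units[i], or "" for a low surrogate."""
    unit = units[i]
    if 0xDC00 <= unit <= 0xDFFF:
        # Second half of a pair: no character starts here.
        return ""
    if (0xD800 <= unit <= 0xDBFF and i + 1 < len(units)
            and 0xDC00 <= units[i + 1] <= 0xDFFF):
        # Combine the surrogate pair into one code point.
        low = units[i + 1]
        return chr(0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00))
    # BMP code unit (an unpaired high surrogate is passed through as-is).
    return chr(unit)

# 'a' followed by the surrogate pair for U+10000.
sample_units = [0x61, 0xD800, 0xDC00]
```

With 32-bit storage units the two surrogate branches would never trigger, and the function would reduce to returning the unit at index i as a character.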
> > I buy that as an argument for supporting UTF-16, but not for cutting
> > off the road to supporting UCS-4 for those users who would like to opt
> > in.
>
> That was not my point. I just wanted to point out how well UTF-16
> is being accepted out there and that we are in good company by
> moving from UCS-2 to UTF-16 as current internal format.
Good! I agree.
> I don't want to cut off the road to UCS-4, I just want to make
> clear that UTF-16 is a good choice and one which will last at
> least some more years. We can then always decide to move on
> to UCS-4 for the internal storage format.
Agreed again.
--Guido van Rossum (home page: http://www.python.org/~guido/)