[I18n-sig] How does Python Unicode treat surrogates?

Paul Prescod paulp@ActiveState.com
Mon, 25 Jun 2001 12:41:15 -0700

Guido van Rossum wrote:
> If we make a clean distinction between characters and storage units,
> and if we stick to the rule that u[i] accesses a storage unit, what's
> the conceptual difficulty?  There might be a separate method u.char(i)
> which returns the *character* starting u[i:], or "" if u[i] is a
> low-surrogate.

Are you saying that having u[i] return the i'th character (code point)
of 'u' is not going to be provided at all?
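
To make the storage-unit/character distinction concrete, here is a
minimal sketch of what such a method might do. It is written for
Python 3 and simulates UTF-16 storage as a list of integers. The names
utf16_units and char_at are hypothetical, made up for illustration;
nothing like u.char(i) exists in the actual string API.

```python
import struct

def utf16_units(s):
    """Return the UTF-16 code units of s as a list of ints (simulated storage)."""
    data = s.encode("utf-16-le", "surrogatepass")
    return list(struct.unpack("<%dH" % (len(data) // 2), data))

def char_at(units, i):
    """Hypothetical u.char(i): the character starting at units[i],
    or "" if units[i] is a low surrogate."""
    unit = units[i]
    if 0xDC00 <= unit <= 0xDFFF:
        return ""  # low surrogate: no character starts here
    is_high = 0xD800 <= unit <= 0xDBFF
    if is_high and i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
        # Combine the surrogate pair into one code point.
        low = units[i + 1]
        return chr(0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00))
    return chr(unit)  # BMP character (or an unpaired high surrogate)

units = utf16_units("a\U00010400b")   # four storage units, three characters
print([hex(x) for x in units])        # ['0x61', '0xd801', '0xdc00', '0x62']
print(char_at(units, 1) == "\U00010400")  # True
print(repr(char_at(units, 2)))            # ''
```

Note that with this scheme the caller still has to know which indices
are character boundaries; u[i] itself never returns a whole
supplementary character.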

> That could be all we need to support surrogates.  How
> bad is that?  (These could even continue to be supported when the
> storage uses UCS-4; there, u.char(i) would always be u[i], until
> someone comes up with a 64-bit character set. ;-)

So the same input will have a different behavior based on the fact that
we upgraded our internal representation? :(
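
A rough sketch of what I mean, assuming we simulate the two internal
representations by encoding the same string as UTF-16 and as UTF-32
(the helper code_units is hypothetical, written only for this
example). If u[i] indexes storage units, the same input gives
different lengths and different elements:

```python
import struct

def code_units(s, encoding, fmt):
    """Encode s and return its fixed-width storage units as a list of ints.

    fmt is a struct format character: "H" for 16-bit, "I" for 32-bit units.
    """
    data = s.encode(encoding)
    count = len(data) // struct.calcsize("<" + fmt)
    return list(struct.unpack("<%d%s" % (count, fmt), data))

s = "a\U00010400b"  # one character outside the Basic Multilingual Plane

utf16 = code_units(s, "utf-16-le", "H")
utf32 = code_units(s, "utf-32-le", "I")

print(len(utf16), len(utf32))          # 4 3
print(hex(utf16[1]), hex(utf32[1]))    # 0xd801 0x10400
print(hex(utf16[2]), hex(utf32[2]))    # 0xdc00 0x62
```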

This strikes me as an int/long issue. I'd rather we design in terms of
the logical construct: "arbitrary-sized mathematical integer", "Unicode
code point" rather than the implementation detail: "32-bit 2's
complement integer", "UTF-16 code unit."

Take a recipe. Leave a recipe.  
Python Cookbook!  http://www.ActiveState.com/pythoncookbook