[I18n-sig] Unicode surrogates: just say no!
Guido van Rossum
guido@digicool.com
Wed, 27 Jun 2001 15:30:19 -0400
> I'm trying to sift through all of the decisions made in different
> messages for the PEP.
Excellent!
> Guido van Rossum wrote:
> >
> >...
> >
> > - unichr(i) for 0x10000 <= i <= 0x10ffff (and hence corresponding \u
> > and \U) generates a surrogate pair, where u[0] is the high
> > surrogate value and u[1] the low surrogate value
>
> Does this imply that ord() should take in surrogate pairs too?
Oooh, hadn't thought of that, but yes, it makes sense!
It's not yet implemented, but I think it should be. That makes for a
nice pair of invariants:
    unichr(ord('\Udddddddd')) == '\Udddddddd'
    ord(unichr(0xdddddddd)) == 0xdddddddd
regardless of whether we're using UCS-2 or UCS-4 storage.
Currently this is broken for 0xdddddddd > 0xffff with UCS-2 storage.
On the other hand, unichr() and ord() should still work for lone
surrogate values as well (even though these are invalid code points).
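
To make the two invariants concrete, here is a minimal sketch of the
surrogate arithmetic involved. It is not the interpreter's code: the
names narrow_unichr and narrow_ord are hypothetical, and a UCS-2 string
is modelled as a tuple of 16-bit code units so the example runs on any
build.

```python
# Hypothetical helpers modelling UCS-2 ("narrow") storage.
# A "string" here is a tuple of 16-bit code units, not a real str.

def narrow_unichr(i):
    """Return the code units for code point i: one unit, or a surrogate pair."""
    if not 0 <= i <= 0x10FFFF:
        raise ValueError("code point out of range")
    if i < 0x10000:
        # Covers the BMP, including lone surrogate values 0xD800..0xDFFF.
        return (i,)
    v = i - 0x10000                      # 20-bit offset above the BMP
    high = 0xD800 + (v >> 10)            # top 10 bits -> high surrogate
    low = 0xDC00 + (v & 0x3FF)           # bottom 10 bits -> low surrogate
    return (high, low)

def narrow_ord(units):
    """Inverse of narrow_unichr: accept one unit or a high/low surrogate pair."""
    if len(units) == 1:
        # A single unit is returned as-is, even if it is a lone surrogate.
        return units[0]
    if len(units) == 2:
        high, low = units
        if 0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF:
            return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
    raise TypeError("expected a single code unit or a surrogate pair")
```

With these definitions, narrow_ord(narrow_unichr(i)) == i holds for
every i in range(0x110000), and a lone surrogate such as 0xD800 passes
through both functions unchanged.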
--Guido van Rossum (home page: http://www.python.org/~guido/)