[I18n-sig] Unicode surrogates: just say no!

Guido van Rossum guido@digicool.com
Wed, 27 Jun 2001 15:30:19 -0400

> I'm trying to sift through all of the decisions made in different
> messages for the PEP.


> Guido van Rossum wrote:
> > 
> >...
> > 
> >   - unichr(i) for 0x10000 <= i <= 0x10ffff (and hence corresponding \u
> >     and \U) generates a surrogate pair, where u[0] is the high
> >     surrogate value and u[1] the low surrogate value
> 
> Does this imply that ord() should take in surrogate pairs too?

Oooh, hadn't thought of that, but yes, it makes sense!

Not yet implemented, but I think it should be.  It makes for a nice
pair of invariants:

  unichr(ord(u'\Udddddddd')) == u'\Udddddddd'
  ord(unichr(0xdddddddd)) == 0xdddddddd

regardless of whether we're using UCS-2 or UCS-4 storage.

Currently this is broken for 0xdddddddd > 0xffff with UCS-2 storage.

On the other hand, unichr() and ord() should still work for lone
surrogate values as well (even though these are invalid code points).

--Guido van Rossum (home page: http://www.python.org/~guido/)