[I18n-sig] Unicode surrogates: just say no!

Guido van Rossum guido@digicool.com
Wed, 27 Jun 2001 20:50:37 -0400


> OK. I take (most of) your point on consistency between unichr() and ord().
> 
> However there is a practical problem with ord(surrogate_pair) on a
> narrow Python. 
> 
> ord('\x01') -> 1
> ord('\x01\x02') -> exception
> ord(u'\u0001') -> 1
> ord(u'\u0001\u0002') -> exception
> ord(u'\ud800\udc00') -> 0x10000 # magic!
> 
> so either 
> (a) a programmer wanting to write (say) the
> conversion tool that you mentioned still has to work very hard
> or (b) we redefine ord() so that the arg may also be a Unicode 
> string, and it returns the ordinal of the first character (which may involve
> two code units)
> or (c) we provide some other functionality for unpacking Unicode strings
> into ints

Yes, the longer I think about this the less I like it.  Unfortunately,
the surrogate-creating behavior of \U is present in 2.0 and 2.1, so I
think we can't reasonably remove this from narrow Python 2.2, and I
like the rule that unichr() and \U match.  But maybe that rule is the
one that should go, and unichr() and ord() should deal with single
code points only.
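
For illustration only, option (c) from the quoted message could look
roughly like the sketch below.  The name code_points() is hypothetical,
not an existing or proposed API.  It assumes the input is a string of
UTF-16-style code units, as on a narrow build, and it is written so
that it also runs on a current Python, where a string can still hold
individual surrogates.  ord() is only ever applied to a single unit.

```python
def code_points(units):
    """Yield integer code points from a string of 16-bit code units.

    Hypothetical helper: a high surrogate followed by a low surrogate
    is combined into one supplementary code point; anything else,
    including a lone surrogate, is passed through unchanged.
    """
    i = 0
    n = len(units)
    while i < n:
        hi = ord(units[i])  # ord() sees exactly one unit here
        if 0xD800 <= hi <= 0xDBFF and i + 1 < n:
            lo = ord(units[i + 1])
            if 0xDC00 <= lo <= 0xDFFF:
                # Standard UTF-16 surrogate pair arithmetic.
                yield 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00)
                i += 2
                continue
        yield hi
        i += 1


print(list(code_points(u'\u0001\u0002')))   # [1, 2]
print(list(code_points(u'\ud800\udc00')))   # [65536]
```

With something like this available, the conversion tool would not need
any magic in ord() itself.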

Then sys.maxunicode should be the largest value that unichr() will
accept.  This could be 0xffff (narrow Python), 0x10ffff (wide Python
with strict unichr()), or 0xffffffffL (wide Python with liberal
unichr()).  The last of these is an open PEP issue.
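
As a rough sketch of how a program might use this, the hypothetical
helper below maps the three candidate values of sys.maxunicode to the
build flavors named above.  The function name and the labels are
assumptions made for illustration, and the third case only exists if
the open PEP issue is resolved in favor of a liberal unichr().

```python
import sys


def describe_unichr_range(maxunicode):
    """Return a label for the three candidate sys.maxunicode values.

    Hypothetical helper; the labels are illustrative only.
    """
    if maxunicode == 0xFFFF:
        return "narrow"
    if maxunicode == 0x10FFFF:
        return "wide, strict unichr()"
    if maxunicode == 0xFFFFFFFF:
        # Only possible if the liberal variant is adopted.
        return "wide, liberal unichr()"
    return "unknown"


print(describe_unichr_range(sys.maxunicode))
```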

--Guido van Rossum (home page: http://www.python.org/~guido/)