[I18n-sig] Unicode surrogates: just say no!

Machin, John JMachin@Colonial.com.au
Thu, 28 Jun 2001 10:05:39 +1000

OK. I take (most of) your point on consistency between unichr() and ord().

However, there is a practical problem with ord(surrogate_pair) on a
narrow Python:

ord('\x01') -> 1
ord('\x01\x02') -> exception
ord(u'\u0001') -> 1
ord(u'\u0001\u0002') -> exception
ord(u'\ud800\udc00') -> 0x10000 # magic!

so either
(a) a programmer wanting to write (say) the conversion tool that you
mentioned still has to work very hard,
or (b) we redefine ord() so that the arg may also be a longer Unicode
string, and it returns the ordinal of the first character (which may
involve two code units),
or (c) we provide some other functionality for unpacking Unicode strings
into ints.
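
To make option (c) concrete, here is a minimal sketch of such an unpacking
helper. The name code_points() is hypothetical, not an existing or proposed
builtin, and the code is written for a current Python 3 (where a str may
still hold individual surrogate code units) rather than for a 2001 narrow
build. It assumes unpaired surrogates should simply be passed through.

```python
def code_points(text):
    """Yield integer code points, joining each valid surrogate pair into one value.

    Hypothetical helper for illustration only; not part of Python.
    """
    i = 0
    n = len(text)
    while i < n:
        unit = ord(text[i])
        # A lead (high) surrogate followed by a trail (low) surrogate
        # together encode one code point at or above 0x10000.
        if 0xD800 <= unit <= 0xDBFF and i + 1 < n:
            trail = ord(text[i + 1])
            if 0xDC00 <= trail <= 0xDFFF:
                yield 0x10000 + ((unit - 0xD800) << 10) + (trail - 0xDC00)
                i += 2
                continue
        # Everything else, including an unpaired surrogate, is yielded as-is.
        yield unit
        i += 1
```

With something like this, ord() itself could stay strict about taking
exactly one element, and list(code_points(s)) would do the unpacking.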

-----Original Message-----
From: Guido van Rossum [mailto:guido@digicool.com]
Sent: Thursday, 28 June 2001 9:38
To: Machin, John
Cc: i18n-sig@python.org
Subject: Re: [I18n-sig] Unicode surrogates: just say no!

> Guido said:
>    But on a narrow interpreter, that's a valid surrogate pair, so it's a
>    single character, so ord() *should* return 0x10000 for this example.
> IMO, once you say that a "valid surrogate pair" is a "single
> character" in a narrow implementation, people will want to do
> the indexing / slicing /dicing thing as well. ord() is just the 
> thin end of the wedge.
> "No" should mean "no".
> unichr() and ord() should be inverses *only*
> in respect of scalar values up to sys.maxunicode.

Your position is weakened by inconsistency.  If you really wanted to
be consistent, you should argue against \U and unichr() with ordinals
>= 0x10000 on narrow Pythons. :-)

IMO ord() and unichr() are so closely tied that either both of them
should support surrogate pairs, or neither.  You know my position.  It's
not usable as a wedge to get the indexing/slicing/dicing, because the
implementation would be too complicated, and we have the wide Python
as a mighty weapon.

BTW, I quoted Paul:

> >     * ord() will now accept surrogate pairs and return the ordinal of
> >       the "wide" character. Open question: should it accept surrogate
> >       pairs on wide Python builds?

and replied:

> After thinking about it, I think it should.  Apps that are written
> specifically to handle surrogates (e.g. a conversion tool to remove
> surrogates!) should work on wide interpreters, and ord() is the only
> way to get the character value from a surrogate pair (short of
> implementing the shifts and masks yourself, which is doable but a
> pain).
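
For reference, the shifts and masks mentioned in the quoted reply look
roughly like this. It is only a sketch for a current Python 3, working on
plain ints; the helper names join_surrogates() and split_to_surrogates()
are hypothetical and not part of Python.

```python
def join_surrogates(lead, trail):
    """Combine a lead/trail surrogate pair (given as ints) into one code point."""
    if not (0xD800 <= lead <= 0xDBFF and 0xDC00 <= trail <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    # Each surrogate carries 10 payload bits; the lead holds the upper 10.
    return 0x10000 + (((lead & 0x3FF) << 10) | (trail & 0x3FF))


def split_to_surrogates(code_point):
    """Split a code point above 0xFFFF into its (lead, trail) surrogate ints."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is outside the supplementary range")
    offset = code_point - 0x10000  # a 20-bit value
    return 0xD800 | (offset >> 10), 0xDC00 | (offset & 0x3FF)
```

The two functions are inverses over 0x10000..0x10FFFF, which is the same
pairing that ord() and unichr() would need on a narrow build.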

I take that back.  On wide Pythons, unichr() doesn't return surrogates
either.  Once the whole world uses UCS-4 (around the time Python 3000
is released :-), surrogates can be deprecated anyway.

--Guido van Rossum (home page: http://www.python.org/~guido/)
