break unichr instead of fix ord?

Thu Aug 27 01:36:12 EDT 2009

On 08/26/2009 08:52 PM, Steven D'Aprano wrote:
> On Wed, 26 Aug 2009 16:27:33 -0700, rurpy wrote:
>
>>  But regardless, the significant question is, what is the reason for
>>  having ord() (and unichr) not work for surrogate pairs and thus not
>>  usable with a large number of unicode characters that Python otherwise
>>  supports?
>
>
> I'm no expert on Unicode, but my guess is that the reason is out of a
> desire for simplicity: unichr() should always return a single char, not a
> pair of chars, and similarly ord() should take as input a single char,
> not two, and return a single number.
>
> Otherwise it would be ambiguous whether ord(surrogate_pair) should return
> a pair of ints representing the codes for each item in the pair, or a
> single int representing the code point for the whole pair.
>
> E.g. given your earlier example:
>
>>>>  a = u'\U00010040'
>>>>  len(a)
> 2
>>>>  a[0]
> u'\ud800'
>>>>  a[1]
> u'\udc40'
>
> would you expect ord(a) to return (0xd800, 0xdc40) or 0x10040?

The latter.

> If the
> latter, what about ord(u'ab')?

I would expect a TypeError* (as ord() currently raises) because
the string length is not 1 and 'ab' is not a surrogate pair.

*Actually I would have expected ValueError but I'm not going
to lose sleep over it.
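
A minimal sketch of that behavior, assuming a hypothetical
helper named wide_ord (not part of Python, and only an
illustration of the proposal, not a patch to ord() itself).
It accepts either a single character or exactly one
high/low surrogate pair and defers to ord() for everything
else:

```python
def wide_ord(s):
    """Hypothetical helper: like ord(), but also accepts one surrogate pair."""
    if len(s) == 2:
        hi, lo = ord(s[0]), ord(s[1])
        # A valid pair is a high surrogate followed by a low surrogate.
        if 0xD800 <= hi <= 0xDBFF and 0xDC00 <= lo <= 0xDFFF:
            return 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00)
    # Anything else goes to the built-in, which raises TypeError
    # when the string length is not 1.
    return ord(s)
```

With this, wide_ord(u'\ud800\udc40') gives 0x10040, while
wide_ord(u'ab') still raises TypeError from the built-in.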

> Remember that a unicode string can contain code points that aren't valid
> characters:
>
>>>>  ord(u'\ud800')  # reserved for surrogates, not a character
> 55296
>
> so if ord() sees a surrogate pair, it can't assume it's meant to be
> treated as a surrogate pair rather than a pair of code points that just
> happens to match a surrogate pair.

Well, actually, yes it can.  :-)

Python has already made a strong statement that such a pair
is the representation of a character:

>>> a = ''.join([u'\ud800',u'\udc40'])
>>> a
u'\U00010040'

That is, Python prints, and treats in nearly all other contexts,
that combination as a character.

This is related to the practicality argument: what is the ratio
of the need to treat a surrogate pair as a character, consistent
with the rest of Python, vs the need to treat it as a string
of two separate (and invalid in the unicode sense?) characters?

And if you want to treat each half of the pair separately
it's not exactly hard:  ord(a[0]), ord(a[1]).

> None of this means you can't deal with surrogate pairs, it just means you
> can't deal with them using ord() and unichr().

Kind of like saying, it doesn't mean you can't deal
with integers larger than 2**32, you just can't multiply
and divide them.
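
For the unichr() side, the same idea can be sketched with a
hypothetical helper named wide_unichr (again not part of
Python). It assumes the desired result for a code point above
0xFFFF is the two-unit surrogate pair, and it falls back to
chr where unichr does not exist:

```python
try:
    unichr              # present on Python 2
except NameError:
    unichr = chr        # assumption: fall back to chr where unichr is missing

def wide_unichr(cp):
    """Hypothetical helper: return a surrogate pair for cp above 0xFFFF."""
    if cp <= 0xFFFF:
        return unichr(cp)
    if cp > 0x10FFFF:
        raise ValueError("code point out of range: %r" % (cp,))
    offset = cp - 0x10000
    # High surrogate carries the top 10 bits, low surrogate the bottom 10.
    return unichr(0xD800 + (offset >> 10)) + unichr(0xDC00 + (offset & 0x3FF))
```

Here wide_unichr(0x10040) returns the two code units
u'\ud800' and u'\udc40', matching the example above.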

> The above is just my guess, I'd be interested to hear what others say.