break unichr instead of fix ord?
rurpy at yahoo.com
Thu Aug 27 07:36:12 CEST 2009
On 08/26/2009 08:52 PM, Steven D'Aprano wrote:
> On Wed, 26 Aug 2009 16:27:33 -0700, rurpy wrote:
>> But regardless, the significant question is, what is the reason for
>> having ord() (and unichr) not work for surrogate pairs and thus not
>> usable with a large number of unicode characters that Python otherwise
> I'm no expert on Unicode, but my guess is that the reason is out of a
> desire for simplicity: unichr() should always return a single char, not a
> pair of chars, and similarly ord() should take as input a single char,
> not two, and return a single number.
> Otherwise it would be ambiguous whether ord(surrogate_pair) should return
> a pair of ints representing the codes for each item in the pair, or a
> single int representing the code point for the whole pair.
> E.g. given your earlier example:
> >>> a = u'\U00010040'
> would you expect ord(a) to return (0xd800, 0xdc40) or 0x10040? If the
> latter, what about ord(u'ab')?
I would expect a TypeError* (as ord() currently raises) because
the string length is not 1 and 'ab' is not a surrogate pair.
*Actually I would have expected ValueError but I'm not going
to lose sleep over it.
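
To make the behavior I have in mind concrete, here is a minimal sketch. The name surrogate_ord is hypothetical (nothing in Python is called that), and the arithmetic is just the standard UTF-16 decoding of a high/low surrogate pair; it is an illustration, not a proposed patch.

```python
def surrogate_ord(s):
    """Hypothetical helper: like ord(), but also accepts a surrogate pair.

    Returns one code point for a single character or for a
    high+low surrogate pair; anything else raises TypeError,
    which is what ord() already does for a string of length 2.
    """
    if len(s) == 1:
        return ord(s)
    if len(s) == 2:
        high, low = ord(s[0]), ord(s[1])
        # A pair is a high surrogate (D800-DBFF) followed by
        # a low surrogate (DC00-DFFF).
        if 0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF:
            return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
    raise TypeError(
        "expected a character or a surrogate pair, "
        "but string of length %d found" % len(s)
    )


print(hex(surrogate_ord(u'\ud800\udc40')))  # the pair from the example
print(hex(surrogate_ord(u'a')))
```

With this, u'ab' still raises TypeError because it is neither a single character nor a surrogate pair, so the ambiguity never arises.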
> Remember that a unicode string can contain code points that aren't valid
> >>> ord(u'\ud800')  # reserved for surrogates, not a character
> so if ord() sees a surrogate pair, it can't assume it's meant to be
> treated as a surrogate pair rather than a pair of code points that just
> happens to match a surrogate pair.
Well, actually, yes it can. :-)
Python has already made a strong statement that such a pair
is the representation of a character:
>>> a = ''.join([u'\ud800',u'\udc40'])
That is, Python prints, and treats in nearly all other contexts,
that combination as a character.
This is related to the practicality argument: what is the ratio
of the need to treat a surrogate pair as a character, consistent
with the rest of Python, vs. the need to treat it as a string
of two separate (and invalid in the Unicode sense?) characters?
And if you want to treat each half of the pair separately
it's not exactly hard: ord(a[0]), ord(a[1]).
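
For the reverse direction, a rough sketch of what a pair-producing unichr() could look like. The name surrogate_unichr is hypothetical, and the fallback at the top is only an assumption to let the same snippet run where unichr() does not exist; the two halves are then available by plain indexing.

```python
try:
    unichr  # Python 2
except NameError:
    unichr = chr  # assumption: on Python 3, chr() plays the same role


def surrogate_unichr(code_point):
    """Hypothetical helper: build the surrogate pair for a code point.

    Only handles code points above 0xFFFF, which are the ones that
    need two UTF-16 code units.
    """
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point not in range(0x10000, 0x110000)")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits
    return unichr(high) + unichr(low)


pair = surrogate_unichr(0x10040)
# Each half is still reachable on its own when that is what you want.
halves = (ord(pair[0]), ord(pair[1]))
print([hex(h) for h in halves])
```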
> None of this means you can't deal with surrogate pairs, it just means you
> can't deal with them using ord() and unichr().
Kind of like saying, it doesn't mean you can't deal
with integers larger than 2**32, you just can't multiply
and divide them.
> The above is just my guess, I'd be interested to hear what others say.