break unichr instead of fix ord?
Steven D'Aprano
steve at REMOVE-THIS-cybersource.com.au
Wed Aug 26 22:52:41 EDT 2009
On Wed, 26 Aug 2009 16:27:33 -0700, rurpy wrote:
> But regardless, the significant question is, what is the reason for
> having ord() (and unichr) not work for surrogate pairs and thus not
> usable with a large number of unicode characters that Python otherwise
> supports?
I'm no expert on Unicode, but my guess is that the reason is a desire
for simplicity: unichr() should always return a single char, not a pair
of chars, and similarly ord() should take a single char as input, not
two, and return a single number.
Otherwise it would be ambiguous whether ord(surrogate_pair) should return
a pair of ints representing the codes for each item in the pair, or a
single int representing the code point for the whole pair.
E.g. given your earlier example:
>>> a = u'\U00010040'
>>> len(a)
2
>>> a[0]
u'\ud800'
>>> a[1]
u'\udc40'
would you expect ord(a) to return (0xd800, 0xdc40) or 0x10040? If the
latter, what about ord(u'ab')?
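To make the two candidate answers concrete, here is a minimal sketch. It
assumes Python 3.4 or later, where a str holds whole code points, so the
session above (which comes from a narrow Python 2 build) can't be
reproduced as-is. utf16_code_units is a hypothetical helper written for
this example, not a built-in.

```python
import struct


def utf16_code_units(s):
    """Return the UTF-16 code units of s as a tuple of ints.

    Hypothetical helper: it mimics what indexing a narrow-build
    unicode string would expose, one 16-bit unit per item.
    """
    # 'surrogatepass' lets lone surrogates through instead of raising.
    data = s.encode("utf-16-be", "surrogatepass")
    return struct.unpack(">%dH" % (len(data) // 2), data)


a = "\U00010040"

# Reading 1: one int per code unit, i.e. the surrogate pair.
units = utf16_code_units(a)
print([hex(u) for u in units])   # ['0xd800', '0xdc40']

# Reading 2: one int for the whole character.
print(hex(ord(a)))               # 0x10040

# For a two-character string only the first reading makes sense.
print(utf16_code_units("ab"))    # (97, 98)
```

The helper always returns a tuple, which is one way to sidestep the
ambiguity: a function that sometimes returns one int and sometimes two
is hard to use safely.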
Remember that a unicode string can contain code points that aren't valid
characters:
>>> ord(u'\ud800') # reserved for surrogates, not a character
55296
so if ord() sees what looks like a surrogate pair, it can't assume it's
meant to be treated as one character rather than as two separate code
points that just happen to fall in the surrogate ranges.
None of this means you can't deal with surrogate pairs; it just means you
can't deal with them using ord() and unichr().
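For illustration, the surrogate arithmetic can be done by hand on plain
integers. This is a rough sketch rather than how Python does it
internally; combine_surrogates and split_code_point are made-up names,
and the constants come from the UTF-16 encoding rules.

```python
def combine_surrogates(high, low):
    """Combine a high/low surrogate pair into a single code point."""
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError("not a high surrogate: %#x" % high)
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError("not a low surrogate: %#x" % low)
    # Each surrogate carries 10 bits of the offset above 0x10000.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


def split_code_point(cp):
    """Split a code point above the BMP into (high, low) surrogates."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("code point outside U+10000..U+10FFFF: %#x" % cp)
    offset = cp - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


print(hex(combine_surrogates(0xD800, 0xDC40)))       # 0x10040
print([hex(u) for u in split_code_point(0x10040)])   # ['0xd800', '0xdc40']
```

Both functions reject out-of-range input instead of guessing, which
mirrors the point above: the caller has to say explicitly that two
values are meant as a pair.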
The above is just my guess, I'd be interested to hear what others say.
--
Steven