break unichr instead of fix ord?
Mark Tolonen
metolone+gmane at gmail.com
Tue Aug 25 23:53:52 EDT 2009
<rurpy at yahoo.com> wrote in message
news:2ad21a79-4a6c-42a7-8923-beb304bb5e99 at v20g2000yqm.googlegroups.com...
> In Python 2.5 on Windows I could do [*1]:
>
> # Create a unicode character outside of the BMP.
> >>> a = u'\U00010040'
>
> # On Windows it is represented as a surrogate pair.
> >>> len(a)
> 2
> >>> a[0],a[1]
> (u'\ud800', u'\udc40')
>
> # Create the same character with the unichr() function.
> >>> a = unichr (65600)
> >>> a[0],a[1]
> (u'\ud800', u'\udc40')
>
> # Although the unichr() function works fine, its
> # inverse, ord(), doesn't.
> >>> ord (a)
> TypeError: ord() expected a character, but string of length 2 found
>
> On Python 2.6, unichr() was "fixed" (using the word
> loosely) so that it too now fails with characters outside
> the BMP.
>
> >>> a = unichr (65600)
> ValueError: unichr() arg not in range(0x10000) (narrow Python build)
>
> Why was this done rather than changing ord() to accept a
> surrogate pair?
>
> Doesn't this effectively make unichr() and ord() useless
> on Windows for all but a subset of unicode characters?
Switch to Python 3?
>>> x='\U00010040'
>>> import unicodedata
>>> unicodedata.name(x)
'LINEAR B SYLLABLE B025 A2'
>>> ord(x)
65600
>>> hex(ord(x))
'0x10040'
>>> unicodedata.name(chr(0x10040))
'LINEAR B SYLLABLE B025 A2'
>>> ord(chr(0x10040))
65600
>>> print(ascii(chr(0x10040)))
'\ud800\udc40'
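
If switching is not an option, the surrogate arithmetic is simple enough to do by hand. Below is a minimal sketch of the standard UTF-16 surrogate-pair calculation. The helper names pair_to_code_point and code_point_to_pair are made up for illustration and are not part of any Python API. They work on plain integers so they behave the same on narrow and wide builds.

```python
# Hypothetical helpers (not standard library functions) showing the
# UTF-16 surrogate-pair arithmetic on plain integers.

def pair_to_code_point(high, low):
    """Combine a high and a low surrogate into a single code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


def code_point_to_pair(cp):
    """Split a code point outside the BMP into (high, low) surrogates."""
    if not (0x10000 <= cp <= 0x10FFFF):
        raise ValueError("code point is not outside the BMP")
    offset = cp - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


# The character from the example above, U+10040:
print(pair_to_code_point(0xD800, 0xDC40))                 # 65600
print([hex(n) for n in code_point_to_pair(0x10040)])      # ['0xd800', '0xdc40']
```

On a narrow build you would feed these the ord() of each half of the two-character string. On a wide build, or on Python 3.3 and later, none of this is needed because ord() and chr() already handle the full range.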
-Mark