break unichr instead of fix ord?
rurpy at yahoo.com
Wed Aug 26 19:29:34 EDT 2009
On Aug 25, 9:53 pm, "Mark Tolonen" <metolone+gm... at gmail.com> wrote:
> <ru... at yahoo.com> wrote in message
>
> news:2ad21a79-4a6c-42a7-8923-beb304bb5e99 at v20g2000yqm.googlegroups.com...
>
>
>
> > In Python 2.5 on Windows I could do [*1]:
>
> > # Create a unicode character outside of the BMP.
> > >>> a = u'\U00010040'
>
> > # On Windows it is represented as a surrogate pair.
> > >>> len(a)
> > 2
> > >>> a[0],a[1]
> > (u'\ud800', u'\udc40')
>
> > # Create the same character with the unichr() function.
> > >>> a = unichr (65600)
> > >>> a[0],a[1]
> > (u'\ud800', u'\udc40')
>
> > # Although the unichr() function works fine, its
> > # inverse, ord(), doesn't.
> > >>> ord (a)
> > TypeError: ord() expected a character, but string of length 2 found
>
> > On Python 2.6, unichr() was "fixed" (using the word
> > loosely) so that it too now fails with characters outside
> > the BMP.
>
> > >>> a = unichr (65600)
> > ValueError: unichr() arg not in range(0x10000) (narrow Python build)
>
> > Why was this done rather than changing ord() to accept a
> > surrogate pair?
>
> > Does not this effectively make unichr() and ord() useless
> > on Windows for all but a subset of unicode characters?
>
> Switch to Python 3?
>
> >>> x='\U00010040'
> >>> import unicodedata
> >>> unicodedata.name(x)
> 'LINEAR B SYLLABLE B025 A2'
> >>> ord(x)
> 65600
> >>> hex(ord(x))
> '0x10040'
> >>> unicodedata.name(chr(0x10040))
> 'LINEAR B SYLLABLE B025 A2'
> >>> ord(chr(0x10040))
> 65600
> >>> print(ascii(chr(0x10040)))
> '\ud800\udc40'
>
> -Mark
I am still a long way from moving to Python 3,
but I am looking forward to what I hope will be more
rational Unicode handling there. Thanks for the info.
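
For anyone reading this who is stuck on a narrow build in the meantime, the surrogate-pair arithmetic behind the examples above can be sketched by hand. This is only an illustration under stated assumptions: the names wide_ord, surrogates_to_codepoint and codepoint_to_surrogates are hypothetical helpers made up for this sketch, not functions from the standard library, and wide_ord assumes its argument is either a single character or a two-character surrogate pair.

```python
def surrogates_to_codepoint(high, low):
    """Combine a UTF-16 surrogate pair (two integers) into one code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


def codepoint_to_surrogates(cp):
    """Split a code point outside the BMP into (high, low) surrogates."""
    if not (0x10000 <= cp <= 0x10FFFF):
        raise ValueError("code point is not outside the BMP")
    offset = cp - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


def wide_ord(s):
    """Hypothetical ord() replacement that also accepts a surrogate pair."""
    if len(s) == 2:
        return surrogates_to_codepoint(ord(s[0]), ord(s[1]))
    # Length-1 strings go straight to the built-in, which also
    # raises the usual TypeError for anything else.
    return ord(s)


# The character from the thread: U+10040 <-> (0xD800, 0xDC40)
print(codepoint_to_surrogates(0x10040) == (0xD800, 0xDC40))
print(wide_ord(u'\ud800\udc40') == 65600)
```

Both print calls should show True, matching the pair (u'\ud800', u'\udc40') and the value 65600 seen in the sessions above.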