break unichr instead of fix ord?

Wed Aug 26 19:29:34 EDT 2009

On Aug 25, 9:53 pm, "Mark Tolonen" <metolone+gm... at gmail.com> wrote:
> <ru... at yahoo.com> wrote in message
>
> news:2ad21a79-4a6c-42a7-8923-beb304bb5e99 at v20g2000yqm.googlegroups.com...
>
>
>
> > In Python 2.5 on Windows I could do [*1]:
>
> >  # Create a unicode character outside of the BMP.
> >  >>> a = u'\U00010040'
>
> >  # On Windows it is represented as a surrogate pair.
> >  >>> len(a)
> >  2
> >  >>> a[0],a[1]
> >  (u'\ud800', u'\udc40')
>
> >  # Create the same character with the unichr() function.
> >  >>> a = unichr (65600)
> >  >>> a[0],a[1]
> >  (u'\ud800', u'\udc40')
>
> >  # Although the unichr() function works fine, its
> >  # inverse, ord(), doesn't.
> >  >>> ord (a)
> >  TypeError: ord() expected a character, but string of length 2 found
>
> > On Python 2.6, unichr() was "fixed" (using the word
> > loosely) so that it too now fails with characters outside
> > the BMP.
>
> >  >>> a = unichr (65600)
> >  ValueError: unichr() arg not in range(0x10000) (narrow Python build)
>
> > Why was this done rather than changing ord() to accept a
> > surrogate pair?
>
> > Doesn't this effectively make unichr() and ord() useless
> > on Windows for all but a subset of Unicode characters?
>
> Switch to Python 3?
>
> >>> x='\U00010040'
> >>> import unicodedata
> >>> unicodedata.name(x)
> 'LINEAR B SYLLABLE B025 A2'
> >>> ord(x)
> 65600
> >>> hex(ord(x))
> '0x10040'
> >>> unicodedata.name(chr(0x10040))
> 'LINEAR B SYLLABLE B025 A2'
> >>> ord(chr(0x10040))
> 65600
> >>> print(ascii(chr(0x10040)))
> '\ud800\udc40'
>
> -Mark

I am still a long way from moving to Python 3,
but I look forward to what I hope will be more rational
Unicode handling there.  Thanks for the info.
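
In the meantime, the arithmetic that ord() would need in order to
accept a surrogate pair (and that unichr() needs to produce one) is
simple enough to do by hand.  Below is a minimal sketch, assuming the
two UTF-16 code units are passed around as plain integers so that it
does not depend on a narrow or wide build.  The function names are my
own invention for illustration, not part of any library.

```python
def surrogates_to_codepoint(high, low):
    """Combine a UTF-16 high/low surrogate pair (ints) into one code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    # Each surrogate carries 10 bits of the offset above 0x10000.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


def codepoint_to_surrogates(cp):
    """Split a code point outside the BMP into a (high, low) surrogate pair."""
    if not (0x10000 <= cp <= 0x10FFFF):
        raise ValueError("code point is not outside the BMP")
    offset = cp - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


# The character from the example above: U+10040 <-> (0xD800, 0xDC40)
print(hex(surrogates_to_codepoint(0xD800, 0xDC40)))
print([hex(u) for u in codepoint_to_surrogates(65600)])
```

On a narrow build one could feed this ord(a[0]) and ord(a[1]) for a
two-unit string, and build the string back with unichr() on each half
of the returned pair.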