break unichr instead of fix ord?

rurpy at rurpy at
Tue Aug 25 21:45:49 CEST 2009

In Python 2.5 on Windows I could do [*1]:

  # Create a Unicode character outside of the BMP.
  >>> a = u'\U00010040'

  # On Windows it is represented as a surrogate pair.
  >>> len(a)
  2
  >>> a[0],a[1]
  (u'\ud800', u'\udc40')

  # Create the same character with the unichr() function.
  >>> a = unichr (65600)
  >>> a[0],a[1]
  (u'\ud800', u'\udc40')

  # Although the unichr() function works fine, its
  # inverse, ord(), doesn't.
  >>> ord (a)
  TypeError: ord() expected a character, but string of length 2 found

On Python 2.6, unichr() was "fixed" (using the word
loosely) so that it too now fails with characters outside
the BMP.

  >>> a = unichr (65600)
  ValueError: unichr() arg not in range(0x10000) (narrow Python build)

Why was this done rather than changing ord() to accept a
surrogate pair?
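
To make the question concrete, here is a rough sketch of the
arithmetic a surrogate-aware ord() would need. The names
to_surrogates(), from_surrogates() and wide_ord() are made up
for illustration; they are not part of Python, and this is only
one way it could be written.

```python
def to_surrogates(cp):
    """Split a code point above the BMP into (high, low) UTF-16 code units."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("code point not in range(0x10000, 0x110000)")
    offset = cp - 0x10000
    # Top 10 bits go into the high surrogate, bottom 10 into the low one.
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


def from_surrogates(high, low):
    """Combine a (high, low) surrogate pair back into one code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


def wide_ord(s):
    """Hypothetical ord() that also accepts a two-unit surrogate pair."""
    if len(s) == 2:
        return from_surrogates(ord(s[0]), ord(s[1]))
    return ord(s)  # ordinary single-character case
```

With these, to_surrogates(65600) gives (0xD800, 0xDC40), the
same pair shown above, and wide_ord() applied to that pair
gives back 65600.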

Doesn't this effectively make unichr() and ord() useless
on Windows for all but a subset of Unicode characters?
