Python's handling of unicode surrogates

Adam Olsen rhamph at gmail.com
Fri Apr 20 01:34:42 CEST 2007


As was seen in another thread[1], there's a great deal of confusion
with regard to surrogates.  Most programmers assume Python's unicode
type exposes only complete characters.  Even CPython's own functions
do this on occasion.  This leads to different behaviour across
platforms and makes it unnecessarily difficult to properly support all
languages.

To solve this, I propose that Python's unicode type, when stored as
UTF-16, should have gaps in its index so that it exposes only complete
unicode scalar values.  Iteration would produce surrogate pairs rather
than individual surrogates.  Indexing the first half of a surrogate
pair would produce the entire pair, while indexing the second half
would raise IndexError.  Slicing would not be allowed to separate a
surrogate pair (it would raise IndexError otherwise).
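
The proposed indexing rules can be sketched in a few lines.  This is
only an illustration, not a proposed implementation: GapIndexedString
and its helpers are hypothetical names, and because current Python 3
strings are no longer stored as UTF-16, the sketch emulates UTF-16
storage with a list of code units.

```python
import struct


def to_utf16_units(text):
    """Return the UTF-16 code units of a str as a list of ints."""
    data = text.encode('utf-16-le', 'surrogatepass')
    return list(struct.unpack('<%dH' % (len(data) // 2), data))


def from_utf16_units(units):
    """Turn a list of UTF-16 code units back into a str."""
    data = struct.pack('<%dH' % len(units), *units)
    return data.decode('utf-16-le', 'surrogatepass')


class GapIndexedString(object):
    """Hypothetical sketch: UTF-16 storage exposing only whole scalar values."""

    def __init__(self, text):
        self._units = to_utf16_units(text)

    def __len__(self):
        # Length stays in code units; the second half of a pair is a "gap".
        return len(self._units)

    def _starts_pair(self, i):
        units = self._units
        return (0 <= i < len(units) - 1
                and 0xD800 <= units[i] <= 0xDBFF
                and 0xDC00 <= units[i + 1] <= 0xDFFF)

    def _is_second_half(self, i):
        return i > 0 and self._starts_pair(i - 1)

    def __getitem__(self, index):
        if isinstance(index, slice):
            start, stop, step = index.indices(len(self._units))
            if step != 1:
                raise IndexError('only contiguous slices are sketched here')
            if self._is_second_half(start) or self._is_second_half(stop):
                raise IndexError('slice would split a surrogate pair')
            return from_utf16_units(self._units[start:stop])
        if index < 0:
            index += len(self._units)
        if not 0 <= index < len(self._units):
            raise IndexError('index out of range')
        if self._is_second_half(index):
            raise IndexError('index is the second half of a surrogate pair')
        width = 2 if self._starts_pair(index) else 1
        return from_utf16_units(self._units[index:index + width])

    def __iter__(self):
        i = 0
        while i < len(self._units):
            yield self[i]
            i += 2 if self._starts_pair(i) else 1


s = GapIndexedString('a\U00100000b')
print(len(s))       # 4 code units
print(list(s))      # three items: 'a', the whole pair, 'b'
print(s[1] == '\U00100000')
```

Here s[2] and s[0:2] both raise IndexError, since index 2 is the
second half of the pair, while s[1:3] returns the complete character.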

Note that this would not harm performance, nor would it affect
programs that already handle UTF-16 and UTF-32 correctly.

To show how things currently differ across platforms, here's an
example using UTF-32:

>>> a, b = u'\U00100000', u'\uFFFF'
>>> a, b
(u'\U00100000', u'\uffff')
>>> list(a), list(b)
([u'\U00100000'], [u'\uffff'])
>>> sorted([a, b])
[u'\uffff', u'\U00100000']

Now contrast the output of sorted() with what you get when using UTF-16:

>>> a, b = u'\U00100000', u'\uFFFF'
>>> a, b
(u'\U00100000', u'\uffff')
>>> list(a), list(b)
([u'\udbc0', u'\udc00'], [u'\uffff'])
>>> sorted([a, b])
[u'\U00100000', u'\uffff']

As you can see, the order has been reversed, because the sort operates
on code units rather than scalar values.
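
The same reversal can be reproduced on any current Python 3 by sorting
with explicit keys.  This is a sketch under the assumption that a
tuple of UTF-16 code units stands in for a UTF-16 build's comparison;
utf16_units and scalar_values are illustrative helper names.

```python
import struct


def utf16_units(text):
    """Return the UTF-16 code units of text as a tuple of ints."""
    data = text.encode('utf-16-le')
    return struct.unpack('<%dH' % (len(data) // 2), data)


def scalar_values(text):
    """Return the unicode scalar values of text as a tuple of ints."""
    return tuple(ord(ch) for ch in text)


a, b = '\U00100000', '\uFFFF'

# U+100000 becomes the pair (0xDBC0, 0xDC00), and 0xDBC0 < 0xFFFF,
# so comparing code units puts a before b.
by_units = sorted([a, b], key=utf16_units)

# Comparing scalar values puts b first, since 0xFFFF < 0x100000.
by_scalars = sorted([a, b], key=scalar_values)

# UTF-8 byte strings sort in the same order as scalar values.
by_utf8 = sorted([a, b], key=lambda s: s.encode('utf-8'))

print(by_units == [a, b], by_scalars == [b, a], by_utf8 == [b, a])
```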

Reasons to treat surrogates as indivisible:
* \U escapes and repr() already do this
* unichr() would work for all unicode scalar values, e.g. unichr(0x10000)
* "There is no separate character type; a character is represented by
a string of one item."
* iteration would be identical on all platforms
* sorting would be identical on all platforms
* UTF-8 or UTF-32 containing surrogates, or UTF-16 containing isolated
surrogates, are ill-formed[2].
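
The last point can be observed with the strict codecs in current
Python 3, which reject such data.  This is a minimal sketch; the two
helper names are made up for illustration.

```python
def encodes_as_utf8(text):
    """Return True if text can be encoded as well-formed UTF-8."""
    try:
        text.encode('utf-8')
    except UnicodeEncodeError:
        return False
    return True


def decodes_as_utf16(data):
    """Return True if data is well-formed little-endian UTF-16."""
    try:
        data.decode('utf-16-le')
    except UnicodeDecodeError:
        return False
    return True


# A complete scalar value is fine; an isolated surrogate is not.
print(encodes_as_utf8('\U00100000'), encodes_as_utf8('\udbc0'))

# The pair DBC0 DC00 decodes; a lone DC00 followed by 'A' does not.
print(decodes_as_utf16(b'\xc0\xdb\x00\xdc'), decodes_as_utf16(b'\x00\xdcA\x00'))
```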

Reasons against such a change:
* Breaks code which does range(len(s)) or enumerate(s).  This can be
worked around by using s = list(s) first.
* Breaks code which does s[s.find(sub)], where sub is a single
surrogate, expecting half a surrogate (why?).  Does NOT break code
where sub is always a single code unit, nor does it break code that
assumes a longer sub using s[s.find(sub):s.find(sub) + len(sub)]
* Alters the sort order on UTF-16 platforms (to match that of UTF-32
platforms, not to mention UTF-8 encoded byte strings)
* Current Python is fairly tolerant of ill-formed unicode data.
Changing this may break some code.  However, if you do have a need to
twiddle low-level UTF encodings, wouldn't the bytes type be better?
* "Nobody is forcing you to use characters above 0xFFFF".  This is a
strawman.  Unicode goes beyond 0xFFFF because real languages need it.
Software should not break just because the user speaks a different
language than the programmer.
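
The list(s) workaround from the first point amounts to grouping code
units into complete scalar values before indexing by position.  A
rough sketch, assuming the string is available as a list of UTF-16
code units (group_code_units is a hypothetical helper):

```python
def group_code_units(units):
    """Group UTF-16 code units into one tuple per scalar value."""
    groups = []
    i = 0
    while i < len(units):
        is_pair = (0xD800 <= units[i] <= 0xDBFF
                   and i + 1 < len(units)
                   and 0xDC00 <= units[i + 1] <= 0xDFFF)
        width = 2 if is_pair else 1
        groups.append(tuple(units[i:i + width]))
        i += width
    return groups


# 'a', U+100000 as a surrogate pair, 'b'
units = [0x0061, 0xDBC0, 0xDC00, 0x0062]
chars = group_code_units(units)

# After grouping, every index in range(len(chars)) is valid.
for i in range(len(chars)):
    print(i, ['%04X' % u for u in chars[i]])
```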

Thoughts, from all you readers out there?  For/against?  If there's
enough support I'll post the idea on python-3000.

[1] http://groups.google.com/group/comp.lang.python/browse_thread/thread/7e9327a896c242e7/4876e191831da6de
[2] Pages 23-24 of http://unicode.org/versions/Unicode4.0.0/ch03.pdf

-- 
Adam Olsen, aka Rhamphoryncus


