Python's handling of unicode surrogates

Fri Apr 20 01:02:52 EDT 2007

Adam Olsen:

> To solve this I propose Python's unicode type using UTF-16 should have
> gaps in its index, allowing it to only expose complete unicode scalar
> values.  Iteration would produce surrogate pairs rather than
> individual surrogates, indexing to the first half of a surrogate pair
> would produce the entire pair (indexing to the second half would raise
> IndexError), and slicing would be required to not separate a surrogate
> pair (IndexError otherwise).

    I expect that sequences with inaccessible indices will prove overly 
surprising. They will behave much like existing Python sequences, except 
that code which works perfectly well against other sequences will, very 
rarely, throw exceptions.

> Reasons to treat surrogates as undivisible:
> * \U escapes and repr() already do this
> * unichr(0x10000) would work on all unicode scalar values

    unichr could return a 2 code unit string without forcing surrogate 
indivisibility.
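
    As a rough sketch of what that could mean, here is the standard 
UTF-16 encoding arithmetic for one scalar value. The function name 
to_utf16_units is made up for illustration, code units are modelled as a 
list of ints, and the code is written for Python 3 rather than the 2.x 
unichr under discussion.

```python
def to_utf16_units(code_point):
    """Return the UTF-16 code units for one Unicode scalar value.

    Hypothetical helper: models what a 2 code unit result from
    unichr() would contain on a 16 bit build.
    """
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("code point out of range")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate code points are not scalar values")
    if code_point < 0x10000:
        # Basic Multilingual Plane: a single code unit.
        return [code_point]
    # Supplementary planes: split the 20 bit offset into two 10 bit halves.
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)
    low = 0xDC00 + (offset & 0x3FF)
    return [high, low]


print([hex(u) for u in to_utf16_units(0x10000)])  # ['0xd800', '0xdc00']
```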

> * "There is no separate character type; a character is represented by
> a string of one item."

    This could be amended to "a string of one or two items".

> * iteration would be identical on all platforms

    There could be a secondary iterator that iterates over characters 
rather than code units.
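
    A minimal sketch of such a secondary iterator follows. It assumes 
the string is available as a sequence of UTF-16 code units (ints); the 
name iter_code_points is hypothetical, and passing unpaired surrogates 
through unchanged is just one possible policy.

```python
def iter_code_points(units):
    """Yield code points from a sequence of UTF-16 code units.

    A high surrogate followed by a low surrogate is combined into one
    code point. Unpaired surrogates are yielded unchanged (an assumed
    policy; raising an error would be another option).
    """
    i = 0
    n = len(units)
    while i < n:
        unit = units[i]
        if (0xD800 <= unit <= 0xDBFF and i + 1 < n
                and 0xDC00 <= units[i + 1] <= 0xDFFF):
            yield 0x10000 + ((unit - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            i += 2
        else:
            yield unit
            i += 1


# 'A', U+1D11E (as a surrogate pair), 'B'
units = [0x41, 0xD834, 0xDD1E, 0x42]
print([hex(cp) for cp in iter_code_points(units)])  # ['0x41', '0x1d11e', '0x42']
```

    Plain iteration over the code units stays available, so existing 
code that indexes with range(len(s)) is unaffected.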

> * sorting would be identical on all platforms

    This should be fixable in the current scheme.
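
    One known way to do this, not necessarily the fix that would be 
chosen for Python, is to remap code units before comparing so that 
surrogates sort above U+E000..U+FFFF. The key function below is a 
hypothetical illustration operating on lists of UTF-16 code units, and it 
assumes well-formed input.

```python
def code_point_order_key(units):
    """Sort key giving code point order for well-formed UTF-16 code units.

    Raw code unit order puts surrogates (0xD800-0xDFFF) below
    0xE000-0xFFFF, although they encode code points above 0xFFFF.
    The remapping swaps the two ranges while preserving order inside each.
    """
    def fix(unit):
        if unit >= 0xE000:
            return unit - 0x800    # 0xE000-0xFFFF -> 0xD800-0xF7FF
        if unit >= 0xD800:
            return unit + 0x2000   # 0xD800-0xDFFF -> 0xF800-0xFFFF
        return unit
    return [fix(u) for u in units]


bmp = [0xFF5E]              # U+FF5E
supp = [0xD800, 0xDC00]     # U+10000
print(sorted([bmp, supp]) == [supp, bmp])                            # True
print(sorted([bmp, supp], key=code_point_order_key) == [bmp, supp])  # True
```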

> * UTF-8 or UTF-32 containing surrogates, or UTF-16 containing isolated
> surrogates, are ill-formed[2].

    It would be interesting to see how far specifying (and enforcing) 
UTF-16 over the current implementation would take us. That is, the 
16-bit Unicode implementation would raise an exception if an operation 
would produce an unpaired surrogate or other ill-formed sequence. 
Single-element indexing is a problem, although it could yield a 
non-string type.
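
    To make the idea concrete, here is a sketch of the kind of check 
this implies, applied to slicing. The names check_well_formed and 
checked_slice are invented for this example, code units are modelled as 
a list of ints, and ValueError is an arbitrary choice of exception.

```python
def check_well_formed(units):
    """Return units unchanged, or raise ValueError on an unpaired surrogate."""
    i = 0
    n = len(units)
    while i < n:
        unit = units[i]
        if 0xD800 <= unit <= 0xDBFF:
            # A high surrogate must be followed by a low surrogate.
            if i + 1 >= n or not 0xDC00 <= units[i + 1] <= 0xDFFF:
                raise ValueError("unpaired high surrogate at index %d" % i)
            i += 2
        elif 0xDC00 <= unit <= 0xDFFF:
            # A low surrogate with no preceding high surrogate.
            raise ValueError("unpaired low surrogate at index %d" % i)
        else:
            i += 1
    return units


def checked_slice(units, start, stop):
    """Slice code units, refusing a result that splits a surrogate pair."""
    return check_well_formed(units[start:stop])


s = [0x41, 0xD834, 0xDD1E, 0x42]
checked_slice(s, 1, 3)       # fine: the whole pair
# checked_slice(s, 0, 2)     # would raise ValueError: the pair is split
```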

> Reasons against such a change:
> * Breaks code which does range(len(s)) or enumerate(s).  This can be
> worked around by using s = list(s) first.

    The code will work happily for the implementor and then break when 
exposed to a surrogate.

> * "Nobody is forcing you to use characters above 0xFFFF".  This is a
> strawman.  Unicode goes beyond 0xFFFF because real languages need it.
> Software should not break just because the user speaks a different
> language than the programmer.

    Characters over 0xFFFF are *very* rare. Most of the Supplementary 
Multilingual Plane is for historical languages and I don't think there 
are any surviving Phoenician speakers. Maybe the extra mathematical 
signs or musical symbols will prove useful once software and fonts are 
implemented for these ranges. The Supplementary Ideographic Plane is 
historic Chinese and may have more users.

    I think that effort would be better spent on an implementation that 
appears to be UTF-32 but uses UTF-16 internally. The vast majority of 
the time, no surrogates will be present, so operations can be simple and 
fast. When a string contains a surrogate, a flag is flipped and all 
operations go through more complex and slower code paths. This way, 
consumers of the type see a simple, consistent interface which will not 
report strange errors when used.
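
    A toy sketch of the flag idea, to show the shape of it rather than a 
real implementation: the class name FlaggedString and its attributes are 
hypothetical, it is written for Python 3, and it assumes the input text 
contains no unpaired surrogates. Only len() and integer indexing are 
shown.

```python
class FlaggedString:
    """Hypothetical string type: UTF-16 storage, code point interface."""

    def __init__(self, text):
        units = []
        for ch in text:
            cp = ord(ch)
            if cp < 0x10000:
                units.append(cp)
            else:
                cp -= 0x10000
                units.append(0xD800 + (cp >> 10))
                units.append(0xDC00 + (cp & 0x3FF))
        self._units = units
        # The flag: set only when at least one surrogate pair was stored.
        self._has_pairs = len(units) != len(text)

    def _code_points(self):
        """Slow path: walk the code units, combining surrogate pairs."""
        units = self._units
        i, n = 0, len(units)
        while i < n:
            unit = units[i]
            if (0xD800 <= unit <= 0xDBFF and i + 1 < n
                    and 0xDC00 <= units[i + 1] <= 0xDFFF):
                yield 0x10000 + ((unit - 0xD800) << 10) + (units[i + 1] - 0xDC00)
                i += 2
            else:
                yield unit
                i += 1

    def __len__(self):
        if not self._has_pairs:
            return len(self._units)                  # fast path
        return sum(1 for _ in self._code_points())   # slow path

    def __getitem__(self, index):
        if not isinstance(index, int):
            raise TypeError("only integer indices are supported in this sketch")
        length = len(self)
        if index < 0:
            index += length
        if not 0 <= index < length:
            raise IndexError("string index out of range")
        if not self._has_pairs:
            return chr(self._units[index])           # fast path: O(1)
        for position, cp in enumerate(self._code_points()):
            if position == index:                    # slow path: O(n) scan
                return chr(cp)


s = FlaggedString("a\U0001D11Eb")
print(len(s), s[1] == "\U0001D11E", s[2])  # 3 True b
```

    Every index from 0 to len(s) - 1 is valid, so range(len(s)) and 
enumerate(s) style code keeps working. The cost is that indexing becomes 
a linear scan once the flag is set; a real implementation would probably 
want an index cache or similar.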

    BTW, I just implemented support for the supplementary planes 
(surrogates, 4-byte UTF-8 sequences) in Scintilla, a text editing 
component.

    Neil