Python's handling of unicode surrogates
nyamatongwe+thunder at gmail.com
Fri Apr 20 07:02:52 CEST 2007
> To solve this I propose Python's unicode type using UTF-16 should have
> gaps in its index, allowing it to only expose complete unicode scalar
> values. Iteration would produce surrogate pairs rather than
> individual surrogates, indexing to the first half of a surrogate pair
> would produce the entire pair (indexing to the second half would raise
> IndexError), and slicing would be required to not separate a surrogate
> pair (IndexError otherwise).
I expect having sequences with inaccessible indices will prove
overly surprising. They will behave quite similar to existing Python
sequences except when code that works perfectly well against other
sequences throws exceptions very rarely.
> Reasons to treat surrogates as undivisible:
> * \U escapes and repr() already do this
> * unichr(0x10000) would work on all unicode scalar values
unichr could return a 2 code unit string without forcing surrogate
> * "There is no separate character type; a character is represented by
> a string of one item."
Could amend this to "a string of one or two items".
> * iteration would be identical on all platforms
There could be a secondary iterator that iterates over characters
rather than code units.
> * sorting would be identical on all platforms
This should be fixable in the current scheme.
> * UTF-8 or UTF-32 containing surrogates, or UTF-16 containing isolated
> surrogates, are ill-formed.
It would be interesting to see how far specifying (and enforcing)
UTF-16 over the current implementation would take us. That is for the 16
bit Unicode implementation raising an exception if an operation would
produce an unpaired surrogate or other error. Single element indexing is
a problem although it could yield a non-string type.
> Reasons against such a change:
> * Breaks code which does range(len(s)) or enumerate(s). This can be
> worked around by using s = list(s) first.
The code will work happily for the implementor and then break when
exposed to a surrogate.
> * "Nobody is forcing you to use characters above 0xFFFF". This is a
> strawman. Unicode goes beyond 0xFFFF because real languages need it.
> Software should not break just because the user speaks a different
> language than the programmer.
Characters over 0xFFFF are *very* rare. Most of the Supplementary
Multilingual Plane is for historical languages and I don't think there
are any surviving Phoenician speakers. Maybe the extra mathematical
signs or musical symbols will prove useful one software and fonts are
implemented for these ranges. The Supplementary Ideographic Plane is
historic Chinese and may have more users.
I think that effort would be better spent on an implementation that
appears to be UTF-32 but uses UTF-16 internally. The vast majority of
the time, no surrogates will be present, so operations can be simple and
fast. When a string contains a surrogate, a flag is flipped and all
operations go through more complex and slower code paths. This way,
consumers of the type see a simple, consistent interface which will not
report strange errors when used.
BTW, I just implemented support for supplemental planes (surrogates,
4 byte UTF-8 sequences) for Scintilla, a text editing component.
More information about the Python-list