Python's handling of unicode surrogates

Fri Apr 20 02:41:18 EDT 2007

On Apr 19, 11:02 pm, Neil Hodgson <nyamatongwe+thun... at gmail.com>
wrote:
> Adam Olsen:
>
> > To solve this I propose Python's unicode type using UTF-16 should have
> > gaps in its index, allowing it to only expose complete unicode scalar
> > values.  Iteration would produce surrogate pairs rather than
> > individual surrogates, indexing to the first half of a surrogate pair
> > would produce the entire pair (indexing to the second half would raise
> > IndexError), and slicing would be required to not separate a surrogate
> > pair (IndexError otherwise).
>
>     I expect having sequences with inaccessible indices will prove
> overly surprising. They will behave quite similar to existing Python
> sequences except when code that works perfectly well against other
> sequences throws exceptions very rarely.

"Errors should never pass silently."

The only way I can think of to make surrogates unsurprising would be
to use UTF-8, thereby bombarding programmers with variable-length
characters.

> > Reasons to treat surrogates as undivisible:
> > * \U escapes and repr() already do this
> > * unichr(0x10000) would work on all unicode scalar values
>
>     unichr could return a 2 code unit string without forcing surrogate
> indivisibility.

Indeed.  I was actually surprised that it didn't, originally I had it
listed with \U and repr().

> > * "There is no separate character type; a character is represented by
> > a string of one item."
>
>      Could amend this to "a string of one or two items".
>
> > * iteration would be identical on all platforms
>
>     There could be a secondary iterator that iterates over characters
> rather than code units.

But since you should use that iterator 90%+ of the time, why not make
it the default?

> > * sorting would be identical on all platforms
>
>     This should be fixable in the current scheme.

True.

> > * UTF-8 or UTF-32 containing surrogates, or UTF-16 containing isolated
> > surrogates, are ill-formed[2].
>
>     It would be interesting to see how far specifying (and enforcing)
> UTF-16 over the current implementation would take us. That is for the 16
> bit Unicode implementation raising an exception if an operation would
> produce an unpaired surrogate or other error. Single element indexing is
> a problem although it could yield a non-string type.

Err, what would be the point in having a non-string type when you
could just as easily produce a string containing the surrogate pair?
That's half my proposal.

> > Reasons against such a change:
> > * Breaks code which does range(len(s)) or enumerate(s).  This can be
> > worked around by using s = list(s) first.
>
>     The code will work happily for the implementor and then break when
> exposed to a surrogate.

The code may well break already.  I just make it explicit.

> > * "Nobody is forcing you to use characters above 0xFFFF".  This is a
> > strawman.  Unicode goes beyond 0xFFFF because real languages need it.
> > Software should not break just because the user speaks a different
> > language than the programmer.
>
>     Characters over 0xFFFF are *very* rare. Most of the Supplementary
> Multilingual Plane is for historical languages and I don't think there
> are any surviving Phoenician speakers. Maybe the extra mathematical
> signs or musical symbols will prove useful one software and fonts are
> implemented for these ranges. The Supplementary Ideographic Plane is
> historic Chinese and may have more users.

Yet Unicode has deemed them worth including anyway.  I see no reason
to make them more painful then they have to be.

A program written to use them today would most likely a) avoid
iteration, and b) replace indexes with slices (s[i] -> s[i:i
+len(sub)].  If they need iteration they'll have to reimplement it,
providing the exact behaviour I propose.  Or they can recompile Python
to use UTF-32, but why shouldn't such features be available by
default?

>     I think that effort would be better spent on an implementation that
> appears to be UTF-32 but uses UTF-16 internally. The vast majority of
> the time, no surrogates will be present, so operations can be simple and
> fast. When a string contains a surrogate, a flag is flipped and all
> operations go through more complex and slower code paths. This way,
> consumers of the type see a simple, consistent interface which will not
> report strange errors when used.

Your solution would require code duplication and would be slower.  My
solution would have no duplication and would not be slower.  I like
mine. ;)

>     BTW, I just implemented support for supplemental planes (surrogates,
> 4 byte UTF-8 sequences) for Scintilla, a text editing component.

I dream of a day when complete unicode support is universal.  With
enough effort we may get there some day. :)

--
Adam Olsen, aka Rhamphoryncus