Python's handling of unicode surrogates

Fri Apr 20 05:52:38 EDT 2007

On 20 Apr, 07:02, Neil Hodgson <nyamatongwe+thun... at gmail.com> wrote:
> Adam Olsen:
>
> > To solve this I propose Python's unicode type using UTF-16 should have
> > gaps in its index, allowing it to only expose complete unicode scalar
> > values.  Iteration would produce surrogate pairs rather than
> > individual surrogates, indexing to the first half of a surrogate pair
> > would produce the entire pair (indexing to the second half would raise
> > IndexError), and slicing would be required to not separate a surrogate
> > pair (IndexError otherwise).
>
>     I expect having sequences with inaccessible indices will prove
> overly surprising. They will behave quite similar to existing Python
> sequences except when code that works perfectly well against other
> sequences throws exceptions very rarely.
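
To make the behaviour under discussion concrete, here is a minimal sketch of a sequence with gaps in its index, written for Python 3. The class name GappedUTF16 and everything in it are hypothetical and exist only to illustrate the proposal; this is not an existing Python type, and slicing is left out.

```python
class GappedUTF16:
    """Hypothetical UTF-16 backed sequence whose index has gaps."""

    def __init__(self, text):
        # Store the text as a list of 16-bit UTF-16 code units.
        data = text.encode("utf-16-le")
        self.units = [int.from_bytes(data[i:i + 2], "little")
                      for i in range(0, len(data), 2)]

    def __len__(self):
        # Length is counted in code units, so gap indices are included.
        return len(self.units)

    def __getitem__(self, i):
        if not 0 <= i < len(self.units):
            raise IndexError(i)
        unit = self.units[i]
        if 0xDC00 <= unit <= 0xDFFF:
            # Second half of a pair: the inaccessible index.
            raise IndexError("index %d is inside a surrogate pair" % i)
        if 0xD800 <= unit <= 0xDBFF:
            # First half of a pair: return the whole scalar value.
            low = self.units[i + 1]
            return chr(0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00))
        return chr(unit)
```

With s = GappedUTF16("a\U00010000b"), s[0], s[1] and s[3] succeed while s[2] raises IndexError, so a generic loop over range(len(s)) fails only when a supplementary character happens to be present.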

This thread and the other one have been quite educational, and I've
been looking through some of the background material on the topic. I
think the intention in PEP 261 [1] and the surrounding discussion was
that people should be able to treat Unicode objects as sequences of
characters, even though GvR's summary [2] in that discussion defines a
character as representing a code point, not a logical character. In
such a scheme, characters should be indexed contiguously, and if
people want to access surrogate pairs, there should be a method (or
module function) to expose that information for individual (logical)
characters.
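
As a rough idea of what such a module function might look like, here is a sketch in Python 3. The name surrogate_pair is made up for illustration and is not part of any existing API; the arithmetic is the standard UTF-16 encoding of a code point above U+FFFF.

```python
def surrogate_pair(ch):
    """Return the (high, low) UTF-16 surrogates for a single character,
    or None if the character is in the Basic Multilingual Plane."""
    cp = ord(ch)
    if cp < 0x10000:
        return None
    offset = cp - 0x10000
    # The top 10 bits go in the high surrogate, the bottom 10 in the low.
    return (0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF))


pair = surrogate_pair("\U0001D11E")  # MUSICAL SYMBOL G CLEF
# pair == (0xD834, 0xDD1E)
```

Code that only cares about characters never needs to call this; the encoding detail is available on request rather than leaking into indexing.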

> > Reasons to treat surrogates as undivisible:
> > * \U escapes and repr() already do this
> > * unichr(0x10000) would work on all unicode scalar values
>
>     unichr could return a 2 code unit string without forcing surrogate
> indivisibility.

This would work with the "substring in string" and
"string.index(substring)" pseudo-sequence API. However, once you've
got a character as a Unicode object, surely the nature of the encoded
character is only of peripheral interest. The Unicode API doesn't
return two or more values per character for those in the Basic
Multilingual Plane read from a UTF-8 source - that's an inconsequential
detail at that particular point.
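
The UTF-8 comparison can be checked with a small sketch. It assumes Python 3.3 or later, where str is indexed by code point; on the narrow builds under discussion in this thread the last length would come out as 2 instead. The helper name code_units is invented for the example.

```python
def code_units(text, encoding):
    """Count the code units needed to encode text (hypothetical helper)."""
    unit_size = {"utf-8": 1, "utf-16-le": 2, "utf-32-le": 4}[encoding]
    return len(text.encode(encoding)) // unit_size


euro = "\u20ac"      # EURO SIGN, in the Basic Multilingual Plane
supp = "\U00010000"  # first code point outside the BMP

# Three UTF-8 bytes, but nobody expects three "characters" back.
euro_utf8_units = code_units(euro, "utf-8")
# Two UTF-16 code units (a surrogate pair) for one code point.
supp_utf16_units = code_units(supp, "utf-16-le")
# The character-level view is the same in both cases.
lengths = (len(euro), len(supp))
```

Here euro_utf8_units is 3 and supp_utf16_units is 2, yet both strings have length 1 once decoded.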

[...]

>     I think that effort would be better spent on an implementation that
> appears to be UTF-32 but uses UTF-16 internally. The vast majority of
> the time, no surrogates will be present, so operations can be simple and
> fast. When a string contains a surrogate, a flag is flipped and all
> operations go through more complex and slower code paths. This way,
> consumers of the type see a simple, consistent interface which will not
> report strange errors when used.

I think PEP 261 was mostly concerned with providing a "good enough"
solution until such time as a better solution could be devised.
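
For what it's worth, the flag-based approach Neil describes could be sketched roughly as follows in Python 3. The class HybridString and its attribute names are hypothetical, only integer indexing is handled, and a real implementation would presumably be written in C with an index cache rather than a linear scan on the slow path.

```python
class HybridString:
    """Looks like a sequence of code points, stores UTF-16 code units."""

    def __init__(self, text):
        data = text.encode("utf-16-le")
        self._units = [int.from_bytes(data[i:i + 2], "little")
                       for i in range(0, len(data), 2)]
        # The flag: set once if any surrogate is present.
        self._has_surrogates = any(0xD800 <= u <= 0xDFFF
                                   for u in self._units)

    def _code_points(self):
        # Slow path: walk the code units, joining surrogate pairs.
        units, i = self._units, 0
        while i < len(units):
            u = units[i]
            if 0xD800 <= u <= 0xDBFF:
                low = units[i + 1]
                yield 0x10000 + ((u - 0xD800) << 10) + (low - 0xDC00)
                i += 2
            else:
                yield u
                i += 1

    def __len__(self):
        if not self._has_surrogates:
            return len(self._units)  # fast path
        return sum(1 for _ in self._code_points())

    def __getitem__(self, index):
        n = len(self)
        if index < 0:
            index += n
        if not 0 <= index < n:
            raise IndexError(index)
        if not self._has_surrogates:
            return chr(self._units[index])  # fast path: direct lookup
        for pos, cp in enumerate(self._code_points()):
            if pos == index:
                return chr(cp)
```

With h = HybridString("a\U00010000b"), len(h) is 3 and h[1] is the whole supplementary character, so indices stay contiguous and no index raises unexpectedly.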

>     BTW, I just implemented support for supplemental planes (surrogates,
> 4 byte UTF-8 sequences) for Scintilla, a text editing component.

Do we have a volunteer? ;-)

Paul

[1] http://www.python.org/dev/peps/pep-0261/
[2] http://mail.python.org/pipermail/i18n-sig/2001-June/001107.html