Python's handling of unicode surrogates
Rhamphoryncus
rhamph at gmail.com
Fri Apr 20 02:41:18 EDT 2007
On Apr 19, 11:02 pm, Neil Hodgson <nyamatongwe+thun... at gmail.com>
wrote:
> Adam Olsen:
>
> > To solve this I propose Python's unicode type using UTF-16 should have
> > gaps in its index, allowing it to only expose complete unicode scalar
> > values. Iteration would produce surrogate pairs rather than
> > individual surrogates, indexing to the first half of a surrogate pair
> > would produce the entire pair (indexing to the second half would raise
> > IndexError), and slicing would be required to not separate a surrogate
> > pair (IndexError otherwise).
>
> I expect having sequences with inaccessible indices will prove
> overly surprising. They will behave quite similar to existing Python
> sequences except when code that works perfectly well against other
> sequences throws exceptions very rarely.
"Errors should never pass silently."
The only way I can think of to make surrogates unsurprising would be
to use UTF-8, thereby bombarding programmers with variable-length
characters.
> > Reasons to treat surrogates as indivisible:
> > * \U escapes and repr() already do this
> > * unichr(0x10000) would work on all unicode scalar values
>
> unichr could return a 2 code unit string without forcing surrogate
> indivisibility.
Indeed. I was actually surprised that it didn't; originally I had it
listed with \U and repr().
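For reference, here is a minimal sketch of the arithmetic a two-code-unit unichr() would need. It assumes Python 3 and models UTF-16 code units as a plain list of ints; the name to_utf16_units is hypothetical and not part of Python.

```python
def to_utf16_units(code_point):
    """Return the UTF-16 code units (as ints) for one Unicode scalar value.

    Hypothetical helper for illustration only; not an existing Python API.
    """
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate code points are not scalar values")
    if code_point < 0x10000:
        # Basic Multilingual Plane: one code unit.
        return [code_point]
    # Supplementary planes: split the 20-bit offset into two 10-bit halves.
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)
    low = 0xDC00 + (offset & 0x3FF)
    return [high, low]
```

With this, to_utf16_units(0x10000) gives the pair 0xD800, 0xDC00, which is the same pair a \U00010000 escape produces on a UTF-16 build.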
> > * "There is no separate character type; a character is represented by
> > a string of one item."
>
> Could amend this to "a string of one or two items".
>
> > * iteration would be identical on all platforms
>
> There could be a secondary iterator that iterates over characters
> rather than code units.
But since you should use that iterator 90%+ of the time, why not make
it the default?
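A rough sketch of what such a character iterator could look like, whether it ends up secondary or default. It assumes Python 3, takes the UTF-16 code units as a sequence of ints, and yields each character as a tuple of one or two units; iter_scalars is a made-up name, and raising ValueError on an isolated surrogate is my assumption about the error handling.

```python
def iter_scalars(units):
    """Yield each character of a UTF-16 code unit sequence as a tuple.

    Hypothetical sketch: a surrogate pair comes out as one 2-tuple,
    everything else as a 1-tuple. Isolated surrogates raise ValueError.
    """
    units = list(units)
    i = 0
    while i < len(units):
        unit = units[i]
        if 0xD800 <= unit <= 0xDBFF:
            # High surrogate: it must be followed by a low surrogate.
            if i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
                yield (unit, units[i + 1])
                i += 2
                continue
            raise ValueError("isolated high surrogate at index %d" % i)
        if 0xDC00 <= unit <= 0xDFFF:
            raise ValueError("isolated low surrogate at index %d" % i)
        yield (unit,)
        i += 1
```

Anyone who needs correct iteration on a UTF-16 build today has to write roughly this loop by hand.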
> > * sorting would be identical on all platforms
>
> This should be fixable in the current scheme.
True.
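One way it could be fixed, as a sketch: the well-known code unit fix-up that makes UTF-16 compare in code point order without decoding. This assumes Python 3 and strings modeled as lists of int code units; code_point_order_key is a hypothetical name.

```python
def code_point_order_key(units):
    """Sort key that orders UTF-16 code unit lists by code point.

    Hypothetical sketch. Surrogates (0xD800-0xDFFF) are shifted above
    0xE000-0xFFFF, so supplementary characters sort after the whole BMP.
    """
    key = []
    for unit in units:
        if unit >= 0xE000:
            unit -= 0x800    # move 0xE000-0xFFFF down to 0xD800-0xF7FF
        elif unit >= 0xD800:
            unit += 0x2000   # move surrogates up to 0xF800-0xFFFF
        key.append(unit)
    return key


strings = [[0xFFFF], [0xD800, 0xDC00], [0x41]]
by_code_unit = sorted(strings)
by_code_point = sorted(strings, key=code_point_order_key)
```

Sorting by raw code units puts U+10000 (0xD800, 0xDC00) before U+FFFF; with the key, U+FFFF comes first, matching what a UTF-32 build would do.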
> > * UTF-8 or UTF-32 containing surrogates, or UTF-16 containing isolated
> > surrogates, are ill-formed[2].
>
> It would be interesting to see how far specifying (and enforcing)
> UTF-16 over the current implementation would take us. That is for the 16
> bit Unicode implementation raising an exception if an operation would
> produce an unpaired surrogate or other error. Single element indexing is
> a problem although it could yield a non-string type.
Err, what would be the point in having a non-string type when you
could just as easily produce a string containing the surrogate pair?
That's half my proposal.
> > Reasons against such a change:
> > * Breaks code which does range(len(s)) or enumerate(s). This can be
> > worked around by using s = list(s) first.
>
> The code will work happily for the implementor and then break when
> exposed to a surrogate.
The code may well break already. I just make it explicit.
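To make the failure mode concrete, here is a toy model of the proposed indexing, not an implementation of it. It assumes Python 3, well-formed input, and code units stored as ints; GappedString is a hypothetical class, and only integer indexes are handled.

```python
class GappedString:
    """Toy model of a UTF-16 string with gaps in its index.

    Hypothetical sketch: assumes the code units are well-formed UTF-16.
    Characters are returned as tuples of one or two code units.
    """

    def __init__(self, units):
        self._units = list(units)

    def __len__(self):
        # Length still counts code units, so the index has gaps.
        return len(self._units)

    def __getitem__(self, index):
        if not isinstance(index, int):
            raise TypeError("this sketch supports integer indexes only")
        if index < 0:
            index += len(self._units)
        if not 0 <= index < len(self._units):
            raise IndexError("index out of range")
        unit = self._units[index]
        if 0xDC00 <= unit <= 0xDFFF:
            # Second half of a pair: the proposal raises instead of splitting.
            raise IndexError("index points into the middle of a surrogate pair")
        if 0xD800 <= unit <= 0xDBFF:
            # First half of a pair: return the entire pair.
            return (unit, self._units[index + 1])
        return (unit,)
```

A loop over range(len(s)) on GappedString([0x41, 0xD834, 0xDD1E, 0x42]) gets the whole pair at index 1 and an IndexError at index 2, instead of quietly handing back half a character.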
> > * "Nobody is forcing you to use characters above 0xFFFF". This is a
> > strawman. Unicode goes beyond 0xFFFF because real languages need it.
> > Software should not break just because the user speaks a different
> > language than the programmer.
>
> Characters over 0xFFFF are *very* rare. Most of the Supplementary
> Multilingual Plane is for historical languages and I don't think there
> are any surviving Phoenician speakers. Maybe the extra mathematical
> signs or musical symbols will prove useful once software and fonts are
> implemented for these ranges. The Supplementary Ideographic Plane is
> historic Chinese and may have more users.
Yet Unicode has deemed them worth including anyway. I see no reason
to make them more painful than they have to be.
A program written to use them today would most likely a) avoid
iteration, and b) replace indexes with slices (s[i] -> s[i:i+len(sub)]).
If they need iteration they'll have to reimplement it,
providing the exact behaviour I propose. Or they can recompile Python
to use UTF-32, but why shouldn't such features be available by
default?
> I think that effort would be better spent on an implementation that
> appears to be UTF-32 but uses UTF-16 internally. The vast majority of
> the time, no surrogates will be present, so operations can be simple and
> fast. When a string contains a surrogate, a flag is flipped and all
> operations go through more complex and slower code paths. This way,
> consumers of the type see a simple, consistent interface which will not
> report strange errors when used.
Your solution would require code duplication and would be slower. My
solution would have no duplication and would not be slower. I like
mine. ;)
> BTW, I just implemented support for supplemental planes (surrogates,
> 4 byte UTF-8 sequences) for Scintilla, a text editing component.
I dream of a day when complete unicode support is universal. With
enough effort we may get there some day. :)
--
Adam Olsen, aka Rhamphoryncus