[I18n-sig] Unicode surrogates: just say no!

Guido van Rossum guido@digicool.com
Tue, 26 Jun 2001 04:51:38 -0400


I'm trying to reset this discussion to come to some sort of
conclusion.  There's been a lot of useful input; I believe I've read
and understood it all.  May the new thread subject serve as a summary
of my position. :-)

Terminology: "character" is a Unicode code point; "unit" is a storage
unit, i.e. a 16-bit or 32-bit value.  A "surrogate pair" is two
16-bit storage units with special values that represent a single
character.  I'll use "surrogate" for a single storage unit whose value
indicates that it should be part of a surrogate pair.  The variable u
is a Python Unicode string object of some sort.
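
For concreteness, here is the arithmetic behind surrogate pairs (just
the standard UTF-16 mapping written out as a sketch; it is not code
from any of the options below):

    def to_surrogate_pair(c):
        # c is a character (code point) in the range 0x10000..0x10FFFF
        assert 0x10000 <= c <= 0x10FFFF
        v = c - 0x10000
        return 0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)

    def from_surrogate_pair(hi, lo):
        # hi is a high surrogate (0xD800..0xDBFF), lo a low one (0xDC00..0xDFFF)
        assert 0xD800 <= hi <= 0xDBFF and 0xDC00 <= lo <= 0xDFFF
        return 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00)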

There are several possible options for representing Unicode strings:

1. The current situation.  I'd say that this uses UCS-2 for storage;
   it doesn't pay any attention to surrogates.  u[i] might be a lone
   surrogate.  unichr(i) where i is a lone surrogate value returns a
   string containing a lone surrogate.  An application could use the
   unicode data type to store UTF-16 units, but it would have to be
   aware of all the rules pertaining to surrogates.  The codecs,
   however, are surrogate-unaware.  (Am I right that even the UTF-16
   codec pays *no* special attention to surrogates?)

2. The compromise proposal.  This uses true UTF-16 for storage and
   changes the interface to always deal in characters.  unichr(i)
   where i is a lone surrogate is illegal, and so are the
   corresponding \u and \U encodings.  unichr(i) for 0x10000 <= i <=
   0x10ffff will return a one-character string that happens to be
   represented using a surrogate pair, but there's no way in Python to
   find out (short of knowing the implementation); see the sketch after
   this list.  Codecs that are
   capable of encoding full Unicode need to be aware of surrogate
   pairs.

3. The ideal situation.  This uses UCS-4 for storage and doesn't
   require any support for surrogates except in the UTF-16 codecs (and
   maybe in the UTF-8 codecs; it seems that encoded surrogate pairs
   are legal in UTF-8 streams but should be converted back to a single
   character).  It's unclear to me whether the (illegal, according to
   the Unicode standard) "characters" whose numerical value looks like
   a lone surrogate should be entirely ruled out here, or whether a
   dedicated programmer could create strings containing these.  We
   could make it hard by declaring unichr(i) with surrogate i and \u
   and \U escapes that encode surrogates illegal, and by adding
   explicit checks to codecs as appropriate, but a C extension could
   still create an array containing illegal characters unless we do
   draconian input checking.
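
To make the contrast concrete, here is a tiny sketch (plain Python,
with a made-up helper name) of what unichr(i) would have to store
under each option; none of this is actual implementation code:

    def storage_units(i, option):
        # What unichr(i) would store under each option ("option" is 1, 2 or 3).
        if option != 1 and 0xD800 <= i <= 0xDFFF:
            # definitely illegal under option 2; an open question under 3
            raise ValueError("lone surrogate")
        if i < 0x10000:
            return [i]                     # one unit, one character
        if option == 1:
            raise ValueError("option 1 cannot represent characters >= 0x10000")
        if option == 2:
            v = i - 0x10000                # split into a surrogate pair:
            return [0xD800 + (v >> 10),    # two units, but ONE character
                    0xDC00 + (v & 0x3FF)]  # as far as the application sees
        return [i]                         # option 3: one 32-bit unit per character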

Option 1, which does not reasonably support characters >= 0x10000, has
clear problems, and these will grow with time, hence the current
discussion.

As a solution, option 2 seems to be most popular; this must be because
it appears to promise the most efficient storage solution while
allowing the largest range of characters to be represented without
extra effort from the application.

I'd like to argue that option 2 is REALLY BAD, given where we are, and
that we should provide an upgrade path directly from 1 to 3 instead.

The main problem with option 2 is that it breaks the correspondence
between storage unit indices and character indices, and given Python's
reliance on indexing and slicing for string operations, we need a way
to keep the indexing operation (u[i]) efficient, as in O(1).

Tim suggested a reasonable way to implement 2 efficiently: add a
"finger" to each unicode object that caches the last used index
(mapping the character index to the storage unit index).  This can be
used efficiently to walk through the characters in sequence.  Of
course, we would also have to store the length of the string in
characters (so len(u) can be computed efficiently) as well as in
storage units (so the implementation can efficiently know the storage
boundaries).
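
Very roughly, the idea looks like this in Python (a sketch of a
C-level data structure; the names and the naive cache policy are mine,
not Tim's):

    class UTF16String:
        # Character-based indexing over 16-bit storage units, with a
        # "finger" caching the last (character, unit) position used.

        def __init__(self, units):
            self.units = units
            # length in characters: every unit except a trailing (low)
            # surrogate starts a character
            self.char_len = len([u for u in units
                                 if not (0xDC00 <= u <= 0xDFFF)])
            self.finger = (0, 0)            # (character index, unit index)

        def __len__(self):
            return self.char_len            # length in characters

        def char_at(self, k):
            ci, ui = self.finger
            if k < ci:                      # going backwards: restart (a real
                ci, ui = 0, 0               # implementation could do better)
            while ci < k:                   # walk forward from the finger
                if 0xD800 <= self.units[ui] <= 0xDBFF:
                    ui = ui + 2             # skip a surrogate pair
                else:
                    ui = ui + 1
                ci = ci + 1
            self.finger = (ci, ui)
            u = self.units[ui]
            if 0xD800 <= u <= 0xDBFF:       # reassemble the pair into one character
                return 0x10000 + ((u - 0xD800) << 10) + (self.units[ui+1] - 0xDC00)
            return u

Note that this only keeps u[i] cheap for (nearly) sequential access;
truly random access still degenerates to a scan from the start.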

Martin has hinted at a solution requiring even less memory per string
object, but I don't know for sure what he is thinking of.  All I can
imagine is a single flag saying "this string contains no surrogates".

But either way, I believe that this requires that every part of the
Unicode implementation be changed to become aware of the difference
between characters and storage units.  Every piece of C code that
currently deals with indices into arrays of Py_UNICODE storage units
will have to be changed.

This would have to be one gigantic patch, just to change the basic
Unicode object implementation.  The assumption that storage indices
and character indices are the same thing appears in almost every
function.

And then think of the required changes to the SRE engine.  It
currently assumes a strict character <--> storage unit equivalence
throughout.  In order to support option 2 correctly, it would have to
become surrogate-aware.  There are two parts to this: the internal
engine needs to realize that e.g. "." and certain "[...]" sets may
match a surrogate pair, and the indices returned by e.g. the span()
method of match objects should be translated to character indices as
expected by the applications.
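
For the second part alone, the fix amounts to a translation like this
(a hypothetical helper, nothing like the actual SRE code):

    def unit_to_char_index(units, unit_index):
        # Map a storage-unit offset (what the engine knows) back to a
        # character offset (what span() should report).  O(n) as written;
        # the real thing would want to be smarter.
        chars = 0
        ui = 0
        while ui < unit_index:
            if 0xD800 <= units[ui] <= 0xDBFF:
                ui += 2
            else:
                ui += 1
            chars += 1
        return chars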

On the other hand, the changes needed to support option 3 are minimal.
Fredrik claims that SRE already supports this (or at least it's very
close); Tim has looked over the source code of the Unicode object
implementation and has not found any code that would break if
Py_UNICODE were changed to a 32-bit int type.  (There must be some
breakage, since the code as it stands won't build on machines where
sizeof(short) != 2, but it's got to be a very shallow problem.)

I see only one remaining argument against choosing 3 over 2: FUD about
disk and primary memory space usage.

(I can't believe that anyone would still worry about the extra CPU
time, after Fredrik's report that SRE is about as fast with 4-byte
characters as it is with 2.  In any case this is secondary to the memory
space issue, as it is only related to the extra cycles needed to move
twice as many bytes around; the cost of most algorithms is determined
mostly by the number of characters (or storage units) processed rather
than by the number of bytes.)

I think the disk space usage problem is dealt with easily by choosing
appropriate encodings; UTF-8 and UTF-16 are both great space-savers,
and I doubt many sites will store large amounts of UCS-4 directly,
given that good codecs are available.
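
A quick back-of-the-envelope illustration (Python 2 syntax; the byte
counts are for this particular string, the utf-16 codec writes a
2-byte BOM, and since there is no UCS-4 codec yet that figure is just
len(s)*4):

    s = u"Hello, \u4e16\u754c"        # 7 ASCII characters plus 2 CJK characters
    print len(s.encode("utf-8"))      # 13 bytes
    print len(s.encode("utf-16"))     # 20 bytes, including the BOM
    print len(s) * 4                  # 36 bytes if stored as raw UCS-4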

The primary memory space problem will go away with time; assuming that
most textual documents contain at most a few million characters,
it's already not that much of a problem on modern machines.
Applications that are required to deal efficiently with larger
documents should support some way of streaming or chunking the data
anyway.

The only remaining question is how to provide an upgrade path to
option 3:

A. At some Python version, we switch.

B. Choose between 1 and 3 based on the platform.

C. Make it a configuration-time choice.

D. Make it a run-time choice.

I think we all agree that D is bad.  I'd say that C is the best;
eventually (say, when Windows is fixed :-) the choice becomes
unnecessary.  I don't think it will be hard to support C, with some
careful coding.

Politically, I think C will also look best to the users -- it allows
sites to make their own decision based on storage needs (i.e. do they
have the main memory it takes) and compatibility requirements (i.e. do
they need the full Unicode set).  I don't think interoperability will
be much of a problem, since file exchanges should use encodings.  Oh
yes, we'll need a UCS-4 codec or two (one for each byte order).
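
(The encoding side of such a codec is little more than a wrapper
around struct; a sketch, with a made-up function name and Python 2
string types:)

    import struct

    def encode_ucs4(u, byteorder):
        # byteorder is "<" (little-endian) or ">" (big-endian); each
        # character becomes one 4-byte unsigned value.
        return "".join([struct.pack(byteorder + "L", ord(c)) for c in u])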

We could use B to determine the default choice, e.g. we could choose
between option 1 and 3 depending on the platform's wchar_t; but it
would be bad not to have a way to override this default, so we
couldn't exploit the correspondence much.  Some code could be
#ifdef'ed out when Py_UNICODE == wchar_t, but there would always have
to be code to support these two having different sizes.

The outcome of the choice must be available at run-time, because it
may affect certain codecs.  Maybe sys.maxunicode could be the largest
character value supported, i.e. 0xffff or 0x10ffff?
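
A codec (or application) could then do something like this;
sys.maxunicode is only a suggestion here, it doesn't exist yet:

    import sys

    # 0xFFFF if we're limited to 16-bit storage units, larger on a
    # UCS-4 build; default to the narrow value if the attribute is missing.
    maxchar = getattr(sys, "maxunicode", 0xFFFF)

    if maxchar <= 0xFFFF:
        # narrow build: the codec must emit/accept surrogate pairs itself
        pass
    else:
        # wide build: characters >= 0x10000 are just ordinary characters
        pass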


A different way to look at it: if we had wanted to use a
variable-length internal representation, we should have picked UTF-8
way back, like Perl did.  Moving to a UTF-16-based internal
representation now will give us all the problems of the Perl choice
without any of the benefits.

--Guido van Rossum (home page: http://www.python.org/~guido/)