[I18n-sig] Unicode surrogates: just say no!

Tue, 26 Jun 2001 11:54:36 +0200

Guido van Rossum wrote:
> 
> I'm trying to reset this discussion to come to some sort of
> conclusion.  There's been a lot of useful input; I believe I've read
> and understood it all.  May the new thread subject serve as a summary
> of my position. :-)
> 
> Terminology: "character" is a Unicode code point; "unit" is a storage
> unit, i.e. a 16-bit or 32-bit value.  A "surrogate pair" is two
> 16-bit storage units with special values that represent a single
> character.  I'll use "surrogate" for a single storage unit whose value
> indicates that it should be part of a surrogate pair.  The variable u
> is a Python Unicode string object of some sort.
> 
> There are several possible options for representing Unicode strings:
> 
> 1. The current situation.  I'd say that this uses UCS-2 for storage;
>    it doesn't pay any attention to surrogates.  u[i] might be a lone
>    surrogate.  unichr(i) where i is a lone surrogate value returns a
>    string containing a lone surrogate.  An application could use the
>    unicode data type to store UTF-16 units, but it would have to be
>    aware of all the rules pertaining to surrogates.  The codecs,
>    however, are surrogate-unaware.  (Am I right that even the UTF-16
>    codec pays *no* special attention to surrogates?)

The UTF-16 decoder will raise an exception if it sees a surrogate.
The encoder writes the internal format as-is without checking for
surrogate usage.

The UTF-8 codec is fully surrogate aware: the decoder will translate
the input into UTF-16 surrogates if necessary, and the encoder
will translate UTF-16 surrogates into the UTF-8 representation
of the code point.
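
To make the surrogate arithmetic concrete, here is a minimal sketch of
how a UTF-16 surrogate pair maps to a single code point and back. It is
written for a present-day Python 3 interpreter and is not the codec
implementation discussed here; the helper names combine_surrogates and
split_code_point are hypothetical and exist only for illustration.

```python
def combine_surrogates(high, low):
    """Combine a UTF-16 high/low surrogate pair into one code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


def split_code_point(cp):
    """Split a code point above the BMP into a (high, low) surrogate pair."""
    if not (0x10000 <= cp <= 0x10FFFF):
        raise ValueError("code point does not need a surrogate pair")
    offset = cp - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


# The first and last code points reachable through surrogate pairs.
assert combine_surrogates(0xD800, 0xDC00) == 0x10000
assert split_code_point(0x10FFFF) == (0xDBFF, 0xDFFF)

# A UTF-8 encoder emits the combined code point as one 4-byte sequence.
utf8_bytes = chr(combine_surrogates(0xD800, 0xDC00)).encode("utf-8")
assert utf8_bytes == b"\xf0\x90\x80\x80"
```

The pair carries 20 bits of payload (10 per surrogate), which is why the
reachable range ends at 0x10FFFF.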

> 2. The compromise proposal.  This uses true UTF-16 for storage and
>    changes the interface to always deal in characters.  unichr(i)
>    where i is a lone surrogate is illegal, and so are the
>    corresponding \u and \U encodings.  unichr(i) for 0x10000 <= i <
>    0x110000 will return a one-character string that happens to be
>    represented using a surrogate pair, but there's no way in Python to
>    find out (short of knowing the implementation).  Codecs that are
>    capable of encoding full Unicode need to be aware of surrogate
>    pairs.
> 
> 3. The ideal situation.  This uses UCS-4 for storage and doesn't
>    require any support for surrogates except in the UTF-16 codecs (and
>    maybe in the UTF-8 codecs; it seems that encoded surrogate pairs
>    are legal in UTF-8 streams but should be converted back to a single
>    character).

Surrogate support is required in all Unicode codecs (UTF-n,
unicode-escape and raw-unicode-escape).

>    It's unclear to me whether the (illegal, according to
>    the Unicode standard) "characters" whose numerical value looks like
>    a lone surrogate should be entirely ruled out here, or whether a
>    dedicated programmer could create strings containing these. 

As Mark Davis told me, isolated surrogates are legal code
points, but the resulting sequence is not a legal Unicode
character sequence, since these code points (like a few others
as well) are not considered characters.

After all this discussion and the feedback from the Unicode
mailing list, I think we should leave surrogate handling
solely to the codecs and not deal with them in the internal
storage. That is, it is the application's responsibility to
make sure to create proper sequences of code points which can
be used as character sequences.

The codecs, OTOH, should be aware of what is and what is not
considered a legal sequence. The default handling should be to
follow the Unicode Consortium standard. If someone wants to
have additional codecs which implement the ISO 10646 view of things
with respect to UTF-n handling, then these can easily be supported
by codec extension packages.
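
As an illustration of this split between storage and codecs, here is a
small sketch using a present-day Python 3 interpreter (an assumption; it
does not reflect the implementation under discussion). The string type
stores a lone surrogate without complaint, the default UTF-8 codec
rejects it, and a more lenient view is an explicit opt-in. The helper
strict_utf8 is a hypothetical name used only for this example.

```python
# A lone high surrogate: a legal code point, but not a character.
lone = "\ud800"
assert len(lone) == 1  # the storage layer does not object


def strict_utf8(text):
    """Encode with the default (strict) UTF-8 codec; None if rejected."""
    try:
        return text.encode("utf-8")
    except UnicodeEncodeError:
        return None


# The codec, not the string type, enforces what is a legal sequence.
assert strict_utf8(lone) is None
assert strict_utf8("\U00010000") == b"\xf0\x90\x80\x80"

# A lenient treatment must be requested explicitly via an error handler.
lenient = lone.encode("utf-8", errors="surrogatepass")
assert lenient == b"\xed\xa0\x80"
```

The design point is the same as above: applications decide which code
point sequences they build, and the codecs decide what may cross the
encoding boundary.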

>    We
>    could make it hard by declaring unichr(i) with surrogate i and \u
>    and \U escapes that encode surrogates illegal, and by adding
>    explicit checks to codecs as appropriate, but a C extension could
>    still create an array containing illegal characters unless we do
>    draconian input checking.

See above: it's better to leave these decisions to the applications
using the Unicode implementation.

> ...choose option 3...
>
> The only remaining question is how to provide an upgrade path to
> option 3:
> 
> A. At some Python version, we switch.

Like Fredrik said: as soon as the implementation is ready.

> B. Choose between 1 and 3 based on the platform.
> 
> C. Make it a configuration-time choice.
> 
> D. Make it a run-time choice.

I'd rather not make it a choice: let's go with UCS-4 and be
done with these problems once and for all!

As a side effect, you could then also enjoy Unicode on Crays :-)

Instead of adding an option which allows selecting between
2 or 4 bytes per code unit, I think people would rather like
to see an option for disabling Unicode support completely (I know
that the Pippy Team would :-).

-- 
Marc-Andre Lemburg
CEO eGenix.com Software GmbH
______________________________________________________________________
Company & Consulting:                           http://www.egenix.com/
Python Software:                        http://www.lemburg.com/python/