[I18n-sig] Unicode surrogates: just say no!

Wed, 27 Jun 2001 16:54:37 -0400

> Guido van Rossum wrote:
> > 
> >..
> > 
> > Users can choose to write code that's portable between the two
> > versions by using surrogates on the narrow platform but not on the
> > wide platform.  (This would be a good idea for backward compatibility
> > with Python 2.0 and 2.1 anyway.)  The proposed (and current!) behavior
> > of \U makes it easy for them to do the right thing with string
> > literals; everything else, they just have to write code that won't
> > separate surrogate halves.
> 
> What is the virtue in making the literal syntax easy and making unichr()
> easy when everything else is hard? Counting characters is hard.
> Addressing characters reliably is hard. Slicing reliably is hard. Why
> not simplify things? Surrogates are just characters. If you want to
> handle wide characters you need to build Python that way.
> 
> I'm trying to imagine the use-case where you care about surrogates
> enough to want them to be automatically generated but not enough to care
> about slicing and addressing and counting and ...and is this use-case
> worth breaking the invariant that len(unichr(i))==1.
> 
> Surrogates: Just say no. :)

\U has supported surrogate creation since Python 2.0 was released, but
I can't find a clear statement about it in PEP 100 (a.k.a.
Misc/unicode.txt; \U was added after that was finalized).

The use case I've been assuming is simple enough: someone wants to
print "Hello World" in Klingon.  They have a printing routine that
takes Unicode, but only an ASCII keyboard.  They look up the Unicode
values for the Klingon characters spelling "Hello World" in Klingon on
the web.  The characters happen to be in plane 17.  Do we really want
to place the additional burden on them to (a) figure out if their
Python interpreter uses UCS-2 or UCS-4, and (b) correctly implement
the surrogate creation algorithm on the UCS-2 platform?  I don't think
we should.
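
For reference, here is a minimal sketch of what (a) and (b) would
involve if users had to do it by hand.  The function names are
hypothetical, not part of Python.  The sketch assumes that a narrow
(UCS-2) build reports sys.maxunicode == 0xFFFF, and it uses the
standard UTF-16 arithmetic for splitting a code point above U+FFFF
into a high and a low surrogate.

```python
import sys


def is_narrow_build():
    # Assumption: a narrow (UCS-2) interpreter reports 0xFFFF here;
    # a wide (UCS-4) interpreter reports 0x10FFFF.
    return sys.maxunicode == 0xFFFF


def to_surrogate_pair(code_point):
    """Split a code point in U+10000..U+10FFFF into (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point outside the surrogate-encodable range")
    offset = code_point - 0x10000       # 20-bit value
    high = 0xD800 + (offset >> 10)      # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)     # bottom 10 bits
    return high, low


def from_surrogate_pair(high, low):
    """Recombine a (high, low) surrogate pair into a single code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
```

For example, to_surrogate_pair(0x1D11E) gives (0xD834, 0xDD1E), and
from_surrogate_pair(0xD834, 0xDD1E) gives back 0x1D11E.  It is not
hard, but it is easy to get the offsets wrong, which is the burden I
would rather not put on the user.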

--Guido van Rossum (home page: http://www.python.org/~guido/)