[I18n-sig] Unicode surrogates: just say no!

Machin, John JMachin@Colonial.com.au
Thu, 28 Jun 2001 08:50:13 +1000

The "nice pair of invariants" for unichr() and ord() seems to involve
what I call "all that variable-length mucking about" and Tim more
robustly called "crap".

IMO, there should be a very short list of places where a narrow
Unicode implementation needs to know anything at all about
surrogates. This short list will include codecs, the
\Uxxxxxxxx notation for literals, and unichr() --- users can
ship surrogates into the warehouse and ship them out again, but they
won't be processed as anything other than 16-bit values.  Attempts to
place other items on the list should be rigorously justified.
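
To make the unichr() item concrete, here is a rough sketch of the
splitting a narrow build would have to do for code points above 0xFFFF.
The name narrow_unichr is hypothetical, and returning a list of 16-bit
code units is only an assumption made for illustration; it is not how
any actual interpreter stores strings.

```python
HIGH_START = 0xD800  # first high (leading) surrogate
LOW_START = 0xDC00   # first low (trailing) surrogate


def narrow_unichr(code_point):
    """Return the 16-bit code units a narrow build would store.

    Hypothetical helper: BMP code points yield one unit, anything
    above 0xFFFF yields a high/low surrogate pair.
    """
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("code point out of range")
    if code_point < 0x10000:
        return [code_point]
    offset = code_point - 0x10000
    # Top 10 bits go into the high surrogate, bottom 10 into the low one.
    return [HIGH_START + (offset >> 10), LOW_START + (offset & 0x3FF)]


print([hex(unit) for unit in narrow_unichr(0x10000)])
```

After this point the two units are just 16-bit values; nothing else in
the sketch needs to know they form a pair.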

Guido asked:
   What should ord(u'\uD800\uDC00') mean on a wide interpreter? 

IMO, this should raise an exception on *both* narrow and wide
interpreters, just as ord("xy") does. ord() should expect one
and only one *character*.
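
As a small illustration of the strict rule, the snippet below checks
which strings the built-in ord() refuses. It assumes a Python 3
interpreter, which postdates this thread and has no narrow/wide split;
the helper name rejects is made up for the example.

```python
def rejects(s):
    """Return True if the built-in ord() refuses the string s."""
    try:
        ord(s)
    except TypeError:
        return True
    return False


# Two ordinary characters, a surrogate pair written as two \u escapes,
# and a single non-BMP character written with \U.
for candidate in ("xy", "\ud800\udc00", "\U00010000"):
    print(ascii(candidate), len(candidate), rejects(candidate))
```

Under that assumption the first two strings have length two and are
refused, while the third is one character and is accepted.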

Let's just keep on saying no!

-----Original Message-----
From: Guido van Rossum [mailto:guido@digicool.com]
Sent: Thursday, 28 June 2001 5:57
To: Paul Prescod
Cc: i18n-sig@python.org
Subject: Re: [I18n-sig] Unicode surrogates: just say no!

> Guido van Rossum wrote:
> > 
> >...
> > 
> > Oooh, hadn't thought of that, but yes, it makes sense!
> > 
> > Not yet implemented, but I think it should.  Makes for a nice pair
> > of invariants:
> > 
> >   unichr(ord('\Udddddddd')) == '\Udddddddd'
> >   ord(unichr(0xdddddddd)) == 0xdddddddd
> > 
> > regardless of whether we're using UCS-2 or UCS-4 storage.
> 
> I'm going to presume that ord should accept surrogate pairs on both
> narrow and wide interpreters.

That's a separate question.  On wide interpreters, surrogate pairs
"shouldn't" exist if the app plays by the rules.  But they're easily
created of course!  What should ord(u'\uD800\uDC00') mean on a wide
interpreter?  I think it's nice if you support this.  Of course, if a
length-two Unicode string is anything other than a high surrogate
followed by a low surrogate, calling ord() on it should be illegal.
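
A minimal sketch of that rule, assuming Python 3 string semantics (where
a \u surrogate escape yields one element per surrogate). The function
name ord_accepting_pairs is hypothetical and is not an existing API.

```python
def ord_accepting_pairs(s):
    """Like ord(), but also accept a high surrogate followed by a low one.

    Hypothetical helper: any other length-two string, or any longer
    string, is rejected just as ord("xy") would be.
    """
    if len(s) == 1:
        return ord(s)
    if len(s) == 2:
        hi, lo = ord(s[0]), ord(s[1])
        if 0xD800 <= hi <= 0xDBFF and 0xDC00 <= lo <= 0xDFFF:
            # Recombine the two 10-bit halves into one code point.
            return 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00)
    raise TypeError("expected one character or a surrogate pair")


print(hex(ord_accepting_pairs("\ud800\udc00")))
```

A reversed pair (low surrogate first) or two ordinary characters falls
through to the TypeError.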

--Guido van Rossum (home page: http://www.python.org/~guido/)
