[I18n-sig] How does Python Unicode treat surrogates?

Guido van Rossum guido@digicool.com
Tue, 20 Feb 2001 14:36:35 -0500

On the XML sig the following exchange happened.  I don't know enough
about the issues to investigate, but I'm sure that someone here can
provide insight?  It seems to boil down to whether or not surrogates
may get transposed when moving between platforms.

--Guido van Rossum (home page: http://www.python.org/~guido/)

------- Forwarded Message

Date:    Tue, 20 Feb 2001 11:54:34 -0700
From:    Uche Ogbuji <uche.ogbuji@fourthought.com>
To:      Guido van Rossum <guido@digicool.com>
cc:      Lars Marius Garshol <larsga@garshol.priv.no>, xml-sig@python.org
Subject: Re: [XML-SIG] DC DOM tests (Was: Roadmap document - finally!) 

> > > > - DOMString and text manipulating interface methods are not
> > > >   tested beyond ASCII text due to an implementation limitation
> > > >   of ParsedXML.DOM. So, implementations will not be tested if
> > > >   text is correctly treated when multi-byte UTF-16 characters
> > > >   are involved.
> > > 
> > > By "multi-byte UTF-16 characters" I assume you mean Unicode
> > > characters outside the BMP that are represented using two
> > > surrogates?
> > 
> > I wonder if that's what Martijn means.  I've read that most Java
> > implementations have trouble with characters outside the BMP.  I
> > wonder if Python handles these properly.
> Depends on what you call properly.  Can you elaborate on what you
> would call proper treatment here?

Sure.  I admit it's hearsay, but I thought I'd read that, because Java's
Unicode handling is or was underspecified, the high surrogate and the
low surrogate of a pair could end up transposed between Java
implementations or platforms.

Now I don't exactly write XML dissertations on "Hello Kitty" <g>, so
I'm not likely to run into this myself, but I was wondering whether
Python handles surrogate blocks appropriately across platforms and
implementations (I guess including CPython -> JPython).
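
To make the question concrete, here is a minimal sketch of what "not
transposed" would mean for one character outside the BMP.  It assumes a
modern Python 3 interpreter, and the helper names (to_surrogates,
from_surrogates, utf16_units) are invented for illustration; they are
not part of any library.  The idea is that the high surrogate must come
first and the low surrogate second, whatever the byte order of the
platform.

```python
import struct


def to_surrogates(cp):
    """Split a code point above U+FFFF into (high, low) UTF-16 surrogates."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("code point must be in U+10000..U+10FFFF")
    offset = cp - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits of the offset
    return high, low


def from_surrogates(high, low):
    """Recombine a (high, low) pair; a transposed pair is rejected."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a high surrogate followed by a low surrogate")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


def utf16_units(text, byteorder):
    """Return the 16-bit code units of text as encoded for one byte order.

    byteorder is "big" or "little" (a hypothetical parameter for this sketch).
    """
    big = byteorder == "big"
    data = text.encode("utf-16-be" if big else "utf-16-le")
    fmt = (">" if big else "<") + "%dH" % (len(data) // 2)
    return struct.unpack(fmt, data)


if __name__ == "__main__":
    cp = 0x10400  # DESERET CAPITAL LETTER LONG I, outside the BMP
    high, low = to_surrogates(cp)
    print(hex(high), hex(low))
    # Byte order changes the bytes inside each unit, not the unit order.
    print([hex(u) for u in utf16_units(chr(cp), "big")])
    print([hex(u) for u in utf16_units(chr(cp), "little")])
```

If the two code units ever arrived in the opposite order,
from_surrogates would raise ValueError rather than silently produce a
different character, which is one way a transposition between
implementations could be detected.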

Uche Ogbuji                               Principal Consultant
uche.ogbuji@fourthought.com               +1 303 583 9900 x 101
Fourthought, Inc.                         http://Fourthought.com 
4735 East Walnut St, Ste. C, Boulder, CO 80301-2537, USA
Software-engineering, knowledge-management, XML, CORBA, Linux, Python

------- End of Forwarded Message