[I18n-sig] How does Python Unicode treat surrogates?

M.-A. Lemburg mal@lemburg.com
Wed, 21 Feb 2001 13:39:26 +0100


Guido van Rossum wrote:
> 
> On the XML sig the following exchange happened.  I don't know enough
> about the issues to investigate, but I'm sure that someone here can
> provide insight?  It seems to boil down to whether or not surrogates
> may get transposed when moving between platforms.

The Python Unicode implementation assumes that the internal
storage is using UTF-16 *without* surrogates. As a result the
storage scheme is the same as UCS-2. This is by design, since
surrogates open a whole new can of worms (they make
UTF-16 a variable-length encoding).
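
To illustrate the variable-length point, here is a minimal sketch. It
assumes a present-day Python 3 interpreter, whose string type is not
the implementation described above, and the helper name utf16_units
is made up for this example.

```python
def utf16_units(ch):
    """Return how many 16-bit code units UTF-16 needs to encode ch."""
    # "utf-16-be" writes no byte order mark, so every 2 bytes is one code unit.
    return len(ch.encode("utf-16-be")) // 2


bmp_char = "\u20ac"          # EURO SIGN, inside the BMP
non_bmp_char = "\U00010000"  # first code point beyond the BMP

print(utf16_units(bmp_char))      # 1
print(utf16_units(non_bmp_char))  # 2 (a high and a low surrogate)
```

As long as only BMP characters are stored, every character takes
exactly one 16-bit unit, which is why the scheme coincides with UCS-2.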

Still, there are some codecs (utf-8, utf-16, unicode-escape)
which try to handle surrogates properly. The support
for surrogates is not complete though, so I wouldn't rely on it.

Note that UTF-16 surrogates are only needed to reach Unicode
code points beyond the BMP. AFAIK, there are plans to fill this
area in the next Unicode version, but the designers are very
well aware of the issues this imposes on existing implementations:
Windows and Java are based on Unicode 2.0, which is not capable of
handling code points outside the BMP.
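
For reference, the mapping between a code point beyond the BMP and
its surrogate pair can be sketched as below. This follows the UTF-16
definition, not any code in the Python Unicode implementation; the
function names are hypothetical and a present-day Python 3
interpreter is assumed.

```python
def to_surrogate_pair(code_point):
    """Split a code point beyond the BMP into (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not beyond the BMP")
    offset = code_point - 0x10000          # 20-bit value
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low


def from_surrogate_pair(high, low):
    """Combine a (high, low) surrogate pair back into one code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid high/low surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


high, low = to_surrogate_pair(0x10000)
print(hex(high), hex(low))                 # 0xd800 0xdc00
print(hex(from_surrogate_pair(high, low))) # 0x10000
```

Because the high and low surrogates live in disjoint ranges, a pair
whose halves have been swapped fails the range check in
from_surrogate_pair instead of silently decoding to another character.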

Does this answer your question?

> --Guido van Rossum (home page: http://www.python.org/~guido/)
> 
> ------- Forwarded Message
> 
> Date:    Tue, 20 Feb 2001 11:54:34 -0700
> From:    Uche Ogbuji <uche.ogbuji@fourthought.com>
> To:      Guido van Rossum <guido@digicool.com>
> cc:      Lars Marius Garshol <larsga@garshol.priv.no>, xml-sig@python.org
> Subject: Re: [XML-SIG] DC DOM tests (Was: Roadmap document - finally!)
> 
> > > > > - DOMString and text manipulating interface methods are not
> > > > >   tested beyond ASCII text due to an implementation limitation
> > > > >   of ParsedXML.DOM. So, implementations will not be tested if
> > > > >   text is correctly treated when multi-byte UTF-16 characters
> > > > >   are involved.
> > > >
> > > > By "multi-byte UTF-16 characters" I assume you mean Unicode
> > > > characters outside the BMP that are represented using two
> > > > surrogates?
> > >
> > > I wonder if that's what Martijn means.  I've read that most Java
> > > implementations have trouble with characters outside the BMP.  I
> > > wonder if Python handles these properly.
> >
> > Depends on what you call properly.  Can you elaborate on what you
> > would call proper treatment here?
> 
> Sure.  I admit it's hearsay, but I thought I'd read that because Java
> Unicode is or was underspecified, that there was the possibility of
> transposition of the high-surrogate with the low-surrogate character
> between Java implementations or platforms.
> 
> Now I don't exactly write XML dissertations on "Hello Kitty" <g>, so
> I'm not likely to run into this myself, but I was wondering whether
> Python handles surrogate blocks appropriately across platforms and
> implementations (I guess including CPython -> JPython).
> 
> --
> Uche Ogbuji                               Principal Consultant
> uche.ogbuji@fourthought.com               +1 303 583 9900 x 101
> Fourthought, Inc.                         http://Fourthought.com
> 4735 East Walnut St, Ste. C, Boulder, CO 80301-2537, USA
> Software-engineering, knowledge-management, XML, CORBA, Linux, Python
> 
> ------- End of Forwarded Message

-- 
Marc-Andre Lemburg
______________________________________________________________________
Company & Consulting:                           http://www.egenix.com/
Python Pages:                           http://www.lemburg.com/python/