[Python-Dev] 2.2 Unicode questions

M.-A. Lemburg mal@lemburg.com
Thu, 19 Jul 2001 19:41:47 +0200


Guido van Rossum wrote:
> 
> > > Untrue: it supports range(0x110000) (in UCS-2 mode this returns a
> > > surrogate pair).  Now, maybe that's not what it *should* do...
> >
> > It should definitely not, unless you want to break code which assumes
> > that chr() and unichr() always return a single byte/code unit !
> 
> Reasonable people can disagree about this.
> 
> > This was part of the UCS-4 checkins which hadn't had time yet to
> > review. Should I remove the surrogate part for narrow builds ?
> 
> Well, this snuck into the 2.2a1, so hopefully we'll get some comments
> ("love it" / "hate it") from the field to guide our decision.

Waiting for comments from the field :-) 
 
> > > > and there's no \code{\e U} notation for embedding characters
> > > > greater than 65535 in a Unicode string literal.
> > >
> > > Not true either -- correct \U has been part of Python since 2.0.  It
> > > does the same thing as unichr() described above.
> >
> > Right.
> >
> > Note that in this case, the handling of surrogates is needed
> > to make the unicode-escape encoding roundtrip safe.
> 
> I don't understand what this means.  Can you give an example?

It means that the roundtrip Unicode -> encoding -> Unicode is a
1-1 mapping for all Unicode code points: decoding the encoded data
always gives back the original string. Other examples of
roundtrip-safe encodings are UTF-8 and UTF-16.
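
To illustrate what roundtrip safety means, here is a minimal sketch.
It assumes a current Python 3 interpreter rather than the 2.2 code
base discussed here, and the helper name roundtrips() and the SAMPLES
list are made up for the example.

```python
# Sketch only: checks Unicode -> encoding -> Unicode for a few sample
# strings. roundtrips() and SAMPLES are hypothetical names chosen for
# this illustration.

SAMPLES = [
    "abc",          # plain ASCII
    "\u00e9",       # Latin-1 range
    "\u20ac",       # BMP character above Latin-1
    "\U00010000",   # first code point outside the BMP
    "\U0010ffff",   # last Unicode code point
]


def roundtrips(text, encoding):
    """Return True if encoding and then decoding gives back `text`."""
    return text.encode(encoding).decode(encoding) == text


for encoding in ("unicode_escape", "utf-8", "utf-16"):
    results = [roundtrips(s, encoding) for s in SAMPLES]
    print(encoding, all(results))
```

The samples deliberately leave out lone surrogates, since strict UTF-8
and UTF-16 refuse to encode them.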

Looking at the code, I found that the unicode-escape encoder
does not convert Unicode surrogates to \UXXXXXXXX escapes.
I'll fix that.
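
As a rough sketch of the conversion involved (not the actual C
encoder, and not necessarily how the fix will look), a surrogate pair
can be combined into one code point and written as a single
\UXXXXXXXX escape. The function name escape_surrogate_pair() is
hypothetical.

```python
# Sketch only: combine a UTF-16 surrogate pair into one code point and
# format it as a \UXXXXXXXX escape. escape_surrogate_pair() is a
# hypothetical helper, not part of Python.

def escape_surrogate_pair(high, low):
    """Return the \\UXXXXXXXX escape for the pair (high, low)."""
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError("not a high surrogate: %#x" % high)
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError("not a low surrogate: %#x" % low)
    # Each surrogate carries 10 bits of the code point's offset
    # from 0x10000.
    code_point = 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
    return "\\U%08x" % code_point


print(escape_surrogate_pair(0xD800, 0xDC00))
```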

-- 
Marc-Andre Lemburg
CEO eGenix.com Software GmbH
______________________________________________________________________
Consulting & Company:                           http://www.egenix.com/
Python Software:                        http://www.lemburg.com/python/