[Python-Dev] 2.2 Unicode questions
M.-A. Lemburg
mal@lemburg.com
Thu, 19 Jul 2001 19:41:47 +0200
Guido van Rossum wrote:
>
> > > Untrue: it supports range(0x110000) (in UCS-2 mode this returns a
> > > surrogate pair). Now, maybe that's not what it *should* do...
> >
> > It should definitely not, unless you want to break code which assumes
> > that chr() and unichr() always return a single byte/code unit !
>
> Reasonable people can disagree about this.
>
> > This was part of the UCS-4 checkins which I hadn't had time yet to
> > review. Should I remove the surrogate part for narrow builds ?
>
> Well, this snuck into the 2.2a1, so hopefully we'll get some comments
> ("love it" / "hate it") from the field to guide our decision.
Waiting for comments from the field :-)
> > > > and there's no \code{\e U} notation for embedding characters
> > > > greater than 65535 in a Unicode string literal.
> > >
> > > Not true either -- correct \U has been part of Python since 2.0. It
> > > does the same thing as unichr() described above.
> >
> > Right.
> >
> > Note that in this case, the handling of surrogates is needed
> > to make the unicode-escape encoding roundtrip safe.
>
> I don't understand what this means. Can you give an example?
It means that the roundtrip Unicode -> encoding -> Unicode is a
1-1 mapping for all Unicode code points. Other examples of
roundtrip-safe encodings are UTF-8 and UTF-16.
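
A minimal sketch of what "roundtrip safe" means in practice. It is
written for a current Python 3 interpreter, not for 2.2, and the helper
name roundtrips() and the sample string are made up for illustration:

```python
def roundtrips(text, encoding):
    """Return True if text survives encode -> decode unchanged."""
    return text.encode(encoding).decode(encoding) == text

# Hypothetical sample: ASCII, Latin-1, BMP and one non-BMP code point.
sample = "A\u00e9\u20ac\U00010400"

# Check each codec mentioned in the thread.
results = {
    enc: roundtrips(sample, enc)
    for enc in ("utf-8", "utf-16", "unicode_escape")
}

for enc, ok in results.items():
    print(enc, "roundtrip safe for sample:", ok)
```

A codec only counts as roundtrip safe if this holds for every code
point, so a check on one sample string is evidence, not proof.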
Looking at the code, I found that the unicode-escape encoder
does not convert Unicode surrogates to \UXXXXXXXX escapes.
I'll fix that.
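
For reference, a rough sketch of the arithmetic such a fix needs: folding
a UTF-16 surrogate pair back into one code point and writing it as a
\UXXXXXXXX escape. This is not the actual encoder code; the function
names are hypothetical and it assumes a current Python 3 interpreter:

```python
def split_code_point(cp):
    """Split a code point above 0xFFFF into a (high, low) surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("code point is outside the supplementary range")
    offset = cp - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)

def combine_surrogates(high, low):
    """Fold a (high, low) surrogate pair back into a single code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

def escape_pair(high, low):
    """Format a surrogate pair as a single \\UXXXXXXXX escape."""
    return "\\U%08x" % combine_surrogates(high, low)

high, low = split_code_point(0x10400)
print(hex(high), hex(low), escape_pair(high, low))
```

Decoding the resulting escape yields the original code point again,
which is what makes the mapping 1-1.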
--
Marc-Andre Lemburg
CEO eGenix.com Software GmbH
______________________________________________________________________
Consulting & Company: http://www.egenix.com/
Python Software: http://www.lemburg.com/python/