Re: [Python-Dev] PEP 393 Summer of Code Project

Aug. 26, 2011

      On Wed, Aug 24, 2011 at 11:37 PM, Terry Reedy <tjreedy@udel.edu> wrote:
On 8/24/2011 1:45 PM, Victor Stinner wrote:
On 24/08/2011 02:46, Terry Reedy wrote:
I don't think that using UTF-16 with surrogate pairs is really a big
problem. A lot of work has been done to hide this. For example,
repr(chr(0x10ffff)) now displays '\U0010ffff' instead of two characters.
Ezio recently fixed the str.is*() methods in Python 3.2+.
I greatly appreciate that he did. The * (lower, upper, title) methods
apparently are not fixed yet as the corresponding new tests are currently
skipped for narrow builds.
There are two reasons for this:
1) the str.is* methods get the string and return True/False, so it's enough
to iterate on the string, combine the surrogates, and check if the result
islower/upper/etc.
Methods like lower/upper/etc, afaiu, currently get only a copy of the
string, and modify that in place.  The current macros advance to the next
char during reading and writing, so it's not possible to use them to
read/write from/to the same string.  We could either change the macros to
not advance the pointer [0] (and do it manually in the other functions like
is*) or change the function to get the original string too.
2) I'm on vacation.
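
To make point 1 concrete, here is a rough Python sketch of the "iterate, combine
the surrogates, check the result" approach. It is not the C code from the
CPython tree: to_utf16_units, iter_code_points and narrow_isupper are
hypothetical names, and a list of integers stands in for the Py_UNICODE buffer
of a narrow build.

```python
import unicodedata


def to_utf16_units(s):
    """Return the UTF-16 code units of s as a list of ints (stand-in for a narrow buffer)."""
    data = s.encode("utf-16-le", "surrogatepass")
    return [int.from_bytes(data[i:i + 2], "little") for i in range(0, len(data), 2)]


def iter_code_points(units):
    """Yield code points from UTF-16 code units, joining valid surrogate pairs."""
    i = 0
    n = len(units)
    while i < n:
        u = units[i]
        is_pair = (
            0xD800 <= u <= 0xDBFF
            and i + 1 < n
            and 0xDC00 <= units[i + 1] <= 0xDFFF
        )
        if is_pair:
            # Combine the high and low surrogate into one non-BMP code point.
            yield 0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            i += 2
        else:
            # BMP code unit (or a lone surrogate, passed through unchanged).
            yield u
            i += 1


def narrow_isupper(units):
    """Hypothetical is*-style check: read-only, so advancing while reading is enough."""
    cased = False
    for cp in iter_code_points(units):
        ch = chr(cp)
        if ch.islower() or unicodedata.category(ch) == "Lt":
            return False
        if ch.isupper():
            cased = True
    return cased


units = to_utf16_units("A\U00010400")  # U+10400 DESERET CAPITAL LETTER LONG I
print([hex(u) for u in units])
print(narrow_isupper(units))
```

Because the check only reads the buffer and returns True/False, there is no
write position to keep in sync with the read position, which is the part that
makes lower/upper/etc. harder.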

Best Regards,
Ezio Melotti

[0]: for lower/upper/title it should be possible to modify the string in
place, because these operations never convert a non-BMP char to a BMP one
(and vice versa), so if two surrogates are read, two surrogates will be
written after the transformation.  I'm not sure this will work with all the
methods though (e.g. str.translate).
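
A small Python sketch of the idea in [0], under the same assumptions as the
thread (a buffer of UTF-16 code units, case mappings that keep the number of
code units unchanged). The helper names to_units, from_units and upper_in_place
are hypothetical, and the length check is my own guard, not something the
existing macros do. It only spot-checks one non-BMP character and does not
prove the general claim.

```python
def to_units(s):
    """UTF-16 code units of s as a list of ints."""
    data = s.encode("utf-16-le", "surrogatepass")
    return [int.from_bytes(data[i:i + 2], "little") for i in range(0, len(data), 2)]


def from_units(units):
    """Inverse of to_units: rebuild a str from UTF-16 code units."""
    data = b"".join(u.to_bytes(2, "little") for u in units)
    return data.decode("utf-16-le", "surrogatepass")


def upper_in_place(units):
    """Uppercase a code-unit buffer in place, one code point at a time."""
    i = 0
    n = len(units)
    while i < n:
        u = units[i]
        if 0xD800 <= u <= 0xDBFF and i + 1 < n and 0xDC00 <= units[i + 1] <= 0xDFFF:
            width = 2  # two surrogates read ...
            cp = 0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00)
        else:
            width = 1
            cp = u
        mapped = to_units(chr(cp).upper())
        if len(mapped) != width:
            # Guard (assumption): refuse mappings that change the unit count,
            # since they cannot be written back over the same slots.
            raise ValueError("mapping changes the number of code units")
        units[i:i + width] = mapped  # ... so two surrogates are written back
        i += width


buf = to_units("a\U00010428b")  # U+10428 DESERET SMALL LETTER LONG I
upper_in_place(buf)
print(ascii(from_units(buf)))

# str.translate gives no such guarantee: a non-BMP char can map to a BMP one.
shrunk = "\U00010428".translate({0x10428: "a"})
print(len(to_units("\U00010428")), len(to_units(shrunk)))
```

The read and the write happen at the same index here, which corresponds to the
"do not advance the pointer in the macro" option; the translate example shows
why the same trick is not safe for every method.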