
On Wed, Aug 24, 2011 at 11:37 PM, Terry Reedy <tjreedy@udel.edu> wrote:
On 8/24/2011 1:45 PM, Victor Stinner wrote:
On 24/08/2011 02:46, Terry Reedy wrote:
I don't think that using UTF-16 with surrogate pairs is really a big
problem. A lot of work has been done to hide this. For example, repr(chr(0x10ffff)) now displays '\U0010ffff' instead of two characters. Ezio recently fixed the str.is*() methods in Python 3.2+.
I greatly appreciate that he did. The * (lower, upper, title) methods apparently are not fixed yet, as the corresponding new tests are currently skipped for narrow builds.
There are two reasons for this:

1) The str.is* methods take the string and return True/False, so it's enough to iterate over the string, combine the surrogates, and check whether the result islower/upper/etc. Methods like lower/upper/etc., afaiu, currently get only a copy of the string and modify it in place. The current macros advance to the next char during both reading and writing, so it's not possible to use them to read from and write to the same string. We could either change the macros to not advance the pointer [0] (and advance it manually in the other functions, like is*) or change the functions to get the original string too.

2) I'm on vacation.

Best Regards,
Ezio Melotti

[0]: for lower/upper/title it should be possible to modify the string in place, because these operations never convert a non-BMP char to a BMP one (or vice versa), so if two surrogates are read, two surrogates will be written after the transformation. I'm not sure this will work with all the methods though (e.g. str.translate).
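
To make the two approaches above concrete, here is a rough Python sketch that models a narrow-build string as a list of UTF-16 code units. It is not the CPython C code or its macros: the helper names (utf16_units, read_code_point, encode_code_point, units_islower, upper_in_place) are all made up for illustration, and the islower logic only approximates what str.islower() does. The reader returns a code point plus its width and leaves the index alone, which mirrors the "don't advance the pointer in the macro" option.

```python
def utf16_units(s):
    """Encode s as a list of UTF-16 code units (hypothetical helper)."""
    data = s.encode("utf-16-le")
    return [data[i] | (data[i + 1] << 8) for i in range(0, len(data), 2)]


def read_code_point(units, i):
    """Return (code_point, width) at index i without advancing i."""
    u = units[i]
    if 0xD800 <= u <= 0xDBFF and i + 1 < len(units):
        low = units[i + 1]
        if 0xDC00 <= low <= 0xDFFF:
            # Combine a high/low surrogate pair into one code point.
            return 0x10000 + ((u - 0xD800) << 10) + (low - 0xDC00), 2
    # BMP character, or a lone surrogate passed through unchanged.
    return u, 1


def encode_code_point(cp):
    """Return the one or two UTF-16 code units that encode cp."""
    if cp < 0x10000:
        return [cp]
    cp -= 0x10000
    return [0xD800 + (cp >> 10), 0xDC00 + (cp & 0x3FF)]


def units_islower(units):
    """Read-only check: combine surrogates, then test each code point."""
    i = 0
    cased = False
    while i < len(units):
        cp, width = read_code_point(units, i)
        ch = chr(cp)
        if ch.isupper() or ch.istitle():
            return False
        if ch.islower():
            cased = True
        i += width  # the caller advances, not the reader
    return cased


def upper_in_place(units):
    """Uppercase in place, assuming the width of each char is preserved.

    Simplification: mappings that expand to several characters, or that
    would change the number of code units, are left untouched.
    """
    i = 0
    while i < len(units):
        cp, width = read_code_point(units, i)
        up = chr(cp).upper()
        if len(up) == 1:
            new = encode_code_point(ord(up))
            if len(new) == width:
                units[i:i + width] = new
        i += width
```

For example, U+10428 (a Deseret small letter) is stored as the pair 0xD801 0xDC28, and upper_in_place rewrites it to 0xD801 0xDC00 (U+10400) in the same two slots, so reading and writing never get out of step. A method like str.translate can map a character to anything, so the same-width assumption would not hold there.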