<div class="gmail_quote">On Wed, Aug 24, 2011 at 11:37 PM, Terry Reedy <span dir="ltr"><<a href="mailto:tjreedy@udel.edu">tjreedy@udel.edu</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">
<div class="im">On 8/24/2011 1:45 PM, Victor Stinner wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
On 24/08/2011 02:46, Terry Reedy wrote:<br>
</blockquote>
<br>
</div><div class="im"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
I don't think that using UTF-16 with surrogate pairs is really a big<br>
problem. A lot of work has been done to hide this. For example,<br>
repr(chr(0x10ffff)) now displays '\U0010ffff' instead of two characters.<br>
Ezio recently fixed the str.is*() methods in Python 3.2+.<br>
</blockquote>
<br></div>
I greatly appreciate that he did. The * (lower,upper,title) methods apparently are not fixed yet as the corresponding new tests are currently skipped for narrow builds.</blockquote><div><br>There are two reasons for this:<br>
1) the str.is* methods take the string and return True/False, so it's
enough to iterate over the string, combine the surrogates, and check
whether the resulting code point is lower/upper/etc.<br>Methods like
lower/upper/etc., as far as I understand, currently get only a copy of
the string and modify that copy in place. The current macros advance to
the next char during both reading and writing, so it's not possible to
use them to read from and write to the same string. We could either
change the macros to not advance the pointer [0] (and advance it
manually in the other functions, like is*) or change the functions to
get the original string too.<br>
2) I'm on vacation.<br><br>Best Regards,<br>Ezio Melotti<br><br>[0]: for
lower/upper/title it should be possible to modify the string in place,
because these operations never convert a non-BMP char to a BMP one (and
vice versa), so if two surrogates are read, two surrogates will be
written after the transformation. I'm not sure this will work with all
the methods though (e.g. str.translate).<br></div></div>
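The two approaches discussed above can be illustrated with a rough Python sketch. It simulates a narrow build by holding UTF-16 code units in a list of integers. The helper names (to_utf16_units, iter_code_points, all_alpha, lower_in_place) are hypothetical and are not CPython's macros or functions; the real implementation is in C. The in-place version relies on the assumption stated in footnote [0] that case conversion never changes the number of code units, and it simply leaves a character untouched when that assumption does not hold.

```python
def to_utf16_units(s):
    """Return the UTF-16 code units of s, as a narrow build would store them."""
    data = s.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little") for i in range(0, len(data), 2)]


def iter_code_points(units):
    """Yield (start_index, width, code_point), joining valid surrogate pairs."""
    i, n = 0, len(units)
    while i < n:
        u = units[i]
        if 0xD800 <= u <= 0xDBFF and i + 1 < n and 0xDC00 <= units[i + 1] <= 0xDFFF:
            # High surrogate followed by low surrogate: one non-BMP code point.
            cp = 0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            yield i, 2, cp
            i += 2
        else:
            # BMP character, or a lone surrogate passed through unchanged.
            yield i, 1, u
            i += 1


def all_alpha(units):
    """is*-style check: only reads the units and returns a bool."""
    return len(units) > 0 and all(
        chr(cp).isalpha() for _, _, cp in iter_code_points(units)
    )


def lower_in_place(units):
    """Lower-case the units in place, using one-to-one mappings only."""
    for start, width, cp in iter_code_points(units):
        lowered = chr(cp).lower()
        if len(lowered) != 1:
            continue  # mapping expands to several code points: leave as-is
        new_cp = ord(lowered)
        new_width = 2 if new_cp > 0xFFFF else 1
        if new_width != width:
            continue  # would violate the footnote [0] assumption: leave as-is
        if width == 2:
            # Write back two surrogates where two surrogates were read.
            v = new_cp - 0x10000
            units[start] = 0xD800 + (v >> 10)
            units[start + 1] = 0xDC00 + (v & 0x3FF)
        else:
            units[start] = new_cp


# "A" followed by U+10400 (DESERET CAPITAL LETTER LONG I), a non-BMP letter.
units = to_utf16_units("A\U00010400")
print(all_alpha(units))
lower_in_place(units)
print([hex(u) for u in units])
```

Writing is safe here because each write only touches positions the reader has already consumed, which is the property the footnote depends on. A method such as str.translate, which may map a BMP char to a non-BMP one, would not satisfy it.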