<div class="gmail_quote">On Wed, Aug 24, 2011 at 11:37 PM, Terry Reedy <span dir="ltr">&lt;<a href="mailto:tjreedy@udel.edu">tjreedy@udel.edu</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">

<div class="im">On 8/24/2011 1:45 PM, Victor Stinner wrote:<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

On 24/08/2011 02:46, Terry Reedy wrote:<br>

</blockquote>

<br>

</div><div class="im"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

I don&#39;t think that using UTF-16 with surrogate pairs is really a big<br>

problem. A lot of work has been done to hide this. For example,<br>

repr(chr(0x10ffff)) now displays &#39;\U0010ffff&#39; instead of two characters.<br>

Ezio recently fixed the str.is*() methods in Python 3.2+.<br>

</blockquote>

<br></div>

I greatly appreciate that he did. The * (lower, upper, title) methods apparently are not fixed yet, as the corresponding new tests are currently skipped on narrow builds.</blockquote><div><br>There are two reasons for this:<br>


1) The str.is* methods take the string and return True/False, so it&#39;s enough to iterate over the string, combine the surrogates, and check whether each resulting code point is lower/upper/etc.<br>
Methods like lower/upper/etc., as far as I understand, currently receive only a copy of the string and modify that copy in place. The current macros advance to the next char on both reads and writes, so they can&#39;t be used to read from and write to the same string. We could either change the macros so they don&#39;t advance the pointer [0] (and advance it manually in the other functions, like is*), or change the functions to receive the original string too.<br>

2) I&#39;m on vacation.<br><br>Best Regards,<br>Ezio Melotti<br><br>[0]: For lower/upper/title it should be possible to modify the string in place, because these operations never convert a non-BMP char to a BMP one (or vice versa), so if two surrogates are read, two surrogates will be written after the transformation. I&#39;m not sure this will work with all the methods though (e.g. str.translate).<br></div></div>
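<div><br>A minimal Python sketch of the idea in 1) and [0], simulating a narrow build with a list of UTF-16 code units. This is only an illustration, not the actual C macros: the helper names (to_utf16_units, iter_code_points, units_islower, lower_units) are made up for this example, and lower_units assumes a one-to-one case mapping.<br></div>

```python
def to_utf16_units(s):
    """Return the 16-bit code units a narrow build would store for s."""
    data = s.encode("utf-16-le", "surrogatepass")
    return [int.from_bytes(data[i:i + 2], "little")
            for i in range(0, len(data), 2)]


def iter_code_points(units):
    """Yield code points, combining each high/low surrogate pair into one."""
    i = 0
    n = len(units)
    while i < n:
        u = units[i]
        is_pair = (0xD800 <= u <= 0xDBFF and i + 1 < n
                   and 0xDC00 <= units[i + 1] <= 0xDFFF)
        if is_pair:
            yield 0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            i += 2
        else:
            # BMP character (or a lone surrogate, passed through unchanged).
            yield u
            i += 1


def units_islower(units):
    """Rough equivalent of str.islower() computed over code units."""
    seen_lower = False
    for cp in iter_code_points(units):
        ch = chr(cp)
        if ch.isupper() or ch.istitle():
            return False
        if ch.islower():
            seen_lower = True
    return seen_lower


def lower_units(units):
    """Lowercase code point by code point and re-encode as code units.

    Assumption: each mapping is one-to-one. Python's full case mappings
    can change the length for a few characters, which this sketch ignores.
    """
    out = []
    for cp in iter_code_points(units):
        out.extend(to_utf16_units(chr(cp).lower()))
    return out


if __name__ == "__main__":
    units = to_utf16_units(chr(0x10400))  # DESERET CAPITAL LETTER LONG I
    lowered = lower_units(units)
    print([hex(u) for u in units])
    print([hex(u) for u in lowered])
    print(len(units) == len(lowered), units_islower(lowered))
```

<div>In the Deseret example the input is one surrogate pair and the lowercased output is also one surrogate pair, which is the property [0] relies on for writing the result back over the same buffer.<br></div>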