<br><br><div><span class="gmail_quote">On 9/20/06, <b class="gmail_sendername">Adam Olsen</b> <<a href="mailto:email@example.com">firstname.lastname@example.org</a>> wrote:</span><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">
On 9/20/06, Guido van Rossum <<a href="mailto:email@example.com">firstname.lastname@example.org</a>> wrote:<br>> On 9/20/06, Adam Olsen <<a href="mailto:email@example.com">firstname.lastname@example.org</a>> wrote:<br>> > Before we can decide on the internal representation of our unicode
<br>> > objects, we need to decide on their external interface. My thoughts<br>> > so far:<br>><br>> Let me cut this short. The external string API in Py3k should not<br>> change or only very marginally so (like removing rarely used useless
<br>> APIs or adding a few new conveniences). The plan is to keep the 2.x<br>> API that is supported (in 2.x) by both str and unicode, but merge the<br>> two string types into one. Anything else could be done just as easily
<br>> before or after Py3k.<br><br>Thanks, but one thing remains unclear: is the indexing intended to<br>represent bytes, code points, or code units? Note that C code<br>operating on UTF-16 would use code units for slicing of UTF-16, which
<br>splits surrogate pairs.</blockquote><div><br>Assuming my Unicode lingo is right and a code point represents a letter/character/digraph/whatever, then indexing will be by code point. In one of my rare attempts at channeling Guido, I *really* doubt he wants to expose the technical details of Unicode to the point where people need to realize that UTF-8 takes two bytes to represent "ö". If you want that kind of exposure, use the bytes type. Otherwise, assume the users will be people ignorant of Unicode who want something that works the way they are used to from working with ASCII.
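<br><br>To make the distinction concrete, here is a minimal sketch of how code points, UTF-8 bytes, and UTF-16 code units differ for the same text. It is an illustration rather than anything stated in the thread, and it assumes Python 3.3 or later, where str is indexed by code point; the variable names are made up for the example.

```python
# Illustrative sketch only; assumes Python 3.3+ (str indexed by code point).
text = "\u00f6"  # "ö" as a single precomposed code point (U+00F6)

# len() and indexing on str count code points, not bytes.
n_code_points = len(text)

# Encoding to bytes exposes the UTF-8 representation: two bytes for "ö".
utf8 = text.encode("utf-8")
n_utf8_bytes = len(utf8)

# A character outside the Basic Multilingual Plane is one code point,
# but UTF-16 needs a surrogate pair (two 16-bit code units) to store it.
clef = "\U0001d11e"  # MUSICAL SYMBOL G CLEF
n_clef_code_points = len(clef)
n_clef_utf16_units = len(clef.encode("utf-16-le")) // 2

print(n_code_points, n_utf8_bytes, n_clef_code_points, n_clef_utf16_units)
```

Slicing the UTF-16 encoding of the second string at a code-unit boundary would split the surrogate pair, which is the concern raised in the quoted question; slicing the str itself does not.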