<div class="gmail_quote">On Fri, Aug 26, 2011 at 1:54 AM, Guido van Rossum <span dir="ltr">&lt;<a href="mailto:guido@python.org">guido@python.org</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">

<div class="im">On Wed, Aug 24, 2011 at 3:06 AM, Terry Reedy &lt;<a href="mailto:tjreedy@udel.edu">tjreedy@udel.edu</a>&gt; wrote:<br>

&gt; Excuse me for believing the fine 3.2 manual that says<br>

&gt; &quot;Strings contain Unicode characters.&quot; (And to a naive reader, that implies<br>

&gt; that string iteration and indexing should produce Unicode characters.)<br>

<br>

</div>The naive reader also doesn&#39;t know the difference between characters,<br>

code points and code units. It&#39;s the advanced, Unicode-aware reader<br>

who is confused by this phrase in the docs. It should say code units;<br>

or perhaps code units for narrow builds and code points for wide<br>

builds.</blockquote><div><br>For UTF-16/32 (i.e. narrow/wide builds), talking about &quot;code units&quot;[0] should be correct.  Also note that:<br>  * in both, every &quot;code unit&quot; corresponds to a specific &quot;code point&quot; (including lone surrogates), so it might be OK to talk about &quot;code points&quot; too, but<br>

  * only in wide builds is every &quot;code point&quot; represented by a single 32-bit &quot;code unit&quot;.  In narrow builds, non-BMP characters are represented by a &quot;code unit sequence&quot; of two elements (i.e. a &quot;surrogate pair&quot;).<br>

<br>
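As a rough illustration of the two points above: current Pythons no longer have narrow and wide builds, so this sketch uses the utf-16-le and utf-32-le codecs as stand-ins for the two internal representations. The helper name code_units is made up for this example.<br>
<pre>
```python
def code_units(text, encoding, unit_size):
    """Split the BOM-less little-endian encoding of text into integer code units."""
    data = text.encode(encoding)
    return [int.from_bytes(data[i:i + unit_size], "little")
            for i in range(0, len(data), unit_size)]

bmp = "\u00e9"          # U+00E9, inside the BMP
non_bmp = "\U0001F600"  # U+1F600, outside the BMP

utf16_bmp = code_units(bmp, "utf-16-le", 2)
utf16_non_bmp = code_units(non_bmp, "utf-16-le", 2)
utf32_non_bmp = code_units(non_bmp, "utf-32-le", 4)

print([hex(u) for u in utf16_bmp])      # ['0xe9']
print([hex(u) for u in utf16_non_bmp])  # ['0xd83d', '0xde00']
print([hex(u) for u in utf32_non_bmp])  # ['0x1f600']
```
</pre>
The non-BMP character takes one 32-bit code unit in UTF-32 but a two-element code unit sequence (a surrogate pair) in UTF-16.<br>
<br>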

Since &quot;code unit&quot; refers to the *minimal* bit combination, in UTF-8 characters that need 2/3/4 bytes are represented by a &quot;code unit sequence&quot; made of 2/3/4 &quot;code units&quot; (so in UTF-8 &quot;code units&quot; and &quot;code points&quot; coincide only for the ASCII range).<br> </div><blockquote class="gmail_quote" style="margin: 0pt 0pt 0pt 0.8ex; border-left: 1px solid rgb(204, 204, 204); padding-left: 1ex;"> With PEP 393 we can unconditionally say code points, which is<br>


much better. We should try to remove our use of &quot;characters&quot; -- or<br>

else we should *define* our use of the term &quot;characters&quot; as &quot;what the<br>

Unicode standard calls code points&quot;.<br></blockquote><div><br>&quot;Character&quot; usually works fine, especially for naive readers.  Even Unicode-aware readers often confuse the various terms, so using a simple term and pointing to a more accurate description sounds like a better idea to me.<br>

</div><div><br>Note that there&#39;s also another important term[1]:<br>&quot;&quot;&quot;<br><em><a name="unicode_scalar_value">Unicode Scalar Value</a></em>. Any Unicode <i>

            <a href="http://unicode.org/glossary/#code_point">code point</a></i> except high-surrogate 

            and low-surrogate code points. In other words, the ranges of 

            integers 0 to D7FF<sub>16</sub> and E000<sub>16</sub> to 10FFFF<sub>16</sub> 

inclusive.<br>&quot;&quot;&quot;<br>For example, the UTF codecs produce sequences of &quot;code units&quot; (of 8, 16, or 32 bits) that represent &quot;scalar values&quot;[2][3].<br><br>Chapter 3 [4] says:<br>&quot;&quot;&quot;<br>

</div></div>3.9 Unicode Encoding Forms<br>The Unicode Standard supports three character encoding forms: UTF-32, UTF-16, and UTF-8. Each encoding form maps the Unicode code points U+0000..U+D7FF and U+E000..U+10FFFF to unique code unit sequences. [...]<br>

 D76 Unicode scalar value: Any Unicode code point except high-surrogate and low-surrogate code points.<br>     • As a result of this definition, the set of Unicode scalar values consists of the ranges 0 to D7FF and E000 to 10FFFF, inclusive.<br>

 D77 Code unit: The minimal bit combination that can represent a unit of encoded text for processing or interchange.<br>[...]<br> D79 A Unicode encoding form assigns each Unicode scalar value to a unique code unit sequence.<br>

&quot;&quot;&quot;<br><br>On the other hand, Python Unicode strings are not limited to scalar values, because they can also contain lone surrogates.<br><br><br>I hope this helps clarify the terminology a bit and doesn&#39;t add more confusion, but if we want to use the Unicode terms we should get them right.  (Also note that I might have misunderstood something, even though I&#39;ve been careful with the terms and have double-checked and quoted the relevant parts of the Unicode standard.)<br>
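<br>A small sketch of the D76 definition quoted above and of the lone-surrogate case. The function name is_scalar_value is made up for this example and is not an existing Python API.<br>
<pre>
```python
def is_scalar_value(code_point):
    """True for 0..0xD7FF and 0xE000..0x10FFFF, i.e. any non-surrogate code point."""
    return (code_point in range(0x0000, 0xD800)
            or code_point in range(0xE000, 0x110000))

# A str may hold a lone high surrogate between two ordinary characters.
s = "a\ud800b"
flags = [is_scalar_value(ord(ch)) for ch in s]

print(len(s), flags)  # 3 [True, False, True]
```
</pre>
The string has three elements, and the middle one is a code point that is not a scalar value.<br>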

<br>Best Regards,<br>Ezio Melotti<br><br><br>[0]: From Chapter 3 [4],<br> D77 Code unit: The minimal bit combination that can represent a unit of encoded text for processing or interchange.<br>   • Code units are particular units of computer storage. Other character encoding standards typically use code units defined as 8-bit units—that is, octets.<br>

     The Unicode Standard uses 8-bit code units in the UTF-8 encoding form, 16-bit code units in the UTF-16 encoding form, and 32-bit code units in the UTF-32 encoding form.<br>[1]: <a href="http://unicode.org/glossary/#unicode_scalar_value">http://unicode.org/glossary/#unicode_scalar_value</a><br>

[2]: Apparently Python 3 raises an error when encoding lone surrogates in UTF-8, but it doesn&#39;t for UTF-16 and UTF-32.<br>From Chapter 3 [4],<br> D91: &quot;Because surrogate code points are not Unicode scalar values, isolated UTF-16 code units in the range 0xD800..0xDFFF are ill-formed.&quot;<br>

 D92: &quot;Because surrogate code points are not included in the set of Unicode scalar values, UTF-32 code units in the range 0x0000D800..0x0000DFFF are ill-formed.&quot;<br>I think this should be fixed.<br>[3]: Note that I&#39;m talking here about the codecs used to encode/decode Unicode strings to/from bytes; it&#39;s perfectly fine for Python itself to represent lone surrogates in its *internal* representations, regardless of what encoding it&#39;s using.<br>

[4]: Chapter 3: <a href="http://www.unicode.org/versions/Unicode6.0.0/ch03.pdf">http://www.unicode.org/versions/Unicode6.0.0/ch03.pdf</a>
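<br><br>A quick way to check the behaviour described in [2] on a given interpreter. The helper try_encode is made up for this example. The outcome for UTF-16 and UTF-32 may differ between Python versions, so the sketch only reports it. The &quot;surrogatepass&quot; error handler is the documented way to let the UTF-8 codec encode surrogate code points anyway.<br>
<pre>
```python
def try_encode(text, encoding):
    """Return the encoded bytes, or the exception name if the codec refuses."""
    try:
        return text.encode(encoding)
    except UnicodeEncodeError as exc:
        return type(exc).__name__

lone = "\ud800"  # a lone high surrogate

results = {enc: try_encode(lone, enc)
           for enc in ("utf-8", "utf-16-le", "utf-32-le")}
for enc, outcome in results.items():
    print(enc, outcome)

# With surrogatepass the UTF-8 codec emits the three-byte form of U+D800.
passed = lone.encode("utf-8", "surrogatepass")
roundtrip = passed.decode("utf-8", "surrogatepass")
```
</pre>
The strict UTF-8 codec refuses the lone surrogate, while the surrogatepass round trip gives back the original string.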