<div class="gmail_quote">On Fri, Aug 26, 2011 at 1:54 AM, Guido van Rossum <span dir="ltr"><<a href="mailto:guido@python.org">guido@python.org</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">
<div class="im">On Wed, Aug 24, 2011 at 3:06 AM, Terry Reedy <<a href="mailto:tjreedy@udel.edu">tjreedy@udel.edu</a>> wrote:<br>
> Excuse me for believing the fine 3.2 manual that says<br>
> "Strings contain Unicode characters." (And to a naive reader, that implies<br>
> that string iteration and indexing should produce Unicode characters.)<br>
<br>
</div>The naive reader also doesn't know the difference between characters,<br>
code points and code units. It's the advanced, Unicode-aware reader<br>
who is confused by this phrase in the docs. It should say code units;<br>
or perhaps code units for narrow builds and code points for wide<br>
builds.</blockquote><div><br>For UTF-16/32 (i.e. narrow/wide builds), talking about "code units"[0] should be correct. Also note that:<br>  * in both, every "code unit" has a specific "code point" (including lone surrogates), so it might be OK to talk about "code points" too, but<br>
  * only in wide builds is every "code point" represented by a single 32-bit "code unit". In narrow builds, non-BMP characters are represented by a "code unit sequence" of two elements (i.e. a "surrogate pair").<br>
<br>
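A minimal sketch of the point above, assuming Python 3 and using U+1D11E as an arbitrary non-BMP example; the helper name code_unit_count is made up for illustration. It counts the code units one code point needs in each encoding form (UTF-8 is included only for comparison):<br>
<pre>
```python
# U+1D11E MUSICAL SYMBOL G CLEF lies outside the BMP.
ch = "\U0001D11E"

def code_unit_count(text, encoding, unit_bytes):
    """Number of code units `text` occupies in the given encoding form."""
    return len(text.encode(encoding)) // unit_bytes

# The "-le" codec variants are used so that no BOM is prepended.
utf8_units = code_unit_count(ch, "utf-8", 1)       # 8-bit code units
utf16_units = code_unit_count(ch, "utf-16-le", 2)  # 16-bit code units
utf32_units = code_unit_count(ch, "utf-32-le", 4)  # 32-bit code units

print(utf8_units, utf16_units, utf32_units)  # 4 2 1
```
</pre>
The two UTF-16 code units are the surrogate pair; only UTF-32 gives a one-to-one match between code units and code points.<br>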
Since "code unit" refers to the *minimal* bit combination, in UTF-8 a character that needs 2/3/4
bytes is represented by a "code unit sequence" made of 2/3/4 "code units" (so in UTF-8 "code units" and
"code points" coincide only for the ASCII range).<br> </div><blockquote class="gmail_quote" style="margin: 0pt 0pt 0pt 0.8ex; border-left: 1px solid rgb(204, 204, 204); padding-left: 1ex;"> With PEP 393 we can unconditionally say code points, which is<br>
much better. We should try to remove our use of "characters" -- or<br>
else we should *define* our use of the term "characters" as "what the<br>
Unicode standard calls code points".<br></blockquote><div><br>"Character" usually works fine, especially for naive readers. Even Unicode-aware readers often confuse the various terms, so using a simple term and pointing to a more accurate description sounds like a better idea to me.<br>
</div><div><br>Note that there's also another important term[1]:<br>"""<br><em><a name="unicode_scalar_value">Unicode Scalar Value</a></em>. Any Unicode <i>
<a href="http://unicode.org/glossary/#code_point">code point</a></i> except high-surrogate
and low-surrogate code points. In other words, the ranges of
integers 0 to D7FF<sub>16</sub> and E000<sub>16</sub> to 10FFFF<sub>16</sub>
inclusive.<br>"""<br>For example, the UTF codecs produce sequences of "code units" (of 8, 16, or 32 bits) that represent "scalar values"[2][3].<br><br>Chapter 3 [4] says:<br>"""<br>
</div></div>3.9 Unicode Encoding Forms<br>The Unicode Standard supports three character encoding forms: UTF-32, UTF-16, and UTF-8. Each encoding form maps the Unicode code points U+0000..U+D7FF and U+E000..U+10FFFF to unique code unit sequences. [...]<br>
D76 Unicode scalar value: Any Unicode code point except high-surrogate and low-surrogate code points.<br> • As a result of this definition, the set of Unicode scalar values consists of the ranges 0 to D7FF and E000 to 10FFFF, inclusive.<br>
D77 Code unit: The minimal bit combination that can represent a unit of encoded text for processing or interchange.<br>[...]<br> D79 A Unicode encoding form assigns each Unicode scalar value to a unique code unit sequence.<br>
"""<br><br>On the other hand, Python Unicode strings are not limited to scalar values, because they can also contain lone surrogates.<br><br><br>I hope this helps clarify the terminology a bit and doesn't add more confusion, but if we want to use the Unicode terms we should get them right. (Also note that I might have misunderstood something, even if I've been careful with the terms and I double-checked and quoted the relevant parts of the Unicode standard.)<br>
<br>Best Regards,<br>Ezio Melotti<br><br><br>[0]: From Chapter 3 [4],<br> D77 Code unit: The minimal bit combination that can represent a unit of encoded text for processing or interchange.<br> • Code units are particular units of computer storage. Other character encoding standards typically use code units defined as 8-bit units—that is, octets.<br>
The Unicode Standard uses 8-bit code units in the UTF-8 encoding form, 16-bit code units in the UTF-16 encoding form, and 32-bit code units in the UTF-32 encoding form.<br>[1]: <a href="http://unicode.org/glossary/#unicode_scalar_value">http://unicode.org/glossary/#unicode_scalar_value</a><br>
[2]: Apparently Python 3 raises an error when encoding lone surrogates in UTF-8, but it doesn't for UTF-16 and UTF-32.<br>From Chapter 3 [4],<br> D91: "Because surrogate code points are not Unicode scalar values, isolated UTF-16 code units in the range 0xD800..0xDFFF are ill-formed."<br>
D92: "Because surrogate code points are not included in the set of Unicode scalar values, UTF-32 code units in the range 0x0000D800..0x0000DFFF are ill-formed."<br>I think this should be fixed.<br>[3]: Note that I'm talking about codecs used to encode/decode Unicode strings to/from bytes here, it's perfectly fine for Python itself to represent lone surrogates in its *internal* representations, regardless of what encoding it's using.<br>
[4]: Chapter 3: <a href="http://www.unicode.org/versions/Unicode6.0.0/ch03.pdf">http://www.unicode.org/versions/Unicode6.0.0/ch03.pdf</a>
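<br><br>A small illustration of footnotes [2] and [3], as a sketch only: it assumes Python 3.3 or later (so that indexing yields code points), and is_scalar_value is a hypothetical helper written for this example, not an existing API. It shows that a str can hold a lone surrogate internally while the UTF-8 codec, with the default error handler, refuses to encode it:<br>
<pre>
```python
def is_scalar_value(code_point):
    """True for 0..D7FF and E000..10FFFF, per definition D76."""
    return (code_point in range(0x0000, 0xD800)
            or code_point in range(0xE000, 0x110000))

# A str can hold a lone (unpaired) high surrogate internally.
text = "a\ud800b"
non_scalar = [i for i, ch in enumerate(text)
              if not is_scalar_value(ord(ch))]

# The UTF-8 codec rejects it with the default "strict" error handler.
try:
    text.encode("utf-8")
    utf8_rejected = False
except UnicodeEncodeError:
    utf8_rejected = True

print(non_scalar, utf8_rejected)  # [1] True
```
</pre>
The UTF-16 and UTF-32 codecs are deliberately left out of the sketch, since their behaviour is exactly what footnote [2] questions.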