[Python-Dev] PEP 393 Summer of Code Project

Fri Aug 26 03:40:33 CEST 2011

On Fri, Aug 26, 2011 at 1:54 AM, Guido van Rossum <guido at python.org> wrote:

> On Wed, Aug 24, 2011 at 3:06 AM, Terry Reedy <tjreedy at udel.edu> wrote:
> > Excuse me for believing the fine 3.2 manual that says
> > "Strings contain Unicode characters." (And to a naive reader, that
> > implies that string iteration and indexing should produce Unicode
> > characters.)
>
> The naive reader also doesn't know the difference between characters,
> code points and code units. It's the advanced, Unicode-aware reader
> who is confused by this phrase in the docs. It should say code units;
> or perhaps code units for narrow builds and code points for wide
> builds.

For UTF-16/32 (i.e. narrow/wide builds), talking about "code units"[0]
should be correct.  Also note that:
  * in both, every "code unit" has a specific "code point" (including lone
surrogates), so it might be OK to talk about "code points" too, but
  * only in wide builds is every "code point" represented by a single
32-bit "code unit".  In narrow builds, non-BMP chars are represented by a
"code unit sequence" of two elements (i.e. a "surrogate pair").
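
To make the "surrogate pair" point concrete, here is a minimal sketch of the
arithmetic UTF-16 uses to split a non-BMP code point into two 16-bit code
units.  The function name to_surrogate_pair is mine, chosen for illustration;
it is not part of Python or of the Unicode standard.

```python
def to_surrogate_pair(code_point):
    """Split a non-BMP code point into its UTF-16 high and low surrogates.

    Hypothetical helper for illustration only.
    """
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above the BMP need a surrogate pair")
    offset = code_point - 0x10000       # 20-bit value
    high = 0xD800 + (offset >> 10)      # top 10 bits go into the high surrogate
    low = 0xDC00 + (offset & 0x3FF)     # bottom 10 bits go into the low surrogate
    return high, low


high, low = to_surrogate_pair(0x1F600)
print(hex(high), hex(low))  # 0xd83d 0xde00
```

On a narrow build these two code units are what indexing into the string
would hand back, one at a time.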

Since "code unit" refers to the *minimal* bit combination, in UTF-8 a
character that needs 2/3/4 bytes is represented by a "code unit sequence"
made of 2/3/4 "code units" (so in UTF-8 "code units" and "code points"
coincide only for the ASCII range).
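
As a rough illustration of the above, the following sketch counts how many
code units one code point takes in each encoding form.  It only uses
str.encode(); the helper name code_unit_counts is an assumption of mine, and
the "-le" codec variants are used so that no byte order mark is counted.

```python
def code_unit_counts(text):
    """Return the number of code units text needs in each encoding form.

    Hypothetical helper for illustration only.
    """
    return {
        "UTF-8": len(text.encode("utf-8")),            # 8-bit code units
        "UTF-16": len(text.encode("utf-16-le")) // 2,  # 16-bit code units
        "UTF-32": len(text.encode("utf-32-le")) // 4,  # 32-bit code units
    }


# ASCII, Latin-1 range, rest of the BMP, non-BMP
for ch in ("a", "\u00e9", "\u20ac", "\U0001F600"):
    print("U+%04X" % ord(ch), code_unit_counts(ch))
```

Only the ASCII character needs a single code unit in all three forms; the
non-BMP one needs 4, 2 and 1 respectively.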

> With PEP 393 we can unconditionally say code points, which is
> much better. We should try to remove our use of "characters" -- or
> else we should *define* our use of the term "characters" as "what the
> Unicode standard calls code points".
>

"Character" usually works fine, especially for naive readers.  Even
Unicode-aware readers often confuse the several terms, so using a
simple term and pointing to a more accurate description sounds like a better
idea to me.

Note that there's also another important term[1]:
"""
*Unicode Scalar Value*. Any Unicode code point
<http://unicode.org/glossary/#code_point> except high-surrogate and
low-surrogate code points. In other words, the ranges of integers 0 to
D7FF₁₆ and E000₁₆ to 10FFFF₁₆ inclusive.
"""
For example, the UTF codecs produce sequences of "code units" (of 8, 16, or
32 bits) that represent "scalar values"[2][3].

Chapter 3 [4] says:
"""
3.9 Unicode Encoding Forms
The Unicode Standard supports three character encoding forms: UTF-32,
UTF-16, and UTF-8. Each encoding form maps the Unicode code points
U+0000..U+D7FF and U+E000..U+10FFFF to unique code unit sequences. [...]
 D76 Unicode scalar value: Any Unicode code point except high-surrogate and
low-surrogate code points.
     • As a result of this definition, the set of Unicode scalar values
consists of the ranges 0 to D7FF and E000 to 10FFFF, inclusive.
 D77 Code unit: The minimal bit combination that can represent a unit of
encoded text for processing or interchange.
[...]
 D79 A Unicode encoding form assigns each Unicode scalar value to a unique
code unit sequence.
"""

On the other hand, Python Unicode strings are not limited to scalar values,
because they can also contain lone surrogates.
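
A small sketch of that distinction, assuming Python 3: the predicate below
just restates definition D76, and its name is_scalar_value is made up for
this example.

```python
def is_scalar_value(code_point):
    """True for any code point outside the surrogate range (per D76).

    Hypothetical helper for illustration only.
    """
    return 0 <= code_point <= 0x10FFFF and not 0xD800 <= code_point <= 0xDFFF


s = "a\ud800b"  # a str holding a lone high surrogate
print(len(s))                                  # 3
print([is_scalar_value(ord(ch)) for ch in s])  # [True, False, True]
```

The string is a perfectly valid Python object even though its middle element
is not a scalar value.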

I hope this helps clarify the terminology a bit and doesn't add more
confusion, but if we want to use the Unicode terms we should get them
right.  (Also note that I might have misunderstood something, even if I've
been careful with the terms and I double-checked and quoted the relevant
parts of the Unicode standard.)

Best Regards,
Ezio Melotti

[0]: From chapter 3 [4]:
 D77 Code unit: The minimal bit combination that can represent a unit of
encoded text for processing or interchange.
   • Code units are particular units of computer storage. Other character
encoding standards typically use code units defined as 8-bit units—that is,
octets.
     The Unicode Standard uses 8-bit code units in the UTF-8 encoding form,
16-bit code units in the UTF-16 encoding form, and 32-bit code units in the
UTF-32 encoding form.
[1]: http://unicode.org/glossary/#unicode_scalar_value
[2]: Apparently Python 3 raises an error while encoding lone surrogates in
UTF-8, but it doesn't for UTF-16 and UTF-32.
From chapter 3 [4]:
 D91: "Because surrogate code points are not Unicode scalar values, isolated
UTF-16 code units in the range 0xD800..0xDFFF are ill-formed."
 D92: "Because surrogate code points are not included in the set of Unicode
scalar values, UTF-32 code units in the range 0x0000D800..0x0000DFFF are
ill-formed."
I think this should be fixed.
[3]: Note that I'm talking about the codecs used to encode/decode Unicode
strings to/from bytes here; it's perfectly fine for Python itself to
represent lone surrogates in its *internal* representations, regardless of
what encoding it's using.
[4]: Chapter 3: http://www.unicode.org/versions/Unicode6.0.0/ch03.pdf
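
Regarding note [2], here is a quick way to check what a given interpreter
does with a lone surrogate.  The outcome for UTF-16 and UTF-32 depends on
the Python version, so the sketch only reports it rather than assuming it;
the helper name try_encode is hypothetical.  The "surrogatepass" error
handler is a documented way to ask for the lone surrogate to be encoded
anyway.

```python
def try_encode(text, encoding, errors="strict"):
    """Return the encoded bytes, or a short message if encoding fails.

    Hypothetical helper for illustration only.
    """
    try:
        return text.encode(encoding, errors)
    except UnicodeEncodeError as exc:
        return "UnicodeEncodeError: %s" % exc.reason


lone = "\ud800"  # a lone high surrogate

# Result for utf-16/utf-32 varies between Python versions.
for enc in ("utf-8", "utf-16-le", "utf-32-le"):
    print(enc, try_encode(lone, enc))

# Explicitly allow the surrogate through the UTF-8 codec.
print(try_encode(lone, "utf-8", "surrogatepass"))  # b'\xed\xa0\x80'
```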