[Python-Dev] PEP 393 Summer of Code Project

Thu Aug 25 04:33:51 CEST 2011

On Wed, Aug 24, 2011 at 5:36 PM, Stephen J. Turnbull <stephen at xemacs.org> wrote:
> Guido van Rossum writes:
>
>  > I see nothing wrong with having the language's fundamental data types
>  > (i.e., the unicode object, and even the re module) to be defined in
>  > terms of codepoints, not characters, and I see nothing wrong with
>  > len() returning the number of codepoints (as long as it is advertised
>  > as such).
>
> In fact, the Unicode Standard, Version 6, goes farther (to code units):
>
>    2.7  Unicode Strings
>
>    A Unicode string data type is simply an ordered sequence of code
>    units. Thus a Unicode 8-bit string is an ordered sequence of 8-bit
>    code units, a Unicode 16-bit string is an ordered sequence of
>    16-bit code units, and a Unicode 32-bit string is an ordered
>    sequence of 32-bit code units.
>
>    Depending on the programming environment, a Unicode string may or
>    may not be required to be in the corresponding Unicode encoding
>    form. For example, strings in Java, C#, or ECMAScript are Unicode
>    16-bit strings, but are not necessarily well-formed UTF-16
>    sequences.
>
> (p. 32).

I am assuming that this definition only applies to the use of the term
"Unicode string" within the standard itself and has no bearing on how
programming languages are allowed to use the term, as that would be
preposterous. (The standard can define what it means by terms such as
"well-formed" and "conforming", and I won't try to go against that.
But limiting what can be called a Unicode string feels like
unproductive coddling.)
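
To make the code point versus code unit distinction in the quoted
passage concrete, here is a rough sketch. It assumes Python 3.3 or
later (where PEP 393 is implemented and len() counts code points on all
builds); the sample string and variable names are made up for
illustration.

```python
# 'e' + COMBINING ACUTE ACCENT + one code point outside the BMP.
# A reader sees two "characters", but the string holds three code points.
s = "e\u0301\U0001F600"

# len() counts code points, not user-perceived characters.
code_points = len(s)

# Code units depend on the encoding form: 16-bit units for UTF-16,
# 8-bit units for UTF-8.
utf16_units = len(s.encode("utf-16-le")) // 2
utf8_units = len(s.encode("utf-8"))

# A str may hold a lone surrogate, so it is a sequence of code points
# that need not correspond to well-formed UTF-16. The strict encoder
# refuses to encode it.
lone = "\ud800"
try:
    lone.encode("utf-16-le")
    well_formed = True
except UnicodeEncodeError:
    well_formed = False

print(code_points, utf16_units, utf8_units, well_formed)
```

The three counts differ for the same string, which is why it matters
that len() is advertised as counting code points.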

-- 
--Guido van Rossum (python.org/~guido)