[Python-Dev] PEP 393 Summer of Code Project
Stephen J. Turnbull
stephen at xemacs.org
Thu Aug 25 02:36:14 CEST 2011
Guido van Rossum writes:
> I see nothing wrong with having the language's fundamental data types
> (i.e., the unicode object, and even the re module) to be defined in
> terms of codepoints, not characters, and I see nothing wrong with
> len() returning the number of codepoints (as long as it is advertised
> as such).
In fact, the Unicode Standard, Version 6, goes farther (to code units):
    2.7 Unicode Strings

    A Unicode string data type is simply an ordered sequence of code
    units. Thus a Unicode 8-bit string is an ordered sequence of 8-bit
    code units, a Unicode 16-bit string is an ordered sequence of
    16-bit code units, and a Unicode 32-bit string is an ordered
    sequence of 32-bit code units.

    Depending on the programming environment, a Unicode string may or
    may not be required to be in the corresponding Unicode encoding
    form. For example, strings in Java, C#, or ECMAScript are Unicode
    16-bit strings, but are not necessarily well-formed UTF-16
    sequences.

(p. 32).
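To make the code point versus code unit distinction concrete, here is a minimal sketch. It assumes a Python 3.4 or later interpreter, where len() counts code points and the "surrogatepass" error handler is available for all three UTF codecs. The helpers count_code_units and is_well_formed_utf16 are hypothetical names invented for this illustration, not part of any library.

```python
# Illustration only: both helpers are hypothetical, not stdlib APIs.

def count_code_units(text, bits):
    """Number of code units text occupies in the UTF-<bits> encoding form."""
    codec = {8: "utf-8", 16: "utf-16-le", 32: "utf-32-le"}[bits]
    # "surrogatepass" lets lone surrogates through, so strings that are
    # not well-formed can still be measured.
    return len(text.encode(codec, "surrogatepass")) // (bits // 8)


def is_well_formed_utf16(text):
    """True if text can be encoded as strict UTF-16 (no lone surrogates)."""
    try:
        text.encode("utf-16-le")
        return True
    except UnicodeEncodeError:
        return False


s = "a\u00e9\U0001F600"  # 'a', e-acute, and one code point outside the BMP
lone = "\ud800"          # an unpaired high surrogate

print(len(s))                   # code points
print(count_code_units(s, 8))   # 8-bit code units
print(count_code_units(s, 16))  # 16-bit code units
print(count_code_units(s, 32))  # 32-bit code units
print(is_well_formed_utf16(s), is_well_formed_utf16(lone))
```

The string lone is a legal one-unit "Unicode 16-bit string" in the sense of the passage above, yet it is not a well-formed UTF-16 sequence, which is the situation the Standard describes for Java, C#, and ECMAScript.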