[Python-Dev] PEP 393 Summer of Code Project

Wed Aug 31 19:20:19 CEST 2011

On Wed, Aug 31, 2011 at 1:09 AM, Glenn Linderman <v+python at g.nevcal.com> wrote:
> The str type itself can presently be used to process other
> character encodings: if they are fixed width < 32-bit elements those
> encodings might be considered Unicode encodings, but there is no requirement
> that they are, and some operations on str may operate with knowledge of some
> Unicode semantics, so there are caveats.

Actually, the str type in Python 3 and the unicode type in Python 2
are constrained everywhere to either 16-bit or 21-bit "characters".
(Except when writing C code, which can do any number of invalid things
so is the equivalent of assuming 1 == 0.) In particular, on a wide
build, there is no way to get a code point >= 2**21, and I don't want
PEP 393 to change this. So at best we can use these types to represent
arrays of 21-bit unsigned ints. But I think it is more useful to think
of them as always representing "some form of Unicode", whether that is
UTF-16 (on narrow builds) or 21-bit code points or perhaps some
vaguely similar superset -- but for those code units/code points that
are representable *and* valid (either code points or code units)
according to the (supported version of the) Unicode standard, the
meaning of those code points/units matches that of the standard.
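
As a rough illustration of the 21-bit bound, here is a minimal sketch. It
assumes CPython 3.3 or later, where PEP 393 has landed and every build
behaves like a wide build; on a narrow build of the time sys.maxunicode
would have been 0xFFFF instead. The helper is_valid_code_point is a
hypothetical name introduced only for this example.

```python
import sys
import unicodedata

# Assumption: CPython 3.3+, where sys.maxunicode is always 0x10FFFF.
MAX_CODE_POINT = 0x10FFFF


def is_valid_code_point(n):
    """Return True if chr(n) succeeds, i.e. n is in str's code point range."""
    try:
        chr(n)
    except (ValueError, OverflowError):
        return False
    return True


# The largest code point fits comfortably in 21 bits.
fits_in_21_bits = MAX_CODE_POINT < 2**21

# The top of the range is accepted; one past it, and 2**21 itself, are not.
top_ok = is_valid_code_point(MAX_CODE_POINT)
beyond_ok = is_valid_code_point(MAX_CODE_POINT + 1)
two_to_21_ok = is_valid_code_point(2**21)

# For a valid, assigned code point the meaning comes from the Unicode
# database that ships with the interpreter.
name_of_a = unicodedata.name("A")
```

The last line is the point of the paragraph above: a str element is not
just a number, it carries the meaning the Unicode standard assigns to it.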

Note that this is different from the bytes type, where the meaning of
a byte is entirely determined by what it means in the programmer's
head.
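
A small sketch of that contrast, assuming Python 3: the same two bytes
can be read as one character, two characters, or an integer, and nothing
in the bytes object itself says which reading is intended. The variable
names are illustrative only.

```python
raw = b"\xc3\xa9"

# Read as UTF-8: a single character, U+00E9.
as_utf8 = raw.decode("utf-8")

# Read as Latin-1: two characters, U+00C3 followed by U+00A9.
as_latin1 = raw.decode("latin-1")

# Read as a big-endian unsigned integer.
as_int = int.from_bytes(raw, "big")
```

Once decoded, each result is a str (or an int) with a fixed meaning; the
choice of interpretation happened entirely at the decode step.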

-- 
--Guido van Rossum (python.org/~guido)