<html>

  <head>

    <meta content="text/html; charset=UTF-8" http-equiv="Content-Type">

  </head>

  <body bgcolor="#FFFFFF" text="#330033">

    On 8/31/2011 10:20 AM, Guido van Rossum wrote:

    <blockquote

cite="mid:CAP7+vJJtYZ8vspUimoMh1j6ye6SbfT-eT-YPMG3htU+2NPSNXA@mail.gmail.com"

      type="cite">

      <pre wrap="">On Wed, Aug 31, 2011 at 1:09 AM, Glenn Linderman <a class="moz-txt-link-rfc2396E" href="mailto:v+python@g.nevcal.com">&lt;v+python@g.nevcal.com&gt;</a> wrote:

</pre>

      <blockquote type="cite">

        <pre wrap="">The str type itself can presently be used to process other

character encodings: if they are fixed width &lt; 32-bit elements those

encodings might be considered Unicode encodings, but there is no requirement

that they are, and some operations on str may operate with knowledge of some

Unicode semantics, so there are caveats.

</pre>

      </blockquote>

      <pre wrap="">

Actually, the str type in Python 3 and the unicode type in Python 2

are constrained everywhere to either 16-bit or 21-bit "characters".

(Except when writing C code, which can do any number of invalid things

so is the equivalent of assuming 1 == 0.) In particular, on a wide

build, there is no way to get a code point &gt;= 2**21, and I don't want

PEP 393 to change this. So at best we can use these types to represent

arrays of 21-bit unsigned ints. But I think it is more useful to think

of them as always representing "some form of Unicode", whether that is

UTF-16 (on narrow builds) or 21-bit code points or perhaps some

vaguely similar superset -- but for those code units/code points that

are representable *and* valid (either code points or code units)

according to the (supported version of) the Unicode standard, the

meaning of those code points/units matches that of the standard.

Note that this is different from the bytes type, where the meaning of

a byte is entirely determined by what it means in the programmer's

head.

</pre>

    </blockquote>

    <br>

    Sorry, my Perl background is leaking through.  I didn't double-check
    that str constrains the value of each element to range(0x110000),
    i.e. 0 through 0x10FFFF, but I see now by testing that it does.  For
    some of my ideas, then, either a subtype of str would have to be able
    to relax that constraint, or str would not be the appropriate base
    type to use (but other base types could be used, so this is not a
    serious issue for the ideas).<br>
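    <br>
    The range check described above can be reproduced with a short test.
    This is only a sketch, assuming Python 3: the helper name fits_in_str
    is made up for illustration, and array.array with typecode 'L' is just
    one example of a base type without the constraint, not something
    proposed in this thread.<br>

```python
import array

MAX_CODE_POINT = 0x10FFFF  # highest value str accepts; 0x110000 is one past it


def fits_in_str(value):
    """Return True if chr() accepts the value as a str element."""
    try:
        chr(value)
    except (ValueError, OverflowError):
        return False
    return True


print(fits_in_str(MAX_CODE_POINT))      # True
print(fits_in_str(MAX_CODE_POINT + 1))  # False

# Assumption for illustration: array.array with typecode 'L' stores
# unsigned integers of at least 32 bits, so it is not limited to the
# Unicode code point range the way str is.
relaxed = array.array('L', [0x41, MAX_CODE_POINT + 1])
print(relaxed[1] == 0x110000)  # True
```

    Such an array gives up every str method, of course, so any
    Unicode-aware operations would have to be supplied separately.<br>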

    <br>

    I have no problem with thinking of str as representing "some form of

    Unicode".  None of my proposals change that, although they may
    change other things and may invent new forms of Unicode
    representation. You have stated that it is better to document what

    str actually does, rather than attempt to adhere slavishly to

    Unicode standard concepts.  The Unicode Consortium may well define

    legal, conforming bytestreams for communicating processes, but

    languages and applications are free to use other representations

    internally.  We can either artificially constrain ourselves to minor
    tweaks of the legal, conforming bytestreams, or we can invent a

    representation (whether called str or something else) that is useful

    and efficient in practice.<br>

  </body>

</html>