[Python-Dev] PEP 393 Summer of Code Project

Thu Sep 1 00:44:59 CEST 2011

On Thu, Sep 1, 2011 at 8:02 AM, Terry Reedy <tjreedy at udel.edu> wrote:
> On 8/31/2011 1:10 PM, Guido van Rossum wrote:
>> Ok, I dig this, to some extent. However saying it is UCS-2 is equally
>> bad.
>
> As I said on the tracker, our narrow builds are in-between (while moving
> closer to UTF-16), and both terms are deceptive, at least to some.

We should probably just explicitly document that the internal
representation in narrow builds is a UCS-2/UTF-16 hybrid - like
UTF-16, it can handle the full code point space, but, like UCS-2, it
allows code unit sequences (such as lone surrogates) that strict
UTF-16 would reject.
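
As a rough illustration of that hybrid behaviour, here is a minimal
sketch. It assumes a current CPython 3 interpreter rather than the
narrow builds under discussion, and the helper name
strict_utf16_accepts is made up for this example. It shows that a str
will happily hold a lone surrogate that the strict UTF-16 codec
refuses to encode, while a code point outside the BMP takes two 16-bit
code units in UTF-16.

```python
# Sketch only: assumes a current CPython 3; all names here are illustrative.
lone = "\ud800"        # a lone high surrogate, accepted as a str
astral = "\U00010000"  # first code point outside the BMP


def strict_utf16_accepts(text):
    """Return True if the strict UTF-16 codec is willing to encode text."""
    try:
        text.encode("utf-16-le")
    except UnicodeEncodeError:
        return False
    return True


# UTF-16 represents an astral code point as a surrogate pair,
# i.e. two 16-bit code units (four bytes).
astral_units = len(astral.encode("utf-16-le")) // 2

print(strict_utf16_accepts(astral))  # well-formed text encodes fine
print(strict_utf16_accepts(lone))    # a lone surrogate is rejected
print(astral_units)
```

On a narrow build, len(astral) would report the two code units
directly, which is where the UCS-2 flavour leaks through.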

Perhaps we should also finally split strings out to a dedicated
section on the same tier as Sequence types in the library reference.
Yes, they're sequences, but they're also so much more than that. Try
as you might, you're unlikely to succeed in duck typing strings the
way you can sequences, mappings, files, numbers and other interfaces.
Needing a "real string" is even more common than needing a "real
dict", especially after the efforts to make most parts of the
interpreter that previously cared about the latter distinction accept
arbitrary mapping objects.
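
To make the "real string" point concrete, here is a small sketch (the
helper join_accepts is hypothetical, and the choice of
collections.UserString and collections.ChainMap as stand-ins is mine).
A string-like object that is not a str subclass is rejected by
str.join, whereas a mapping that is not a dict is accepted by
str.format_map.

```python
from collections import ChainMap, UserString

fake = UserString("spam")             # string-like, but not a str subclass
mapping = ChainMap({"name": "spam"})  # mapping-like, but not a dict


def join_accepts(item):
    """Return True if str.join accepts item as a sequence element."""
    try:
        ", ".join([item])
    except TypeError:
        return False
    return True


# An arbitrary mapping is good enough for format_map.
formatted = "{name}".format_map(mapping)

print(join_accepts("spam"))  # a real str works
print(join_accepts(fake))    # the duck-typed string does not
print(formatted)
```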

I've created http://bugs.python.org/issue12874, suggesting that the
"Sequence Types" and "memoryview type" sections could be usefully
rearranged as:

    Sequence Types - list, tuple, range
    Text Data - str
    Binary Data - bytes, bytearray, memoryview

Cheers,
Nick.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia