[Python-Dev] PEP 332 revival in coordination with pep 349? [ Was:Re: release plan for 2.5 ?]

Tue Feb 14 06:20:56 CET 2006

At 04:29 PM 2/13/2006 -0800, Guido van Rossum wrote:
>On 2/13/06, Phillip J. Eby <pje at telecommunity.com> wrote:
> > I didn't mean that it was the only purpose.  In Python 2.x, practical code
> > has to sometimes deal with "string-like" objects.  That is, code that takes
> > either strings or unicode.  If such code calls bytes(), it's going to want
> > to include an encoding so that unicode conversions won't fail.
>
>That sounds like a rather hypothetical example. Have you thought it
>through? Presumably code that accepts both str and unicode either
>doesn't care about encodings, but simply returns objects of the same
>type as the arguments -- and then it's unlikely to want to convert the
>arguments to bytes; or it *does* care about encodings, and then it
>probably already has to special-case str vs. unicode because it has to
>control how str objects are interpreted.

Actually, it's the other way around.  Code that wants to output 
uninterpreted bytes right now and accepts either strings or Unicode has to 
special-case *unicode* -- not str, because str is the only "bytes type" we 
currently have.

This creates an interesting issue in WSGI for Jython, which of course only 
has one (unicode-based) string type now.  Since there's no bytes type in 
Python in general, the only solution we could come up with was to treat 
such strings as latin-1:

     http://www.python.org/peps/pep-0333.html#unicode-issues

This is why I'm biased towards latin-1 encoding of unicode to bytes; it's 
"the same thing" as an uninterpreted string of bytes.

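The latin-1 claim is easy to check in a later Python where bytes and text are separate types. The sketch below assumes Python 3, which is not the interpreter under discussion in this thread. It only shows that latin-1 maps code point N to byte N, so the round trip changes nothing:

```python
# Assumes Python 3, where bytes and str are distinct types.
raw = bytes(range(256))                  # every possible byte value

text = raw.decode("latin-1")             # byte N becomes code point N
assert [ord(ch) for ch in text] == list(range(256))

round_tripped = text.encode("latin-1")   # code point N becomes byte N
assert round_tripped == raw
```
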
I think the difference in our viewpoints is that you're still thinking 
"string" thoughts, whereas I'm thinking "byte" thoughts.  Bytes are just 
bytes; they don't *have* an encoding.

So, if you think of "converting a string to bytes" as meaning "create an 
array of numerals corresponding to the characters in the string", then this 
leads to a uniform result whether the characters are in a str or a unicode 
object.  In other words, to me, bytes(str_or_unicode) should be treated as:

     bytes(map(ord, str_or_unicode))

That is, without an encoding, bytes() should simply treat str and 
unicode objects *as if they were a sequence of integers*, and produce an 
error when an integer is out of range.  This is a logical and consistent 
interpretation in the absence of an encoding, because in that case you 
don't care about the encoding - it's just raw data.
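
As a rough illustration of the "sequence of integers" reading, here is a sketch that assumes Python 3's bytes constructor, which accepts an iterable of ints and rejects values outside 0-255. The helper name bytes_from_ordinals is made up for this example and is not a proposed API:

```python
# Assumes Python 3; bytes_from_ordinals is a hypothetical helper name.
def bytes_from_ordinals(text):
    """Treat the string as a sequence of integers, with no encoding step."""
    return bytes(map(ord, text))

# Every character is in range(256), so each ordinal becomes one byte.
in_range = bytes_from_ordinals("abc\xf0")

try:
    bytes_from_ordinals("abc\u0100")     # 0x100 does not fit in a byte
    out_of_range_failed = False
except ValueError:
    out_of_range_failed = True
```

For strings whose characters all fall in the 0-255 range, this gives the same bytes as encoding to latin-1, which is the equivalence argued for above.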

If, however, you include an encoding, then you're stating that you want to 
encode the *meaning* of the string, not merely its integer values.

>What would bytes("abc\xf0", "latin-1") *mean*? Take the string
>"abc\xf0", interpret it as being encoded in XXX, and then encode from
>XXX to Latin-1. But what's XXX? As I showed in a previous post,
>"abc\xf0".encode("latin-1") *fails* because the source for the
>encoding is assumed to be ASCII.

I'm saying that XXX would be the same encoding as you specified.  i.e., 
including an encoding means you are encoding the *meaning* of the string.

However, I believe I mainly proposed this as an alternative to having 
bytes(str_or_unicode) work like bytes(map(ord,str_or_unicode)), which I 
think is probably a saner default.
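
For contrast, the with-encoding reading ("XXX is the encoding you specified") might look roughly like the following. This is a sketch under assumptions: it is written for Python 3, it uses Python 3's bytes to stand in for the 2.x str, and encode_meaning is a hypothetical name rather than anything proposed in this thread:

```python
# Assumes Python 3; encode_meaning is a hypothetical name, not a proposed API.
def encode_meaning(value, encoding):
    """Return bytes for value, interpreting byte input in the given encoding."""
    if isinstance(value, bytes):
        # "XXX" is the requested encoding: decode with it, then re-encode.
        # With latin-1 or utf-8, valid input comes back unchanged and
        # invalid input raises UnicodeDecodeError.
        return value.decode(encoding).encode(encoding)
    return value.encode(encoding)

same_bytes = encode_meaning(b"abc\xf0", "latin-1")
from_text = encode_meaning("abc\xf0", "utf-8")
```

Under this reading, passing byte input with an encoding amounts to checking that the bytes are valid in that encoding.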

>Your argument for symmetry would be a lot stronger if we used Latin-1
>for the conversion between str and Unicode. But we don't.

But that's because we're dealing with its meaning *as a string*, not merely 
as ordinals in a sequence of bytes.

>  I like the
>other interpretation (which I thought was yours too?) much better: str
><--> bytes conversions don't use encodings but simply change the type
>without changing the bytes;

I like it better too.  The part you didn't like was where MAL and I believe 
this should be extended to Unicode characters in the 0-255 range also.  :)

>There's one property that bytes, str and unicode all share: type(x[0])
>== type(x), at least as long as len(x) >= 1. This is perhaps the
>ultimate test for string-ness.
>
>Or should b[0] be an int, if b is a bytes object? That would change
>things dramatically.

+1 for it being an int.  Heck, I'd want to at least consider the 
possibility of introducing a character type (chr?) in Python 3.0, and 
getting rid of the "iterating a string yields strings" 
characteristic.  I've found it to be a bit of a pain when dealing with 
heterogeneous nested sequences that contain strings.
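
To make the nested-sequence pain concrete, here is a small sketch, assuming Python 3 (where indexing bytes does yield an int, while indexing str still yields a str). The flatten function is a hypothetical example, not code from any library:

```python
from collections.abc import Iterable

# Assumes Python 3. A one-character string contains itself as its only
# element, so a naive recursive flatten never bottoms out on strings.
# The isinstance check below is the special case that has to be added.
def flatten(items):
    """Yield the leaves of a nested sequence, keeping strings whole."""
    for item in items:
        if isinstance(item, Iterable) and not isinstance(item, (str, bytes)):
            yield from flatten(item)
        else:
            yield item

flat = list(flatten([1, ["ab", (2, "c")], "de"]))

first_of_str = "abc"[0]      # a str of length 1
first_of_bytes = b"abc"[0]   # an int
```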

>There's also the consideration for APIs that, informally, accept
>either a string or a sequence of objects. Many of these exist, and
>they are probably all being converted to support unicode as well as
>str (if it makes sense at all). Should a bytes object be considered as
>a sequence of things, or as a single thing, from the POV of these
>types of APIs? Should we try to standardize how code tests for the
>difference? (Currently all sorts of shortcuts are being taken, from
>isinstance(x, (list, tuple)) to isinstance(x, basestring).)

I'm inclined to think of certain features at least in terms of the buffer 
interface, but that's not something that's really exposed at the Python level.
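
For what it's worth, later Python versions do expose the buffer interface at the Python level through memoryview. The sketch below assumes Python 3; supports_buffer and byte_values are hypothetical helper names showing one way code could test for "bytes-like" objects without naming concrete types:

```python
# Assumes Python 3; both helper names are hypothetical.
def supports_buffer(obj):
    """Return True if obj exposes the buffer interface."""
    try:
        memoryview(obj)
    except TypeError:
        return False
    return True

def byte_values(obj):
    """List the raw byte values of an object that exposes a buffer."""
    return list(memoryview(obj).cast("B"))

values = byte_values(bytearray(b"\x00\xff"))
```

With this test, bytes and bytearray count as buffers, while text strings and lists do not.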