[Python-Dev] PEP 332 revival in coordination with pep 349? [ Was:Re: release plan for 2.5 ?]

Tue Feb 14 17:25:01 CET 2006

At 11:08 AM 2/14/2006 -0500, James Y Knight wrote:

>On Feb 14, 2006, at 1:52 AM, Martin v. Löwis wrote:
>
>>Phillip J. Eby wrote:
>>>I was just pointing out that since byte strings are bytes by
>>>definition,
>>>then simply putting those bytes in a bytes() object doesn't alter the
>>>existing encoding.  So, using latin-1 when converting a string to
>>>bytes
>>>actually seems like the the One Obvious Way to do it.
>>
>>This is a misconception. In Python 2.x, the type str already *is* a
>>bytes type. So if S is an instance of 2.x str, bytes(S) does not need
>>to do any conversion. You don't need to assume it is latin-1: it's
>>already bytes.
>>
>>>In fact, the 'encoding' argument seems useless in the case of str
>>>objects,
>>>and it seems it should default to latin-1 for unicode objects.
>>
>>I agree with the former, but not with the latter. There shouldn't be a
>>conversion of Unicode objects to bytes at all. If you want bytes from
>>a Unicode string U, write
>>
>>   bytes(U.encode(encoding))
>
>I like it, it makes sense. Unicode strings are simply not allowed as
>arguments to the byte constructor. Thinking about it, why would it be
>otherwise? And if you're mixing str-strings and unicode-strings, that
>means the str-strings you're sometimes giving are actually not byte
>strings, but character strings anyhow, so you should be encoding
>those too. bytes(s_or_U.encode('utf-8')) is a perfectly good spelling.

Actually, I think you mean:

     if isinstance(s_or_U, str):
         s_or_U = s_or_U.decode('utf-8')

     b = bytes(s_or_U.encode('utf-8'))

Or maybe:

     if isinstance(s_or_U, unicode):
         s_or_U = s_or_U.encode('utf-8')

     b = bytes(s_or_U)

Which is why I proposed that the boilerplate logic get moved *into* the 
bytes constructor.  I think this use case is going to be common in today's 
Python, but in truth I'm not as sure what bytes() will get used *for* in 
today's Python.  I'm probably overprojecting based on the need to use str 
objects now, but bytes aren't going to be a replacement for str for a good 
while anyway.

>Kill the encoding argument, and you're left with:
>
>Python2.X:
>- bytes(bytes_object) -> copy constructor
>- bytes(str_object) -> copy the bytes from the str to the bytes object
>- bytes(sequence_of_ints) -> make bytes with the values of the ints,
>error on overflow
>
>Python3.X removes str, and most APIs that did return str return bytes
>instead. Now all you have is:
>- bytes(bytes_object) -> copy constructor
>- bytes(sequence_of_ints) -> make bytes with the values of the ints,
>error on overflow
>
>Nice and simple.

I could certainly live with that approach, and it certainly rules out all 
the "when does the encoding argument apply and when should it be an error 
to pass it" questions.  :)