[Python-Dev] PEP 332 revival in coordination with pep 349? [ Was:Re: release plan for 2.5 ?]

Tue Feb 14 00:23:45 CET 2006

On 2/13/06, Phillip J. Eby <pje at telecommunity.com> wrote:
> At 12:03 AM 2/14/2006 +0100, M.-A. Lemburg wrote:
> >The conversion from Unicode to bytes is different in this
> >respect, since you are converting from a "bigger" type to
> >a "smaller" one. Choosing latin-1 as default for this
> >conversion would give you all 8 bits, instead of just 7
> >bits that ASCII provides.
>
> I was just pointing out that since byte strings are bytes by definition,
> then simply putting those bytes in a bytes() object doesn't alter the
> existing encoding.  So, using latin-1 when converting a string to bytes
> actually seems like the the One Obvious Way to do it.

This actually makes some sense -- bytes(s) where isinstance(s, str)
should just copy the data, since we can't know what encoding the user
believes it is in anyway. (With the exception of string literals,
where it makes sense to assume that the user believes it is in the
same encoding as the source code -- but I believe non-ASCII characters
in string literals are disallowed anyway, or at least known to cause
undefined results in rats.)

> I'm so accustomed to being wary of encoding issues that the idea doesn't
> *feel* right at first - I keep going, "but you can't know what encoding
> those bytes are".  Then I go, Duh, that's the point.  If you convert
> str->bytes, there's no conversion and no interpretation - neither the str
> nor the bytes object knows its encoding, and that's okay.  So
> str(bytes_object) (in 2.x) should also just turn it back to a normal
> bytestring.

You've got me convinced. Scrap my previous responses in this thread.

> In fact, the 'encoding' argument seems useless in the case of str objects,

Right.

> and it seems it should default to latin-1 for unicode objects.

But here I disagree.

> The only
> use I see for having an encoding for a 'str' would be to allow confirming
> that the input string in fact is valid for that encoding.  So,
> "bytes(some_str,'ascii')" would be an assertion that some_str must be valid
> ASCII.

We already have ways to assert that a string is ASCII.

> For 3.0, the type formerly known as "str" won't exist, so only the Unicode
> part will be relevant then.

And I think then the encoding should be required or default to ASCII.

--
--Guido van Rossum (home page: http://www.python.org/~guido/)