[Python-Dev] PEP 332 revival in coordination with PEP 349? [Was: Re: release plan for 2.5?]

Guido van Rossum guido at python.org
Tue Feb 14 01:29:27 CET 2006


On 2/13/06, Phillip J. Eby <pje at telecommunity.com> wrote:
> I didn't mean that it was the only purpose.  In Python 2.x, practical code
> has to sometimes deal with "string-like" objects.  That is, code that takes
> either strings or unicode.  If such code calls bytes(), it's going to want
> to include an encoding so that unicode conversions won't fail.

That sounds like a rather hypothetical example. Have you thought it
through? Presumably code that accepts both str and unicode either
doesn't care about encodings, and simply returns objects of the same
type as its arguments -- and then it's unlikely to want to convert
the arguments to bytes; or it *does* care about encodings, and then
it probably already has to special-case str vs. unicode because it
has to control how str objects are interpreted.

> But
> silently ignoring the encoding argument in that case isn't a good idea.
>
> Ergo, I propose to permit the encoding to be specified when passing in a
> (2.x) str object, to allow code that handles both str and unicode to be
> "str-stable" in 2.x.

Again, have you thought this through?

What would bytes("abc\xf0", "latin-1") *mean*? Take the string
"abc\xf0", interpret it as being encoded in XXX, and then encode from
XXX to Latin-1. But what's XXX? As I showed in a previous post,
"abc\xf0".encode("latin-1") *fails*, because the str is first decoded
with the default (ASCII) codec before it is re-encoded.

I think we can make this work only when the string in fact only
contains ASCII and the encoding maps ASCII to itself (which most
encodings do -- but e.g. EBCDIC does not). But I'm not sure how useful
that is.
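
(A quick illustration of both cases; cp500 is one of the EBCDIC
codecs shipped with Python:)

    >>> "abc".encode("latin-1")   # ASCII in, ASCII-preserving codec
    'abc'
    >>> "abc".encode("cp500")     # EBCDIC: the bytes change
    '\x81\x82\x83'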

> I'm fine with rejecting an encoding argument if the initializer is not a
> str or unicode; I just don't want the call signature to vary based on a
> runtime distinction between str and unicode.

I'm still not sure that this will actually help anyone.

> And, I don't want the
> encoding argument to be silently ignored when you pass in a string.

Agreed.

> If I
> assert that I'm encoding ASCII (or utf-8 or whatever), then the string
> should be required to be valid.

Defined how? That the string is already in that encoding?
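
(If "valid" means the bytes decode cleanly in that encoding, then in
today's 2.x terms:)

    >>> "abc\xf0".decode("utf-8")      # not valid UTF-8
    Traceback (most recent call last):
      ...
    UnicodeDecodeError: ...
    >>> "abc\xf0".decode("latin-1")    # any byte string is valid latin-1
    u'abc\xf0'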

> If I don't pass in an encoding, then I'm
> good to go.
>
> (This is orthogonal to the issue of what encoding is used as a default for
> conversions from the unicode type, btw.)

Right. The issues are completely different!

> > > For 3.0, the type formerly known as "str" won't exist, so only the Unicode
> > > part will be relevant then.
> >
> >And I think then the encoding should be required or default to ASCII.
>
> The reason I'm arguing for latin-1 is symmetry in 2.x versions only.  (In
> 3.x, there's no str vs. unicode, and thus nothing to be symmetrical.)  So,
> if you invoke bytes() without an encoding on a 2.x basestring, you should
> get the same result.  Latin-1 produces "the same result" when viewed in
> terms of the resulting byte string.

Only if you assume the str object is encoded in Latin-1.

Your argument for symmetry would be a lot stronger if we used Latin-1
for the conversion between str and Unicode. But we don't. I like the
other interpretation (which I thought was yours too?) much better: str
<--> bytes conversions don't use encodings but simply change the type
without changing the bytes; conversion between either and unicode
works exactly the same, and requires an encoding unless all the
characters involved are pure ASCII.
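
(Rough pseudo-code for that interpretation -- purely illustrative;
the name and the use of a list of ints as the "bytes" stand-in are
mine, not part of any proposal:)

    def make_bytes(x, encoding=None):
        if isinstance(x, str):
            # same bytes, new type; no codec involved
            return [ord(c) for c in x]
        if isinstance(x, unicode):
            # an encoding is needed, unless everything is ASCII
            return [ord(c) for c in x.encode(encoding or "ascii")]
        raise TypeError("str or unicode expected")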

> If we don't go with latin-1, I'd argue for requiring an encoding for
> unicode objects in 2.x, because that seems like the only reasonable way to
> break the symmetry between str and unicode, even though it forces
> "str-stable" code to specify an encoding.  The key is that at least *one*
> of the signatures needs to be stable in meaning across both str and unicode
> in 2.x in order to allow unicode-safe, str-stable code to be written.

Using ASCII as the default encoding has the same property -- it can
remain stable across the 2.x / 3.0 boundary.

> (Again, for 3.x, this issue doesn't come into play because there's only one
> string type to worry about; what the default is or whether there's a
> default is therefore entirely up to you.)

A nice-to-have property would be that it might be possible to write
code that today deals with Unicode and str, but in 3.0 will deal with
Unicode and bytes instead. But I'm not sure how likely that is since
bytes objects won't have most methods that str and Unicode objects
have (like lower(), find(), etc.).

There's one property that bytes, str and unicode all share: type(x[0])
== type(x), at least as long as len(x) >= 1. This is perhaps the
ultimate test for string-ness.
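
(In today's 2.x:)

    >>> s = "abc"; type(s[0]) == type(s)
    True
    >>> u = u"abc"; type(u[0]) == type(u)
    True
    >>> t = (1, 2, 3); type(t[0]) == type(t)
    False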

Or should b[0] be an int, if b is a bytes object? That would change
things dramatically.

There's also the question of APIs that, informally, accept
either a string or a sequence of objects. Many of these exist, and
they are probably all being converted to support unicode as well as
str (if it makes sense at all). Should a bytes object be considered as
a sequence of things, or as a single thing, from the POV of these
types of APIs? Should we try to standardize how code tests for the
difference? (Currently all sorts of shortcuts are being taken, from
isinstance(x, (list, tuple)) to isinstance(x, basestring).)
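
(The kind of shortcut I mean -- whether a bytes object should pass or
fail such a test is exactly the open question:)

    def as_sequence(x):
        # Common 2.x idiom: a str or unicode is "a single thing",
        # anything else is treated as a sequence of things.
        if isinstance(x, basestring):
            return [x]
        return list(x)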

--
--Guido van Rossum (home page: http://www.python.org/~guido/)

