[Python-Dev] PEP 332 revival in coordination with pep 349? [ Was:Re: release plan for 2.5 ?]

Tue Feb 14 01:09:57 CET 2006

At 03:23 PM 2/13/2006 -0800, Guido van Rossum wrote:
>On 2/13/06, Phillip J. Eby <pje at telecommunity.com> wrote:
> > The only
> > use I see for having an encoding for a 'str' would be to allow confirming
> > that the input string in fact is valid for that encoding.  So,
> > "bytes(some_str,'ascii')" would be an assertion that some_str must be valid
> > ASCII.
>
>We already have ways to assert that a string is ASCII.

I didn't mean that it was the only purpose.  In Python 2.x, practical code
sometimes has to deal with "string-like" objects; that is, it takes either
str or unicode.  If such code calls bytes(), it's going to want to include
an encoding so that unicode conversions won't fail.  But silently ignoring
the encoding argument when the input turns out to be a str isn't a good idea.

Ergo, I propose to permit the encoding to be specified when passing in a 
(2.x) str object, to allow code that handles both str and unicode to be 
"str-stable" in 2.x.

I'm fine with rejecting an encoding argument if the initializer is not a 
str or unicode; I just don't want the call signature to vary based on a 
runtime distinction between str and unicode.  And, I don't want the 
encoding argument to be silently ignored when you pass in a string.  If I 
assert that I'm encoding ASCII (or utf-8 or whatever), then the string 
should be required to be valid.  If I don't pass in an encoding, then I'm 
good to go.
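
To make the rule above concrete, here is a minimal sketch rather than a
proposed implementation.  It is written for a modern Python 3 interpreter so
it can be run today, which means `bytes` stands in for the 2.x str type and
`str` stands in for 2.x unicode.  The name `make_bytes` is hypothetical, and
requiring an encoding for unicode input is only a placeholder for the
separate question of what the default should be.

```python
def make_bytes(initializer, encoding=None):
    """Hypothetical stand-in for the proposed bytes() constructor."""
    if isinstance(initializer, bytes):        # plays the role of 2.x str
        if encoding is not None:
            # The encoding is an assertion, not something to ignore:
            # decoding raises UnicodeDecodeError if the data is invalid.
            initializer.decode(encoding)
        return bytes(initializer)
    if isinstance(initializer, str):          # plays the role of 2.x unicode
        if encoding is None:
            # Placeholder only; the default is a separate question.
            raise TypeError("encoding required for unicode input")
        return initializer.encode(encoding)
    if encoding is not None:
        # Reject an encoding when the initializer is not a string type.
        raise TypeError("encoding is only accepted with a string initializer")
    return bytes(initializer)
```

With this shape, a call such as make_bytes(value, 'ascii') means the same
thing for both string types: the result is guaranteed to be valid ASCII, or
the call fails.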

(This is orthogonal to the issue of what encoding is used as a default for 
conversions from the unicode type, btw.)

> > For 3.0, the type formerly known as "str" won't exist, so only the Unicode
> > part will be relevant then.
>
>And I think then the encoding should be required or default to ASCII.

The reason I'm arguing for latin-1 is symmetry in 2.x versions only.  (In
3.x, there's no str vs. unicode, and thus nothing to be symmetrical with.)
So, if you invoke bytes() without an encoding on a 2.x basestring, you
should get the same result whether it is a str or a unicode.  Latin-1
produces "the same result" when viewed in terms of the resulting byte
string, since it maps each code point below 256 to the byte of the same
value.
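
The latin-1 symmetry can be checked directly.  The sketch below again
assumes a Python 3 interpreter, with `bytes` standing in for 2.x str and
`str` standing in for 2.x unicode; `encode_default` is a hypothetical name
for the default conversion being argued for.

```python
def encode_default(text):
    """Hypothetical default conversion: encode the text as latin-1."""
    return text.encode("latin-1")

# Every possible byte value, as a 2.x str would hold it.
raw = bytes(range(256))
# The same values as code points, as a 2.x unicode would hold them.
text = "".join(map(chr, range(256)))

# Latin-1 maps each code point below 256 to the byte of the same value,
# so both inputs produce the same byte string.
assert encode_default(text) == raw
```

The symmetry only covers code points below 256; anything above that cannot
be encoded as latin-1 and raises UnicodeEncodeError.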

If we don't go with latin-1, I'd argue for requiring an encoding for 
unicode objects in 2.x, because that seems like the only reasonable way to 
break the symmetry between str and unicode, even though it forces 
"str-stable" code to specify an encoding.  The key is that at least *one* 
of the signatures needs to be stable in meaning across both str and unicode 
in 2.x in order to allow unicode-safe, str-stable code to be written.

(Again, for 3.x, this issue doesn't come into play because there's only one 
string type to worry about; what the default is or whether there's a 
default is therefore entirely up to you.)