[Python-Dev] What to do for bytes in 2.6?

Sun Jan 20 08:49:56 CET 2008

On 04:26 am, guido at python.org wrote:
>On Jan 19, 2008 5:54 PM,  <glyph at divmod.com> wrote:
>>On 19 Jan, 07:32 pm, guido at python.org wrote:

Starting with the most relevant bit before getting off into digressions 
that may not interest most people:

>>Why can't we get that warning in -3 mode just the same from something
>>read from a socket and a b"" literal?

>If you really want this, please think through all the consequences,
>and report back here. While I have a hunch that it'll end up giving
>too many false positives and at the same time too many false
>negatives, perhaps I haven't thought it through enough. But if you
>really think this'll be important for you, I hope you'll be willing to
>do at least some of the thinking.

While I stand by my statement that unicode is the Right Way to do text 
in python, this particular feature isn't really that important, and I 
can see there are cases where it might cause problems or make life more 
difficult.  I suspect that I won't really know whether I want the 
warning until I've actually tried to port some nuanced, real 
text-processing code to 3.0, and it looks like it's going to be a little 
while before that happens.  I suspect that if I do want the warning, it 
would be a feature for 2.7, not 2.6, so I don't want to waste a lot of 
everyone's time advocating for it.

Now for a nearly irrelevant digression (please feel free to stop reading 
here):

>>Now, ad-hoc code with a fast and loose definition of "text" can still
>>read arrays of bytes off a socket without specifying an encoding and
>>get away with it, but that's because Python's unicode implementation
>>has thus far been very forgiving, not because the data is cleanly
>>text yet.
>
>I would say that depends on the application, and on arrangements that
>client and server may have made off-line about the encoding.

I can see your point.  I think it probably holds better on files and 
streams than on sockets, though - please forgive me if I don't think 
that server applications which require environment-dependent out-of-band 
arrangements about locale are correct :).

>In 2.x, text can legitimately be represented as str -- there's even
>the locale module to further specify how it is to be interpreted as
>characters.

I'm aware that this specific example is kind of a ridiculous stretch, 
but it's the first one that came to mind.  Consider 
len(u'é'.encode('utf-8').rjust(5).decode('utf-8')).  Of course 
unicode.rjust() won't do the right thing in the case of surrogate pairs, 
not to mention RTL text, but it still handles a lot more cases than 
str.rjust(), since code points behave a lot more like characters than 
code units do.
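
To make the rjust() example above concrete, here is a minimal sketch.  It 
is written so that it should run on 2.6+ as well as 3.3+, and the 
variable names are only illustrative.  Padding the UTF-8 encoded form 
counts bytes, so the decoded result comes up one column short, while 
padding the unicode object counts code points.

```python
# -*- coding: utf-8 -*-
text = u"\u00e9"  # 'é' as a single code point

# Pad the encoded form: rjust() counts UTF-8 code units (bytes),
# and 'é' occupies two of them, so only three spaces are added.
padded_bytes = text.encode("utf-8").rjust(5)

# Pad the decoded form: rjust() counts code points, so four spaces
# are added.
padded_text = text.rjust(5)

print(len(padded_bytes))                  # width in bytes
print(len(padded_bytes.decode("utf-8")))  # width in code points after decoding
print(len(padded_text))                   # width in code points
```

The byte-padded version is 5 bytes wide but only 4 code points wide once 
decoded, whereas the unicode-padded version is the requested 5.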
>Sure, this doesn't work for full unicode, and it doesn't work for all
>protocols used with sockets, but claiming that only fast and loose
>code ever uses str to represent text is quite far from reality -- this
>would be saying that the locale module is only for quick and dirty
>code, which just ain't so.

It would definitely be overreaching to say all code that uses str is 
quick and dirty.  But I do think that it fits into one of two 
categories: quick and dirty, or legacy.  locale is an example of a 
legacy case for which there is no replacement (that I'm aware of).  Even 
if I were writing a totally unicode-clean application, as far as I'm 
aware, there's no common replacement for e.g. locale.currency().

Still, locale is limiting.  It's ... uncomfortable to call 
locale.currency() in a multi-user server process.  It would be nice if 
there were a replacement that completely separated encoding issues from 
localization issues.

>I believe that a constraint should be that by default (without -3 or a
>__future__ import) str and bytes should be the same thing. Or, another
>way of looking at this, reads from binary files and reads from sockets
>(and other similar things, like ctypes and mmap and the struct module,
>for example) should return str instances, not instances of a str
>subclass by default -- IMO returning a subclass is bound to break too
>much code. (Remember that there is still *lots* of code out there that
>uses "type(x) is types.StringType" rather than "isinstance(x, str)",
>and while I'd be happy to warn about that in -3 mode if we could, I
>think it's unacceptable to break that in the default environment --
>let it break in 3.0 instead.)

I agree.  But, it's precisely because this is so subtle that it would be 
nice to have tools which would report warnings to help fix it. 
*Certainly* by default, everywhere that's "str" in 2.5 should be "str" 
in 2.6.  Probably even in -3 mode, if the goal there is "warnings only". 
However, the feature still strikes me as potentially useful while 
porting.  If I were going to advocate for it, though, it would be as a 
separate option, e.g. "--separate-bytes-type".  I'd want this to be 
separate from just running the code on 3.0 to see what happens, because 
the str/bytes split seems like the most subtle and difficult aspect of 
the port to get right; it would be nice to be able to tweak it 
individually, without the 
other issues related to 3.0.  For example, some of the code I work on 
has a big stack of dependencies.  Some of those are in C, and most of 
them don't process text at all.  Still, most of them aren't going to 
port to 3.0 very early, and it would be good to start running in as 
3.0-like an environment as possible before then, so that the hard stuff 
is done by the time the full stack has been migrated.
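
As a rough sketch of the type-check breakage mentioned in the quoted text 
above: assume a hypothetical str subclass (TaggedStr is a made-up name, 
not anything proposed for 2.6) standing in for a distinct bytes type.  An 
exact type check rejects it, while isinstance() accepts it.  The sketch 
uses "type(x) is str" because types.StringType is just an alias for str 
in 2.x and does not exist in 3.0.

```python
class TaggedStr(str):
    """Hypothetical str subclass; illustrative only."""


def is_string_exact(x):
    # Exact type comparison: fails for any subclass of str.
    return type(x) is str


def is_string_isinstance(x):
    # Subclass-aware check: accepts str and anything derived from it.
    return isinstance(x, str)


data = TaggedStr("payload")

print(is_string_exact("payload"))   # plain str passes the exact check
print(is_string_exact(data))        # the subclass does not
print(is_string_isinstance(data))   # isinstance() accepts the subclass
```

Code written in the first style would start misbehaving as soon as reads 
from sockets or binary files returned such a subclass, which is why an 
opt-in switch seems safer than changing the default.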
>>I've written lots of code that
>>aggressively rejects str() instances as text, as well as unicode
>>instances as bytes, and that's in code that still supports 2.3 ;).
>
>Yeah, well, but remember, while keeping you happy is high on my list

Thanks, good to hear :)

>of priorities, it's not the only priority. :-)

I don't think it's even my fiancée's *only* priority, and I think it 
should stay higher on her list than yours ;-).