[Python-Dev] What to do for bytes in 2.6?
glyph at divmod.com
Sun Jan 20 08:49:56 CET 2008
On 04:26 am, guido at python.org wrote:
>On Jan 19, 2008 5:54 PM, <glyph at divmod.com> wrote:
>>On 19 Jan, 07:32 pm, guido at python.org wrote:
Starting with the most relevant bit before getting off into digressions
that may not interest most people:
>>Why can't we get that warning in -3 mode just the same from something
>>read from a socket and a b"" literal?
>If you really want this, please think through all the consequences,
>and report back here. While I have a hunch that it'll end up giving
>too many false positives and at the same time too many false
>negatives, perhaps I haven't thought it through enough. But if you
>really think this'll be important for you, I hope you'll be willing to
>do at least some of the thinking.
While I stand by my statement that unicode is the Right Way to do text
in python, this particular feature isn't really that important, and I
can see there are cases where it might cause problems or make life more
difficult. I suspect that I won't really know whether I want the
warning anyway before I've actually tried to port any nuanced, real
text-processing code to 3.0, and it looks like it's going to be a little
while before that happens. I suspect that if I do want the warning, it
would be a feature for 2.7, not 2.6, so I don't want to waste a lot of
everyone's time advocating for it.
Now for a nearly irrelevant digression (please feel free to stop reading
here):
>>Now, ad-hoc code with a fast and loose definition of "text" can still
>>read arrays of bytes off a socket without specifying an encoding and
>>get away with it, but that's because Python's unicode implementation
>>has thus far been very forgiving, not because the data is cleanly text
>>yet.
>
>I would say that depends on the application, and on arrangements that
>client and server may have made off-line about the encoding.
I can see your point. I think it probably holds better on files and
streams than on sockets, though - please forgive me if I don't think
that server applications which require environment-dependent out-of-band
arrangements about locale are correct :).
>In 2.x, text can legitimately be represented as str -- there's even
>the locale module to further specify how it is to be interpreted as
>characters.
I'm aware that this specific example is kind of a ridiculous stretch,
but it's the first one that came to mind. Consider
len(u'é'.encode('utf-8').rjust(5).decode('utf-8')). Of course
unicode.rjust() won't do the right thing in the case of surrogate pairs,
not to mention RTL text, but it still handles a lot more cases than
str.rjust(), since code points behave a lot more like characters than
code units do.
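
To make the rjust() example concrete, here is a minimal sketch. It assumes Python 3, where bytes plays the role that str plays in 2.x; the variable names are only illustrative. Padding the UTF-8 encoded form counts bytes, so the decoded result comes out one character short of the requested width:

```python
# Assumption: Python 3, with bytes standing in for the 2.x str type.
text = "\u00e9"                  # 'é' as a single code point
encoded = text.encode("utf-8")   # two bytes: b'\xc3\xa9'

# Padding the encoded form counts bytes (code units), not characters.
padded_bytes = encoded.rjust(5)  # three spaces + two bytes = 5 bytes
via_bytes = padded_bytes.decode("utf-8")

# Padding the text form counts code points.
via_text = text.rjust(5)         # four spaces + one code point

print(len(via_bytes))  # 4: one column short of the requested width
print(len(via_text))   # 5
```

The byte-oriented version silently yields a four-character string, which is the kind of error that stays invisible as long as the data happens to be ASCII.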
>Sure, this doesn't work for full unicode, and it doesn't work for all
>protocols used with sockets, but claiming that only fast and loose
>code ever uses str to represent text is quite far from reality -- this
>would be saying that the locale module is only for quick and dirty
>code, which just ain't so.
It would definitely be overreaching to say all code that uses str is
quick and dirty. But I do think that it fits into one of two
categories: quick and dirty, or legacy. locale is an example of a
legacy case for which there is no replacement (that I'm aware of). Even
if I were writing a totally unicode-clean application, as far as I'm
aware, there's no common replacement for e.g. locale.currency().
Still, locale is limiting. It's ... uncomfortable to call
locale.currency() in a multi-user server process. It would be nice if
there were a replacement that completely separated encoding issues from
localization issues.
>I believe that a constraint should be that by default (without -3 or a
>__future__ import) str and bytes should be the same thing. Or, another
>way of looking at this, reads from binary files and reads from sockets
>(and other similar things, like ctypes and mmap and the struct module,
>for example) should return str instances, not instances of a str
>subclass by default -- IMO returning a subclass is bound to break too
>much code. (Remember that there is still *lots* of code out there that
>uses "type(x) is types.StringType)" rather than "isinstance(x, str)",
>and while I'd be happy to warn about that in -3 mode if we could, I
>think it's unacceptable to break that in the default environment --
>let it break in 3.0 instead.)
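
The difference between the two checks mentioned above can be sketched as follows. This assumes Python 3 syntax, with "type(x) is str" standing in for the 2.x "type(x) is types.StringType" idiom; the TaggedStr subclass is hypothetical and only stands in for "a str subclass returned from a socket read":

```python
class TaggedStr(str):
    """Hypothetical str subclass, e.g. one marking data read from a socket."""


def exact_type_check(x):
    # The older idiom: true only for str itself, never for a subclass.
    return type(x) is str


def isinstance_check(x):
    # Accepts str and any subclass of it.
    return isinstance(x, str)


plain = "payload"
tagged = TaggedStr("payload")

print(exact_type_check(plain), isinstance_check(plain))    # True True
print(exact_type_check(tagged), isinstance_check(tagged))  # False True
```

Code using the exact-type idiom would start rejecting data as soon as reads returned a subclass, even though the value still compares equal to the plain string.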
I agree. But, it's precisely because this is so subtle that it would be
nice to have tools which would report warnings to help fix it.
*Certainly* by default, everywhere that's "str" in 2.5 should be "str"
in 2.6. Probably even in -3 mode, if the goal there is "warnings only".
However, the feature still strikes me as potentially useful while
porting. If I were going to advocate for it, though, it would be as a
separate option, e.g. "--separate-bytes-type". I say this as separate
from just trying to run the code on 3.0 to see what happens because it
seems like the most subtle and difficult aspect of the port to get
right; it would be nice to be able to tweak it individually, without the
other issues related to 3.0. For example, some of the code I work on
has a big stack of dependencies. Some of those are in C, and most of
them don't process text at all. Most of them aren't going to be ported
to 3.0 very early, though, and it would be good to start running in as
3.0-like an environment as possible before then, so that the hard stuff
is done by the time the full stack has been migrated.
>>I've written lots of code that
>>aggressively rejects str() instances as text, as well as unicode
>>instances as bytes, and that's in code that still supports 2.3 ;).
>
>Yeah, well, but remember, while keeping you happy is high on my list
Thanks, good to hear :)
>of priorities, it's not the only priority. :-)
I don't think it's even my fiancée's *only* priority, and I think it
should stay higher on her list than yours ;-).