
On Sun, May 29, 2011 at 9:45 PM, Nick Coghlan <ncoghlan@gmail.com> wrote:
On Mon, May 30, 2011 at 12:39 PM, Stephen J. Turnbull <stephen@xemacs.org> wrote:
(However, there are use cases where it is claimed that 'HELO ' is needed both as str and as bytes.)
My current opinion is that all of this still needs more experimentation outside the core before we start fiddling any further with the builtins (we blinked once in the lead-up to 3.0 by allowing bytes and bytearray to retain a lot of string methods that assume an ASCII compatible encoding, and I now have my doubts about the wisdom of even that step). I don't have a good answer on how to deal with the real world situations where the *use case* blurs the bytes/text distinction (typically by embedding ASCII text inside an otherwise binary protocol), and given the potential to backslide into the bad old days of 8-bit strings, I'm not prepared to guess, either.
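As a small illustration of the point about bytes keeping string methods that assume an ASCII-compatible encoding, here is a sketch of how those methods behave in Python 3. The sample values are made up for the example and are not taken from the thread.

```python
# bytes methods such as upper() only know about ASCII letters.
ascii_cmd = b"helo example.org"
upper_cmd = ascii_cmd.upper()  # ASCII letters are upper-cased as expected

# Non-ASCII bytes are passed through untouched, whatever encoding they came from.
latin1_word = "café".encode("latin-1")  # b"caf\xe9"
upper_bytes = latin1_word.upper()       # byte 0xE9 is left as it is
upper_text = "café".upper()             # str knows that the upper case of é is É

# Mixing the two types is an error in Python 3, not a silent coercion.
try:
    b"HELO " + "example.org"
    mixed_ok = True
except TypeError:
    mixed_ok = False
```

So the bytes methods give sensible answers only while the data really is ASCII text, which is the compromise being discussed.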
3.x has largely cleared the decks to allow a better solution to evolve in this space by making it harder to blur the line accidentally, and decode()/manipulate/encode() already nicely covers many stateless use cases. If it turns out we need another type, or some other API, to deal gracefully with any use cases where that isn't enough, then so be it. However, I think we need to let the status quo run for a while longer and see what people actually using the current types in production come up with. The bytes/text division in Python 3 is by far the biggest conceptual change between the two languages, so it's going to take some time before we can figure out how many of the problems encountered are real issues with the split model not covering some use cases and how many are just people (including us) taking time to get used to the sharp division between the two worlds.
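The decode()/manipulate/encode() pattern mentioned above might look roughly like the following for a stateless, line-oriented exchange. This is only a sketch: the function name reply_to_helo, the reply wording, and the choice of ASCII as the wire encoding are assumptions made for the example, not anything proposed in this thread.

```python
def reply_to_helo(raw: bytes, encoding: str = "ascii") -> bytes:
    """Hypothetical handler: bytes in, bytes out, str in between."""
    # 1. decode: cross from the bytes world into the text world once.
    text = raw.decode(encoding)

    # 2. manipulate: ordinary str operations on the decoded text.
    verb, _, domain = text.strip().partition(" ")
    domain = domain.strip()
    if verb.upper() != "HELO" or not domain:
        reply = "501 Syntax error"
    else:
        reply = "250 Hello " + domain

    # 3. encode: cross back to bytes only at the boundary.
    return (reply + "\r\n").encode(encoding)
```

For instance, reply_to_helo(b"HELO example.org\r\n") returns b"250 Hello example.org\r\n". Nothing is carried over between calls, which is why this covers the stateless cases and says nothing about the harder ones.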
Well said, Nick. We ought to attempt to live with the current situation for quite a bit longer before stirring the pot again.

My feeling is that one of the main reasons why this topic keeps coming up is simply that it is different from Python 2 -- this is "the year of Python 3", so more people than ever before are discovering the differences between Python 2 and 3. Most people's minds probably haven't switched over, and the solutions and attitudes that worked in Python 2 don't always work so well in Python 3.

Let's also remember that while Python is not exactly blazing a new trail here, it is also not following the most conservative course. Most languages of Python's vintage or older are still using a model that blurs the line between text and binary data, representing Unicode text as bytes that happen to be encoded in some encoding. Even if the language assumes a default encoding, this doesn't mean that all data manipulated is actually text encoded in that encoding -- it just means that you may get nonsense when you use text operations on data that uses some other encoding, just as you get nonsense when you use text operations on binary data (e.g. using readlines() on a JPEG file).

Python lets you do this too, to some extent, with some of the text operations on bytes data, and this is definitely a compromise. I hope that we have built in just enough friction to remind people that this is not the best way to deal with text most of the time, while still allowing advanced users who are writing e.g. parsers for Internet protocols to stay at the bytes layer at a reasonable cost.

Personally I think we got this close enough to right that we won't have to rethink the whole thing, even if small tweaks might be possible; but there's no need to rush.

--
--Guido van Rossum (python.org/~guido)