Re: [Python-ideas] Adding 'bytes' as alias for 'latin_1' codec.

May 30, 2011

On Sun, May 29, 2011 at 9:45 PM, Nick Coghlan <ncoghlan@gmail.com> wrote:
...
On Mon, May 30, 2011 at 12:39 PM, Stephen J. Turnbull
<stephen@xemacs.org> wrote:
...
(However, there are use
cases where it is claimed that 'HELO ' is needed both as str and as
bytes.)

My current opinion is that all of this still needs more
experimentation outside the core before we start fiddling any further
with the builtins (we blinked once in the lead-up to 3.0 by allowing
bytes and bytearray to retain a lot of string methods that assume an
ASCII compatible encoding, and I now have my doubts about the wisdom
of even that step). I don't have a good answer on how to deal with the
real world situations where the *use case* blurs the bytes/text
distinction (typically by embedding ASCII text inside an otherwise
binary protocol), and given the potential to backslide into the bad
old days of 8-bit strings, I'm not prepared to guess, either.

3.x has largely cleared the decks to allow a better solution to evolve
in this space by making it harder to blur the line accidentally, and
decode()/manipulate/encode() already nicely covers many stateless use
cases. If it turns out we need another type, or some other API, to
deal gracefully with any use cases where that isn't enough, then so be
it. However, I think we need to let the status quo run for a while
longer and see what people actually using the current types in
production come up with. The bytes/text division in Python 3 is by far
the biggest conceptual change between the two languages, so it's going
to take some time before we can figure out how many of the problems
encountered are real issues with the split model not covering some use
cases and how many are just people (including us) taking time to get
used to the sharp division between the two worlds.

Well said, Nick. We ought to attempt to live with the current
situation for quite a bit longer before stirring the pot again.

My feeling is that one of the main reasons why this topic keeps coming
up is simply that it is different from Python 2 -- this is "the year
of Python 3" so more people than ever before are discovering the
differences between Python 2 and 3. Most people's minds probably
haven't switched over, and the solutions and attitudes that worked in
Python 2 don't always work so well in Python 3.

Let's also remember that while Python is not exactly blazing a new
trail here, it is also not following the most conservative course.
Most languages of Python's vintage or older are still using a model
that blurs the line between text and binary data, representing Unicode
text as bytes that happen to be encoded in some encoding. Even if the
language assumes a default encoding this doesn't mean that all data
manipulated is actually text encoded in that encoding -- it just means
that you may get nonsense when you use text operations on data that
uses some other encoding, just as you get nonsense when you use text
operations on binary data (e.g. using readlines() on a JPEG file).
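
A minimal sketch of the "nonsense" case, assuming a made-up sample
string: UTF-8 bytes decoded with the latin-1 codec never raise an
error, because latin-1 assigns a character to every byte value, yet the
resulting text is wrong. The variable names are illustrative only.

```python
# Illustrative sample text; escapes are used so the source stays ASCII.
text = "na\u00efve caf\u00e9"
utf8_bytes = text.encode("utf-8")

# Decoding with the wrong codec succeeds silently: latin-1 maps every
# byte value to some character, so no exception is raised.
wrong = utf8_bytes.decode("latin-1")

# The underlying bytes are unharmed, so reversing the mistake recovers
# the original text.
recovered = wrong.encode("latin-1").decode("utf-8")

print(repr(wrong))
print(recovered == text)
```

The wrongly decoded string has one character per byte rather than one
per code point, which is why it no longer compares equal to the
original text.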

Python lets you do this too, to some extent, with some of the text
operations on bytes data, and this is definitely a compromise. I hope
that we have built in just enough friction to remind people that this
is not the best way to deal with text most of the time, while still
allowing advanced users who are writing e.g. parsers for Internet
protocols to stay at the bytes layer at a reasonable cost. Personally
I think we got this close enough to right that we won't have to
rethink the whole thing, even if small tweaks might be possible; but
there's no need to rush.
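
As a rough sketch of the two styles discussed in this thread, assuming
an SMTP-like command line as input: one helper stays at the bytes layer
using the ASCII-assuming bytes methods, the other follows the
decode()/manipulate/encode() pattern. The helper names and the sample
line are hypothetical, not taken from any real library.

```python
def parse_command_bytes(line):
    """Stay at the bytes layer, relying on ASCII-assuming bytes methods."""
    verb, _, arg = line.rstrip(b"\r\n").partition(b" ")
    return verb.upper(), arg


def parse_command_text(line):
    """decode() / manipulate / encode(): work on str in the middle."""
    text = line.decode("ascii").rstrip("\r\n")
    verb, _, arg = text.partition(" ")
    return verb.upper(), arg


line = b"helo mail.example.org\r\n"  # hypothetical input
verb_b, arg_b = parse_command_bytes(line)
verb_s, arg_s = parse_command_text(line)

# Encode again only at the boundary, when the reply goes back on the wire.
reply = ("250 Hello " + arg_s + "\r\n").encode("ascii")

# The built-in friction: mixing bytes and str is a TypeError in Python 3.
try:
    b"HELO " + "mail.example.org"
    mixing_allowed = True
except TypeError:
    mixing_allowed = False
```

Both helpers give the same verb and argument here; they differ only in
whether the intermediate values are bytes or str.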

-- 
--Guido van Rossum (python.org/~guido)