[Python-ideas] Adding 'bytes' as alias for 'latin_1' codec.

Mon May 30 22:27:05 CEST 2011

On Sun, May 29, 2011 at 9:45 PM, Nick Coghlan <ncoghlan at gmail.com> wrote:
> On Mon, May 30, 2011 at 12:39 PM, Stephen J. Turnbull
> <stephen at xemacs.org> wrote:
>> (However, there are use
>> cases where it is claimed that 'HELO ' is needed both as str and as
>> bytes.)
>
> My current opinion is that all of this still needs more
> experimentation outside the core before we start fiddling any further
> with the builtins (we blinked once in the lead-up to 3.0 by allowing
> bytes and bytearray to retain a lot of string methods that assume an
> ASCII compatible encoding, and I now have my doubts about the wisdom
> of even that step). I don't have a good answer on how to deal with the
> real world situations where the *use case* blurs the bytes/text
> distinction (typically by embedding ASCII text inside an otherwise
> binary protocol), and given the potential to backslide into the bad
> old days of 8-bit strings, I'm not prepared to guess, either.
>
> 3.x has largely cleared the decks to allow a better solution to evolve
> in this space by making it harder to blur the line accidentally, and
> decode()/manipulate/encode() already nicely covers many stateless use
> cases. If it turns out we need another type, or some other API, to
> deal gracefully with any use cases where that isn't enough, then so be
> it. However, I think we need to let the status quo run for a while
> longer and see what people actually using the current types in
> production come up with. The bytes/text division in Python 3 is by far
> the biggest conceptual change between the two languages, so it's going
> to take some time before we can figure out how many of the problems
> encountered are real issues with the split model not covering some use
> cases and how many are just people (including us) taking time to get
> used to the sharp division between the two worlds.

Well said, Nick. We ought to attempt to live with the current
situation for quite a bit longer before stirring the pot again.

My feeling is that one of the main reasons why this topic keeps coming
up is simply that it is different from Python 2 -- this is "the year
of Python 3" so more people than ever before are discovering the
differences between Python 2 and 3. Most people's minds probably
haven't switched over, and the solutions and attitudes that worked in
Python 2 don't always work so well in Python 3.

Let's also remember that while Python is not exactly blazing a new
trail here, it is also not following the most conservative course.
Most languages of Python's vintage or older are still using a model
that blurs the line between text and binary data, representing Unicode
text as bytes that happen to be encoded in some encoding. Even if the
language assumes a default encoding this doesn't mean that all data
manipulated is actually text encoded in that encoding -- it just means
that you may get nonsense when you use text operations on data that
uses some other encoding, just as you get nonsense when you use text
operations on binary data (e.g. using readlines() on a JPEG file).
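
To make the "nonsense" concrete, here is a minimal Python 3 sketch. The
sample string is made up for illustration; the point is only that decoding
with the wrong encoding raises no error, and that latin-1 in particular
accepts any byte sequence at all.

```python
# Hypothetical sample text, chosen only for illustration.
original = "café"
raw = original.encode("utf-8")         # b'caf\xc3\xa9'

# Decoding with the wrong encoding raises no error; it just yields nonsense.
wrong = raw.decode("latin-1")          # 'cafÃ©'
right = raw.decode("utf-8")            # 'café'

# latin-1 maps every byte value to a code point, so even arbitrary
# binary data "decodes" and round-trips unchanged.
binary = bytes(range(256))
as_text = binary.decode("latin-1")
round_trip = as_text.encode("latin-1")
```

Nothing here tells you whether the result is meaningful text; that
knowledge has to come from outside the data.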

Python lets you do this too, to some extent, with some of the text
operations on bytes data, and this is definitely a compromise. I hope
that we have built in just enough friction to remind people that this
is not the best way to deal with text most of the time, while still
allowing advanced users who are writing e.g. parsers for Internet
protocols to stay at the bytes layer at a reasonable cost. Personally
I think we got this close enough to right that we won't have to
rethink the whole thing, even if small tweaks might be possible; but
there's no need to rush.
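
As a rough sketch of what staying at the bytes layer looks like in
Python 3 (parse_command and the host name are hypothetical, invented for
this example and not taken from any real library):

```python
def parse_command(line):
    """Split a raw protocol line into (verb, argument), staying in bytes."""
    # partition() and upper() are among the string methods that bytes
    # retains; upper() only touches ASCII letters.
    verb, _, arg = line.rstrip(b"\r\n").partition(b" ")
    return verb.upper(), arg

verb, arg = parse_command(b"helo mail.example.org\r\n")

# The friction: mixing bytes and str fails instead of silently coercing.
try:
    b"HELO " + "mail.example.org"
    mixed_ok = True
except TypeError:
    mixed_ok = False

# The decode()/manipulate/encode() route for the part that really is text.
host = arg.decode("ascii").lower()
reply = ("250 " + host + "\r\n").encode("ascii")
```

The parser never leaves bytes, while anything treated as text goes
through an explicit decode() and encode().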

-- 
--Guido van Rossum (python.org/~guido)