[Python-Dev] What to do for bytes in 2.6?

Fri Jan 18 05:43:47 CET 2008

On Jan 17, 2008 7:11 PM, Raymond Hettinger <python at rcn.com> wrote:
> > *If* we provide some kind of "backport" of
> > bytes (even if it's just an alias for or trivial
> > subclass of str), it should be part of a strategy
> > that makes it easier to write code that
> > runs under 2.6 and can be automatically translated
> > to run under 3.0 with the same semantics.
>
> If it's just an alias or trivial subclass, then we
> haven't added anything that can't be done trivially
> by the 2-to-3 tool.

I suggest you study how the 2to3 tool actually works before asserting this.

Consider the following function.

def stuff(blah):
  foo = ""
  while True:
    bar = blah.read(1024)
    if bar == "":
      break
    foo += bar
  return foo

Is it reading text or binary data from stream blah? We can't tell.  If
it's meant to be reading text, 2to3 should leave it alone. But if it's
meant to be reading binary data, 2to3 should change the string
literals to bytes literals (b"" in this case). (If it's used for both,
there's no hope.) As it stands, 2to3 hasn't a chance to decide what to
do, so it will leave it alone -- but the "translated" code will be
wrong if it was meant to be reading bytes.
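
To make that failure concrete, here is a small sketch of the function above carried over unchanged and handed a binary stream under 3.0 semantics. The io.BytesIO stand-in stream and the name outcome are assumptions added for illustration; the function body is the one from the example, only reindented.

```python
import io

def stuff(blah):
    foo = ""
    while True:
        bar = blah.read(1024)
        if bar == "":        # a bytes object never equals a str under 3.0
            break
        foo += bar           # str += bytes is rejected under 3.0
    return foo

try:
    stuff(io.BytesIO(b"\x89PNG"))
    outcome = "ok"
except TypeError:
    outcome = "TypeError"
```

The untouched "" literals are the whole problem here: nothing in the source says the stream yields bytes, so the mismatch only shows up at run time.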

However, if the two empty string literals were changed to b"", we
would know it was reading bytes. 2to3 could leave it alone, but at
least the untranslated code would be correct for 2.6 and the
translated code would be correct for 3.0.
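
A minimal sketch of that variant, assuming b"" is accepted as a literal as proposed. The names read_all_bytes and payload, and the io.BytesIO test stream, are hypothetical and only there to make the example self-contained.

```python
import io

def read_all_bytes(stream):
    # b"" marks both the accumulator and the end-of-stream
    # sentinel as binary data, so the intent is explicit.
    data = b""
    while True:
        chunk = stream.read(1024)
        if chunk == b"":
            break
        data += chunk
    return data

payload = read_all_bytes(io.BytesIO(b"\x00\x01" * 2000))
```

Nothing here needs translating: the same source expresses "this is binary data" whether b"" is an alias for "" or a distinct bytes literal.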

This may seem trivial (because we do all the work, and 2to3 just
leaves stuff alone), but having b"" and bytes as aliases for "" and
str in 2.6 would mean that we could write 2.6 code that correctly
expresses the use of binary data -- and we could use u"" and unicode
for code using text, and 2to3 would translate those to "" and str and
the code would be correct 3.0 text processing code.
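
The text-processing counterpart might look like the sketch below. It assumes the stream returns unicode objects; read_all_text, message, and the io.StringIO stand-in are hypothetical names chosen for illustration.

```python
import io

def read_all_text(stream):
    # u"" marks the accumulator and the sentinel as text; the
    # translation step described above would turn these into "".
    text = u""
    while True:
        chunk = stream.read(1024)
        if chunk == u"":
            break
        text += chunk
    return text

message = read_all_text(io.StringIO(u"caf\u00e9 " * 500))
```

Together with the b"" style for data, this leaves plain "" as the only ambiguous spelling, which is exactly the case a translator cannot decide on its own.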

Note that we really can't make 2to3 assume that all uses of str and ""
are referring to binary data -- that would mistranslate the vast
majority of code that does non-Unicode-aware text processing, which I
estimate is the majority of small and mid-size programs.

> I'm thinking that this is a deeper change.
> It doesn't serve either 2.6 or 3.0 to conflate
> str/unicode model with the bytes/text model.
> Mixing the two in one place just creates a mess
> in that one place.
>
> I'm sure we're thinking that this is just an optional
> transition tool, but the reality is that once people
> write 2.6 tools that use the new model,
> then 2.6 users are forced to deal with that model.
> It stops being optional or something in the future,
> it becomes a mental jump that needs to be made now
> (while still retaining the previous model in mind
> for all the rest of the code).

This may be true. But still, 2.6 *will* run 2.5 code without any
effort, so we will be able to mix modules using the 2.5 style and
modules using the 3.0 style (or at least some aspects of 3.0 style) in
one interpreter. Neither 2.5 nor 3.0 will support this combination.
That's why 2.6 is so important: it's a stepping stone.

> I don't think you need a case study to foresee that
> it will be unpleasant to work with a code base
> that commingles the two world views.

Well, you shouldn't commingle the two world views in a single module or
package. But that would just be bad style -- you shouldn't use
competing style rules within a package either (like using
words_with_underscores and camelCaseWords for method names).

> One other thought.  I'm guessing that apps that would
> care about the distinction are already using unicode
> and are already treating text as distinct from arrays
> of bytes.

Yes, but 99% of these still accept str instances in positions where
they require text. The problem is that the str type and its literals
are ambiguous -- their use is not enough to be able to guess whether
text or data is meant. Just being able to (voluntarily! on a
per-module basis!) use a different type name and literal style for
data could help forward-looking programmers get started on making the
distinction clear, thus getting ready for 3.0 without making the jump
just yet (or easily maintaining 2.6 and 3.0 versions of the same
package, using 2to3 to generate the 3.0 version automatically from the
2.6 code base).

> Instead, it's backwards thinking 20th-century
> neanderthal ascii-bound folks like myself who are going
> to have transition issues.  It would be nice for us
> knuckle-draggers to not have to face the issue until 3.0.

Oh, you won't. Just don't use the -3 command-line flag and don't put
"from __future__ import <whatever>" at the top of your modules, and
you won't have to change your ways at all. You can continue to
distribute your packages in 2.5 syntax that will also work with 2.6,
and your users will be happy (as long as they don't want to use your
code on 3.0 -- but if you want to give them that, *that* is when you
will finally be forced to face the issue. :-)

Note that I believe that the -3 flag should not change semantics -- it
should only add warnings. Semantic changes must either be backwards
compatible or be requested explicitly with a __future__ import (which
2to3 can remove).

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)