[Python-ideas] Create Python 2.8 as a transition step to Python 3.x

Sat Jan 18 08:56:21 CET 2014

From: Neil Schemenauer <nas-python at arctrix.com>

Sent: Friday, January 17, 2014 7:22 PM

> Here is a far out idea to make transition smoother.  Release version
> 2.8 of Python with nearly all Python 3.x incompatible changes except
> for the bytes/unicode changes.

What exactly do you mean by "the bytes/unicode changes"? There's a wide range of differences between 2.7 and 3.4 that could fall into this category. At least two of them you've specifically included in your proposed 2.8, including one of the three huge ones. Here are the ones I can think of off the top of my head, in rough order of most to least code-breaking:

 * No automatic conversions from bytes to unicode.

 * No automatic conversions from unicode to bytes.
 * Rename unicode to str (included in your suggestion).

 * File objects can be either unicode-based (text) or bytes-based (binary), defaulting to unicode.

 * The stdin/out/err files, StringIO, and various other common file objects are text.

 * __str__ (and __repr__) must return unicode, not bytes—and it's what print, "%s", default "{}", etc. call.

 * __bytes__ (the 3.x equivalent of 2.x's __str__) exists, but is not called by anything but bytes(), and is not supplied by most builtin/stdlib types (which is why, e.g., bytes(2) returns b'\0\0', not b'2').
 * Dozens of builtins and stdlib functions that used to work on bytes (or, in some cases, on either bytes or unicode) now work on unicode (e.g., csv.reader, json.loads).
 * Default string literal as unicode (included in your suggestion; already available with a future statement).
 * No bytes.encode or unicode.decode. (In 2.x, when used with codecs like 'ascii' or 'utf-8' these were almost always errors… but errors that a lot of badly-written code relies on to "work", as long as you never give it a non-ASCII character.)
 * No bytes.__mod__ or bytes.format (at least in 3.4; this may change later).
 * Bytes is an iterable of small ints rather than of single-char bytes.

 * File objects are the wrappers from the io module, not thin wrappers around C stdio.

 * All text files have universal newlines enabled, unless otherwise specified by the (not in 2.x) newline param.

 * Functions like chr and ord are based on Unicode code points, not bytes. (There are no bytes equivalent because there's no need if bytes is an iterable of ints.)

 * Different internal representation for unicode objects.

 * Different C API for unicode objects.
 * No basestring.

So… which of these do you want, and which do you not?
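For reference, several of the items above can be checked directly in a 3.x interpreter. This is only a minimal sketch that runs on Python 3; the helper `raises` and the variable names are made up for illustration and are not part of any proposal.

```python
import builtins


def raises(exc_type, func, *args):
    """Return True if func(*args) raises exc_type (illustrative helper)."""
    try:
        func(*args)
    except exc_type:
        return True
    return False


# No automatic conversion between bytes and text, in either direction.
no_implicit_concat = raises(TypeError, lambda: b"abc" + "def")

# bytes(2) builds two zero bytes; it does not serialize the number 2.
two_zero_bytes = bytes(2)

# Iterating bytes yields small ints, not single-character byte strings.
byte_values = list(b"AB")

# bytes has no encode(), and str has no decode().
has_bytes_encode = hasattr(b"", "encode")
has_str_decode = hasattr("", "decode")

# ord (and chr) work on Unicode code points.
euro_code_point = ord("\u20ac")

# No basestring.
has_basestring = hasattr(builtins, "basestring")
```

Each of these is a place where 2.7 code can behave differently or fail outright, which is why "except for the bytes/unicode changes" needs to be pinned down item by item.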

I suspect that, whatever your exact answers, it would be a lot easier to fork 3.4 and port the 2.7 behavior you want than to fork 2.7 and backport almost all of 3.4.

And if you do it that way, you could even adapt the idea someone proposed a few weeks ago—not popular on this list, but maybe popular with your target audience—of turning each change on and off with a "from __past__ import misfeature" statement, so people could pick and choose the ones they need, and gradually remove past statements as they port from your forked 2.8 to real 3.4.

However, I also suspect that, whatever your exact answers, it won't be that useful. Look at people's reasons for not moving to 3.x:

 * If your app already works in 2.7, and has no need for any new 3.x-only packages, it makes perfect sense to stay with 2.7. Which means there's no reason to move to 2.8.
 * If your app works in 2.7, but you're worried that it will eventually become hard to find supported 2.7 installations to run on, would you really expect finding 2.8 installations to be easier?
 * If you're staying with 2.7 because your OS, hosting company, dev team, school, whatever provides it, there's no reason to go to 2.8.
 * If you depend on a package that hasn't been ported to 3.x… well, that's four separate issues:
    * If you depend on an in-house/small-market package that hasn't been ported, it's really the same case as "I have an app that works just fine in 2.7."
    * If you depend on a package that hasn't been ported because it's effectively moribund, it's not going to be ported to 2.8.
    * If you depend on a package that actually has been ported to 3.x, but you're too stupid to find information anywhere but blog posts or StackOverflow questions dated 2009 (which is depressingly common…), those posts are not going to tell you about 2.8.
    * If you depend on a package that's legitimately hard to port to 3.x, it obviously won't be ported to 2.8 yet either; and since it'll probably be a lower priority for the developers, even if 2.8 is an easier port than 3.4 there's no guarantee it'll come sooner. (Also, consider that typically, people depend on 6 packages that have been ported and 1 that hasn't; if they switch to 2.8, that'll be 7 packages they need to wait on rather than 1.)
 * If you have code that sort of works in 2.7 if you're careful to feed it only ASCII, just renaming str will almost certainly break your code. If you fix it, it will be as easy to port to 3.4 as to 2.8. If you don't fix it… well, at best this is the same as the first case; if not, it's the same as the next one.
 * If you have code that's legitimately difficult to port to 3.x because, e.g., it relies on parsing and creating network messages or file formats that mix ASCII text and binary or encoded-text payloads, just renaming str will break your code. And it may be non-trivial to fix.

I'm having a hard time imagining code that would be easy to port to 2.8, but not to 3.x. For example:

    payload = <some object with a __str__ method to serialize it>
    sock.sendall('Header: {}\r\nAnother: {}\r\n\r\n{}'.format(
        headers['header'], headers['another'], payload))

Even with just the two changes you already suggested: First, you have to change the literal to a bytes literal. More seriously, you have to rename that payload type's __str__ method to __bytes__. And if it does any string stuff internally, like encoding JSON, that has to change. Meanwhile, your logging code probably relies on the same __str__ method actually returning a str, so you have to add one of those. Assuming headers is a dict of strs, you either need to go back up the chain (or into the API that provides it) and change that so it's been a dict of bytes all along, or you need to explicitly encode the headers here. That doesn't sound too hard overall… but that gives you working Python 3.5 code (assuming PEP 460 goes through). And there doesn't seem to be any shortcut that would give you working 2.8 code without also working in 3.5.
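To make those steps concrete, here is roughly what the fixed-up version might look like. It is only a sketch under assumptions: `Payload`, `FakeSocket`, and `send_message` are hypothetical names (the payload type above is left unspecified), the payload is assumed to serialize itself as JSON, and the message is built with b"".join so that it does not depend on bytes formatting being available.

```python
import json


class Payload:
    """Hypothetical payload type; stands in for the unspecified object."""

    def __init__(self, data):
        self.data = data

    def __str__(self):
        # Text form, still needed by logging, print, "%s", and so on.
        return json.dumps(self.data, sort_keys=True)

    def __bytes__(self):
        # Wire form; nothing but bytes() calls this.
        return str(self).encode("utf-8")


class FakeSocket:
    """Stand-in for a real socket so the sketch runs without a network."""

    def __init__(self):
        self.sent = b""

    def sendall(self, data):
        if not isinstance(data, bytes):
            raise TypeError("a bytes-like object is required")
        self.sent += data


def send_message(sock, headers, payload):
    # headers is assumed to be a dict of str, so encode explicitly here.
    message = b"".join([
        b"Header: ", headers["header"].encode("ascii"), b"\r\n",
        b"Another: ", headers["another"].encode("ascii"), b"\r\n",
        b"\r\n",
        bytes(payload),
    ])
    sock.sendall(message)


sock = FakeSocket()
send_message(sock, {"header": "a", "another": "b"}, Payload({"x": 1}))
```

Every change here (bytes literals, a separate __bytes__ method, explicit header encoding) is one that 3.x requires anyway, so the same code would be the port to a 2.8 with str renamed.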

Also, one quick comment:

> - removal of 'apply', 'buffer', 'callable', 

'callable' exists in Python 3.2+.

Not a big deal, unless this implies that you're basing everything on the state of the ecosystem back in Python 3.1. I don't think that it does, but just in case: Three years ago, people didn't have much experience with porting yet (e.g., writing 2.x code and running it through 2to3 at install time was considered the best way to port things gradually…) and most of PyPI didn't exist for 3.x yet. Back then, this suggestion would have been a lot more compelling than it is today, because all anyone could say was, "Wait and see, we're hoping it'll be better" instead of "Look and see, it already is better."