[Python-ideas] Standard (portable) bytecode "assembly" format

Fri Feb 26 00:27:02 EST 2016

On -dev (http://article.gmane.org/gmane.comp.python.devel/156543), Demur Rumed asked what it would take to get the "wordcode" patch into Python. Obviously, we need to finish it, benchmark it, etc., but on top of that, as Guido pointed out:

> An unfortunate issue however is that many projects seem to make a
> hobby of hacking bytecode. All those projects would have to be totally
> rewritten in order to support the new wordcode format (as opposed to
> just having to be slightly adjusted to support the occasional new
> bytecode opcode). Those projects of course don't work with Pypy or
> Jython either, but they do work for mainstream CPython, and it's
> unacceptable to just leave them all behind.

Greg Ewing replied:

> Maybe this argues for having an assembly-language-like
> intermediate form between the AST and the actual code
> used by the interpreter? Done properly it could make
> things easier for bytecode-hacking projects as well as
> providing some insulation from implementation details.

I think he's right.

Of course we already have such a format today: dis.Bytecode. But it doesn't quite solve the problem, for three reasons:

 * Not accessible from C.
 * Not mutable, and no assembler.
 * A few things (mainly jump arguments) are still in terms of bytecode bytes.
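To make the last point concrete, here is a small sketch using only the existing dis API (dis.get_instructions, in the stdlib since 3.4). The function `sample` and the variable names are made up for illustration, and the exact opcodes and offsets differ between CPython versions, so no output is shown.

```python
import dis


def sample(x):
    # Hypothetical function, only here to produce a conditional jump.
    if x:
        return "yes"
    return "no"


# Opcodes that take a jump argument (relative or absolute).
JUMP_OPCODES = set(dis.hasjrel) | set(dis.hasjabs)

instructions = list(dis.get_instructions(sample))
offsets = {instr.offset for instr in instructions}

# For constants, argval is already the decoded value ...
const_values = [i.argval for i in instructions if i.opcode in dis.hasconst]

# ... but for jumps it is a position in the bytecode, not an instruction.
jump_targets = [i.argval for i in instructions if i.opcode in JUMP_OPCODES]

for instr in instructions:
    print(instr.offset, instr.opname, repr(instr.argval))
```

Anything that inserts or removes an instruction has to recompute those jump offsets by hand, which is exactly the part that would break under wordcode.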

But fix those, and we have a format that will be unchanged with wordcode, and that can work out of the box in MicroPython (which has a not-quite-CPython bytecode format), and so on. I think if we do that for 3.6, then it's plausible to consider wordcode for 3.7.

And, if we fix it well enough, it also solves the problem I brought up a few weeks ago (http://article.gmane.org/gmane.comp.python.ideas/38431): if PEP 511 is going to provide a builtin API for registering bytecode processors, we should make it feasible to write them.

I have a somewhat complete proposal (at http://stupidpythonideas.blogspot.com/2016/02/a-standard-assembly-format-for-python.html), but until I actually implement it, most people should only care about this summary:

 * Iterable of (opcode, argval [, line [, ...]]) tuples. The argval is the actual global name, constant value, and so on, not the encoded index. For jumps, the argval is just the target instruction itself. The existing dis.Bytecode (with a few minor changes) already fits this type--but so does, say, a list of 3-tuples, which we can much more easily build in C.

 * The assemble function from compile.c doesn't need that much work to convert it into a PyCode_Assemble/dis.assemble that takes such an iterable (plus optional name, filename, and first_line) and generates a code object. The compiler can then use the same function that pure Python code uses. And PyCode_Assemble is the only new C API function needed.

 * We already have a disassembler for this format in the stdlib since 3.4. It does need a few minor changes, and there are a few simple extensions that I think are worth adding (like making Bytecode a MutableSequence), but that's it.

 * Assuming the assembler drops NOPs, we can use NOPs as pseudo-instructions for when you want byteplay-like Label and SetLineNo. The disassembler can optionally even generate them. So, we don't need explicit pseudo-instructions.

 * Any higher-level representations, like a graph of blocks with edges for the jumps between them, are easy enough to build on top of the dis representation (and to flatten back into that representation), so we don't need anything more complicated in the stdlib.
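As a rough illustration of the first bullet, the sketch below converts today's dis output into (opcode, argval) entries where each jump's argval is the target entry itself rather than an offset. This is not the proposed implementation: `to_entries` and `sample` are hypothetical names, and the entries are two-element lists rather than tuples only because a forward jump has to be patched after its target has been created.

```python
import dis

# Opcodes that take a jump argument (relative or absolute).
JUMP_OPCODES = set(dis.hasjrel) | set(dis.hasjabs)


def to_entries(code):
    """Return a list of [opcode, argval] entries for a function or code object.

    For jumps, argval is the target entry itself instead of a byte offset.
    """
    instructions = list(dis.get_instructions(code))
    # First pass: one mutable entry per instruction, keyed by byte offset.
    entries = [[instr.opcode, instr.argval] for instr in instructions]
    by_offset = {instr.offset: entry
                 for instr, entry in zip(instructions, entries)}
    # Second pass: swap each jump's offset for the entry it points at.
    for instr, entry in zip(instructions, entries):
        if instr.opcode in JUMP_OPCODES:
            entry[1] = by_offset[instr.argval]
    return entries


def sample(items):
    # Hypothetical function with a loop and a branch, so it contains jumps.
    total = 0
    for item in items:
        if item:
            total += item
    return total


entries = to_entries(sample)
for opcode, argval in entries:
    if opcode in JUMP_OPCODES:
        # Show the jump target by opcode name; no offsets are involved.
        print(dis.opname[opcode], "->", dis.opname[argval[0]])
    else:
        print(dis.opname[opcode], repr(argval))
```

Because jumps refer to entries by identity, inserting or deleting an entry in the list doesn't invalidate any other entry, and nothing in the representation depends on how many bytes an instruction occupies.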