[Python-ideas] Standard (portable) bytecode "assembly" format

Andrew Barnert abarnert at yahoo.com
Fri Feb 26 13:15:58 EST 2016


On Feb 26, 2016, at 02:27, Victor Stinner <victor.stinner at gmail.com> wrote:
> 
> 2016-02-26 6:27 GMT+01:00 Andrew Barnert via Python-ideas
> <python-ideas at python.org>:
>> Of course we already have such a format today: dis.Bytecode. But it doesn't quite solve the problem, for three reasons:
>> 
>> * Not accessible from C.
> 
> I don't think that it's a real issue. The current trend is more to
> rewrite pieces of CPython in Python.

Sure, we could either (a) have duplicate code in C and Python that does virtually the same assembly and fixup work, (b) rewrite the peephole optimizer and part of the compiler in Python and freeze both them and the dis module (or whatever), or (c) use a format that's accessible from both C and Python and change as little as possible to get what we want. I think the last one is clearly the best solution, but not because the other two are impossible.

>> * Not mutable, and no assembler.
> 
> I looked at Bytecode & Instruction objects of dis. They look nice to
> "read" bytecode, but not to modify bytecode.
> 
> dis.Instruction is not mutable, and its information is duplicated.

Which is exactly why I suggested the very alternative that you're replying to: tuples of (opcode, argval [, line [, ...]]) are trivial to build. Instruction (with a minor, backward-compatible change) is compatible with that, but you don't need to use Instruction. Similarly, an iterable of such tuples is trivial to build; Bytecode is compatible with that, but you don't need to use Bytecode.

Here's an example of what a bytecode processor could look like:

    for opcode, argval, *rest in instructions:
        if opcode == dis.opmap['LOAD_GLOBAL']:
            yield (dis.opmap['LOAD_CONST'], eval(argval, globals()), *rest)
        else:
            yield (opcode, argval, *rest)

If you want to use the dis structures instead, you don't have to, but you can:

    bc = dis.Bytecode(instructions)
    for i, instr in enumerate(bc):
        if instr.opcode == dis.LOAD_GLOBAL:
            bc[i] = instr.replace(opcode=dis.opmap['LOAD_CONST'],
                                  argval=eval(instr.argval, globals()))
    return bc

And notice that, even if you _do_ want to use those structures, the problems you're imagining don't arise.

There are more complicated examples on the linked blog post.

> For example, the opcode is stored both as a name (LOAD_CONST) and as a
> number (100). The argument is stored as an int (1), a value ("hello"),
> and a representation ('"hello"'). It has no methods, only attributes
> like is_jump_target.

And, as I said, you only have to supply opcode, argval, and sometimes line. The other attributes are there for reading existing bytecode, but aren't needed for emitting it.

This is the same model that's used successfully in the tokenize module. (Of course that module has some other API nightmares, but _this_ part of it is very nice.) A token is a namedtuple with 5 attributes, but you can construct one with just the first 2, 3, or 4, or you can just substitute a plain tuple of 2, 3, or 4 elements in place of a TokenInfo.
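To make the tokenize analogy concrete, here's a quick sketch against the real tokenize module: untokenize happily accepts bare (type, string) 2-tuples in place of full TokenInfo records, which is exactly the kind of duck typing being proposed for instructions:

```python
import io
import tokenize

# Round-trip a snippet through tokenize, then rebuild the source from
# bare (type, string) 2-tuples instead of full TokenInfo records --
# untokenize accepts either form.
src = "x = 1 + 2\n"
tokens = tokenize.generate_tokens(io.StringIO(src).readline)
pairs = [(tok.type, tok.string) for tok in tokens]
rebuilt = tokenize.untokenize(pairs)
# The spacing differs from the original, but it compiles to the same thing.
ns = {}
exec(rebuilt, ns)
print(ns["x"])  # 3
```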

> dis.Instruction doesn't seem extensible to add new features.

Why not? I added hasjrel to see how easy it is: there's one obvious way to do it, which took a few seconds, and it works exactly as I'd want it to. What kind of new features do you think would be difficult to add?
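For what it's worth, the "obvious way" can be sketched against today's dis module (my illustration, not the actual patch being discussed): the flag is just a membership test against dis.hasjrel:

```python
import dis

# Sketch: a hasjrel flag for an instruction is a membership test of its
# opcode against dis.hasjrel (opcodes taking relative jump arguments).
def hasjrel(instr):
    return instr.opcode in dis.hasjrel

code = compile("for i in range(3):\n    pass\n", "<s>", "exec")
for instr in dis.get_instructions(code):
    if hasjrel(instr):
        print(instr.opname)  # e.g. the loop's relative jump(s)
```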

>> * A few things (mainly jump arguments) are still in terms of bytecode bytes.
> 
> Hum, this is a problem. The dis module is already in the stdlib; you
> cannot modify its API in a backward-incompatible way.

Again, already covered, and covered in more detail in the blog post.

> Abstract instructions can be modified (lineno, name, op, arg), they
> have no size. Concrete instructions have size, attributes cannot be
> modified.
> 
> Concrete bytecode & instructions is closer to what we already have in
> the dis module.

No it isn't. What we have in the dis module does _not_ have size; it's a flat sequence of instructions. If you've missed that, you probably need to go back and reread the proposal, because it doesn't really make sense if you think this is what it's suggesting.

> BytecodeBlocks is a "flat" control flow graph (CFG). It is required by
> the peephole optimizer to not modify two instructions which are part
> of two code paths (two blocks).

Here we get to the core of the proposal.

As I show in the linked blog post, it takes a handful of lines to go back and forth between the proposed format and a block-graph format. It's just as easy to go back and forth between having pseudo-instructions and not having them. Or any other format you come up with.
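As a rough illustration of how little code that takes (my sketch here, not the blog post's exact version), a flat instruction sequence can be split into basic blocks by starting a new block at each jump target and after each jump instruction:

```python
import dis

# Sketch: split a flat instruction sequence into basic blocks. A new
# block starts at every jump target and after every instruction whose
# opcode takes a (relative or absolute) jump argument.
def split_blocks(instructions):
    blocks, current = [], []
    for instr in instructions:
        if instr.is_jump_target and current:
            blocks.append(current)
            current = []
        current.append(instr)
        if instr.opcode in dis.hasjrel or instr.opcode in dis.hasjabs:
            blocks.append(current)
            current = []
    if current:
        blocks.append(current)
    return blocks

code = compile("x = 1\nif x:\n    y = 2\nelse:\n    y = 3\n", "<s>", "exec")
blocks = split_blocks(dis.get_instructions(code))
```

Going the other way is just flattening the lists back out (and recomputing jump arguments), which is the point: the block graph is a cheap view over the portable format, not a second format to standardize.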

That's not true for raw bytecode--going back and forth requires writing a complicated disassembler and even more complicated assembler.

But, even more important, the proposed format is the same between CPython 3.6 and MicroPython 3.6, and it stays the same even if CPython 3.7 switches to wordcode. And any code you've written that builds a block graph out of the proposed format still works.

That's what makes the proposed format a portable, resilient format. And I believe it's the simplest possible portable, resilient format.

It's not the ideal format to use for every possible kind of bytecode manipulation. That isn't the goal. The fact that it happens to be good enough for a lot of kinds of bytecode manipulation is a nice side benefit, but it's not the point. The fact that it integrates nicely with dis is also very nice, but it's not the point.

So, "let's build yet another third-party assembler and disassembler with a different API" is not a competing solution to this proposal; it's part of the problem I'm trying to solve.

> By the way, I wrote PEP 511 for AST optimizers, not for bytecode
> optimizers.

As I've said before: you included bytecode optimizers in PEP 511, you made the API more complicated so you could allow them, you provided a rationale for why we need to allow them, and you gave an example of one. If the PEP is wrong, you don't have to convince anyone; it's your PEP, go change it.

Anyway, from here you go off onto a long tangent arguing that my proposed format is not the ideal once-and-for-all-best format to use for every possible kind of bytecode manipulation. I already granted that above, and I'll grant it again and snip all the arguments.

>> And PyCode_Assemble is the only new C API function needed.
> 
> I don't understand why you care so much about having a C API. What do
> you want to do?

As explained near the top, I want to share code between the assemble function in the compiler and the assemble function used in Python code.

Ideally, I'd like to do this without having to expose any new types or utility functions or anything else to C.

And, as it turns out, that's doable. I can write a PyCode_Assemble function that's used by the compiler and by Python code without having to add a single other new thing to the C API.

>> * Any higher-level representations, like a graph of blocks with edges for the jumps between them, are easy enough to build on top of the dis representation (and to flatten back into that representation), so we don't need anything more complicated in the stdlib.
> 
> Yeah, you should start with something simple but extensible. An API
> generic enough to be usable as a low-level API by existing byteplay,
> codetransformer, and bytecode projects, and then build a higher-level API
> on top of that. Or maybe I'm right and it's a bad idea :-)

I don't understand the last sentence. Are you contradicting the rest of the paragraph, and suggesting that a simple but extensible API that can be used by byteplay, etc. and new projects is a bad thing? If so, why? Do you think it would be better to bless one of those projects, and keep all the others as hard to write as they are today?

