[Python-ideas] Standard (portable) bytecode "assembly" format

Fri Feb 26 18:47:57 EST 2016

On Feb 26, 2016, at 14:05, Victor Stinner <victor.stinner at gmail.com> wrote:
> 
> 2016-02-26 19:15 GMT+01:00 Andrew Barnert <abarnert at yahoo.com>:
>> Sure, we could either (a) have duplicate code in C and Python that do virtually the same assembly and fixup work, (b) rewrite the peephole optimizer and part of the compiler in Python and freeze both them and the dis module (or whatever), or (c) use a format that's accessible from both C and Python and change as little as possible to get what we want. I think the last one is clearly the best solution, but it's not because the other two aren't impossible.
>> (...)
>> As explained near the top, I want to share code between the assemble function in the compiler and the assemble function used in Python code.
>> 
>> Ideally, I'd like to do this without having to expose any new types or utility functions or anything else to C.
>> 
>> And, as it turns out, that's doable. I can write a PyCode_Assemble function that's used by the compiler and by Python code without having to add a single other new thing to the C API.
> 
> Currently, Python/compile.c uses specific C structures:

Yes. But most of that information is only needed earlier in the process--building the blocks, linearizing them, making sure each one ends in a return, etc. It requires a bit of reorganization to cleanly separate out the final assembly/fixup step, but not that much. And, while that last step does use the current structures today, it doesn't actually need anything from them but a way to iterate instructions. (I learned this during an earlier experiment, where I shared all of the compiler structures directly with the peephole optimizer and then tried to limit the sharing as much as possible.)

> My opinion on that is that it's not worth to modify Python/compile.c
> and leave it unchanged.

DRY. With a single assembler rather than two, anyone who wants to change anything about the internal bytecode format or how fixup works or anything else only has to do it once, rather than figuring out how to do the same thing in two completely different pieces of code that are intended to accomplish the same thing.

>>>> * Not mutable, and no assembler.
>>> 
>>> I looked at Bytecode & Instruction objects of dis. They look nice to
>>> "read" bytecode, but not to modify bytecode.
>>> 
>>> dis.Instruction is not mutable and informations are duplicated.
>> 
>> Which is exactly why I suggested the very alternative that you're replying to: tuples of (opcode, argval [, line [, ...]]) are trivial to build.
> 
> A tuple cannot be modified.

So what? Even your examples don't mutate the Instr object; they build a new one instead. It's not like "bc[i] = Instruction(LOAD_CONST, constval)" or "yield (LOAD_CONST, constval)" and so on are less readable/Pythonic/concise than "bc[i].opcode = LOAD_CONST; bc[i].argval = constval".

If you really think this is important, changing the format to any iterable of iterables instead of iterable of tuples is trivial, so you can use lists instead of tuples, and make Instruction mutable, and so on. But I don't see what it buys you.

Also, one more time: I'm not trying to invent the all-singing, all-dancing best-possible interface for all kinds of bytecode manipulation; I'm trying to invent the simplest portable and resilient format that people can build other APIs on top of. The dis module happens to provide a somewhat useful such API for simple manipulations, which is nice, but it will never be the best one for all manipulations, and that's fine. So I don't want to change dis any more than necessary. Your own API  can diverge much more radically from dis if it wants. It just has to take the same iterable-of-tuples format as input and output; it can store things internally however it wants, which can include mutable instruction objects if you want.

> In my bytecode API, you provide a name, the opcode is computed from
> the name.

To me, requiring something that duck-types like an int, and then providing an IntEnum of all the opcodes, seems like a much nicer API than requiring strings and providing str<->int maps. That's exactly what enums are for. But if you want to build a string API on top of the portable duck-types-as-int format instead, you can. All you have to do is emit the ints in the iterable you pass to assemble.

More generally, I don't think debating all, or even any, of the design decisions of your in-progress library is at all relevant to this discussion. Unless you think you have an example of something you can do with raw bytecode that you can't do with the portable format, it doesn't affect this proposal at all.

>>> BytecodeBlocks is a "flat" control flow graph (CFG). It is required by
>>> the peephole optimizer to not modify two instructions which are part
>>> of two code paths (two blocks).
>> 
>> Here we get to the core of the proposal.
>> 
>> As I show in the linked blog post, it takes a handful of lines to go back and forth between the proposed format and a block-graph format. It's just as easy to go back and forth between having pseudo-instructions and not having them. Or any other format you come up with.
> 
> I saw your "def blockify(instructions):" function, but I don't
> understand how do you store labels.

I gave examples that show how to do various things with and without storing labels. You're looking at one of the examples without storing labels, and asking where the labels are stored in that example. For that example, I instead make sure I have a complete dis.Bytecode with its is_jump_target fields filled in, and use that, to show that labels aren't always necessary.

The next few questions are mostly irrelevant, so I'm skipping most of them. In general, you're asking how your library could do block-related things directly on the portable format. In many cases, what you want to do is actually easy, but it doesn't matter, because your library isn't going to do that; it's going to take the portable format as input and output, and do things on a graph of blocks in between, and the format you use for that graph of blocks is entirely up to you, as long as you can linearize that back to an iterable of tuples as the end.

> I identified a requirement when you manipulate jumps: being able to
> "resolve jumps".
...
> With bytecode.Bytecode (list of instructions), you have to iterate on
> all iterations to search for the label.

No you don't. I gave examples that resolve jumps in O(1) time, both with and without labels. But, again, who cares? Your code won't be doing this.

> I'm not sure that a label is the most convenient abstraction for
> blocks.

And it doesn't have to be. As long as it's sufficient for you to build whatever more convenient abstraction you think you need.

>>> Yeah, you should start with something simple but extensible. An API
>>> generic enough to be usable as a low-level API by existing byteplay,
>>> codetransformer, bytecode projects, and then build an higher-level API
>>> on top of that. Or maybe I'm right and it's a bad idea :-)
>> 
>> I don't understand the last sentence.
> 
> Sorry, I wanted to write "maybe I'm wrong and it's a bad idea".

OK, then I think you're right, and it was a good idea. :)

And that's exactly what I've attempted to do: come up with the simplest API that can support things like byteplay, codetransformer, and bytecode so they don't have to directly manipulate the byte strings anymore, giving us more freedom to change the internal CPython format without breaking every bytecode processor in the world.