[Python-ideas] Standard (portable) bytecode "assembly" format

Fri Feb 26 17:05:51 EST 2016

2016-02-26 19:15 GMT+01:00 Andrew Barnert <abarnert at yahoo.com>:
> Sure, we could either (a) have duplicate code in C and Python that do virtually the same assembly and fixup work, (b) rewrite the peephole optimizer and part of the compiler in Python and freeze both them and the dis module (or whatever), or (c) use a format that's accessible from both C and Python and change as little as possible to get what we want. I think the last one is clearly the best solution, but it's not because the other two aren't impossible.
> (...)
> As explained near the top, I want to share code between the assemble function in the compiler and the assemble function used in Python code.
>
> Ideally, I'd like to do this without having to expose any new types or utility functions or anything else to C.
>
> And, as it turns out, that's doable. I can write a PyCode_Assemble function that's used by the compiler and by Python code without having to add a single other new thing to the C API.

Currently, Python/compile.c uses specific C structures:

* struct instr: opcode, oparg, ... a jump target is pointer to a basicblock
* struct basicblock: list of instructions, ...
* struct fblockinfo
* struct compiler_unit: list of constants, list of names, blocks, etc.
* struct compiler: filename, compiler_unit, ...
* ...

Your proposal looks more like a flat list of instructions, it doesn't
fit well with the current code (blocks).

The structures contain many information which are specific to the
compiler, I'm not sure that it would make sense to put them in your
generic API. Or maybe you can rebuild the current structures on top of
your API.

My opinion on that is that it's not worth to modify Python/compile.c
and leave it unchanged.

>>> * Not mutable, and no assembler.
>>
>> I looked at Bytecode & Instruction objects of dis. They look nice to
>> "read" bytecode, but not to modify bytecode.
>>
>> dis.Instruction is not mutable and informations are duplicated.
>
> Which is exactly why I suggested the very alternative that you're replying to: tuples of (opcode, argval [, line [, ...]]) are trivial to build.

A tuple cannot be modified.

By mutable, I mean being able to replace an attribute without having
to create a new instruction:
  instr.arg = new_arg
instead of
  bytecode[index] = instr.replace_arg(arg)

In the first version in my bytecode project, I hesitated between
abstract instruction and concrete instruction. I wanted to put checks,
so I started with immutable instructions. But it's not really
convenient. I would prefer mutable instructions. I left concrete
instructions immutable, because arguments depend on a bytecode object.
For example, LOAD_CONST uses an index in a list of constants. And jump
targets depend on the exact size of other instructions. Maybe concrete
bytecode should be made mutable too. But it's not too hard to create a
new concrete instruction to replace an existing one.

> Here's an example of what a bytecode processor could look like:
>
>     for opcode, argval, *rest in instructions:

Hum, "*rest" doesn't look good to me. What is the exact size of an
instruction? (how many fields) What if we want to add a new field
later? Will it break existing code relying on the minimum/maximum
number of fields of an instruction?

> If you want to use the dis structures instead, you don't have to, but you can:
>
>     bc = dis.Bytecode(instructions)
>     for i, instr in enumerate(bc):

Yeah, this API looks better: a single object which contains all
information. It's more future-proof. (I just talking about the for
"instr in bytecode:" :-))

>> For
>> example, the operator is stored as name (LOAD_CONST) and code (100).
>> Argument is stored as int (1), value ("hello") and representation
>> ('"hello"'). It has no methods but attributes like is_jump_target.
>
> And, as I said, you only have to supply opcode, argval, and sometimes line. The other attributes are there for reading existing bytecode, but aren't needed for emitting it.

Hum, I understand that an instruction is a named tuple. So if I create
an instruction only with the opcode (ex: 100), the name field is not
set, right? Which fields are "mandatory"? Which fields are optional?

In my bytecode API, you provide a name, the opcode is computed from
the name. You can modify the name, opcode is updated. If you modify
opcode, name is updated. There are checks on lineno attributes (must
be an int >= 1, or None). ConcreteInstr has strict checks on the
argument.

>> dis.Instruction doesn't seem extensible to add new features.
>
> Why not? I added hasjrel to see how easy it is: there's one obvious way to do it, which took a few seconds, and it works exactly as I'd want it to. What kind of new features do you think would be difficult to add?

In bytecode 0.1, I have the following methods on Instr:

* format(labels)
* __repr__()
* __eq__(): smart comparison. For LOAD_CONST, it understands that -0.0
argument is different than +0.0 for example.
* is_jump()
* is_cond_jump()

ConcreteInstr has additional methods:

* assemble()
* ConcreteInstr.disassemble() (static method)
* get_jump_target(instr_offset)

About your hasjrel example: do you mean that you added a new field to
the namedtuple? Does the constructor of the instruction have to fill
this field manually? What if the field is not set?

>> Concrete bytecode & instructions is closer to what we already have in
>> the dis module.
>
> No it isn't. What we have in the dis module does _not_ have size; it's a flat sequence of instructions. If you've missed that, you probably need to go back and reread the proposal, because it doesn't really make sense if you think this is what it's suggesting.

The offset attribute doesn't seem revelant for an abstract
instruciton. If you remove an instruction before, the offset becomes
inconsistent.

I chose to not store the offset inside instructions, but recompute it
each time that I iterate on concrete instructions (offset +=
instr.size).

>> BytecodeBlocks is a "flat" control flow graph (CFG). It is required by
>> the peephole optimizer to not modify two instructions which are part
>> of two code paths (two blocks).
>
> Here we get to the core of the proposal.
>
> As I show in the linked blog post, it takes a handful of lines to go back and forth between the proposed format and a block-graph format. It's just as easy to go back and forth between having pseudo-instructions and not having them. Or any other format you come up with.

I saw your "def blockify(instructions):" function, but I don't
understand how do you store labels.

You use a "is_jump_target" attribute. If you remove the target of a
jump (an instruction with is_jump_target=True), I guess that you have
to mark the following instrution with is_jump_target=True, right? What
if the block only contains one instruction?

I identified a requirement when you manipulate jumps: being able to
"resolve jumps". From a jump, you want to know the target instruction.

With bytecode.BytecodeBlocks, you get the target block with:
"target_block = bytecode[jump.label]; target_instr = target_block[0]"
(with a complexity of O(1), bycode[label] gets an item of an list, it
uses a mapping label => block index to get the index.)

With bytecode.Bytecode (list of instructions), you have to iterate on
all iterations to search for the label. I can maybe optimize that
later to build an internal cache, updated when the list is modified.

I'm not sure that a label is the most convenient abstraction for
blocks. In CFG, a jump points directly to a subtree (the instruction
argument is directly the block), there is no indirection like my label
object.

In bytecode, you can also convert bytecode between the 3 formats
(concrete, bytecode, blocks), the 3 classes have 5 conversion methods:

* from_code()
* to_code()
* to_concrete_bytecode()
* to_bytecode()
* to_bytecode_blocks()

> So, "let's build yet another third-party assembler and disassembler with a different API" is not a competing solution to this proposal; it's part of the problem I'm trying to solve.

I wrote the bytecode project to try to implement you idea. It looks
like we don't want the same API :-)

>> Yeah, you should start with something simple but extensible. An API
>> generic enough to be usable as a low-level API by existing byteplay,
>> codetransformer, bytecode projects, and then build an higher-level API
>> on top of that. Or maybe I'm right and it's a bad idea :-)
>
> I don't understand the last sentence.

Sorry, I wanted to write "maybe I'm wrong and it's a bad idea".

Victor