[Python-ideas] Incorporating something like byteplay into the stdlib

Andrew Barnert abarnert at yahoo.com
Fri Feb 12 14:16:16 EST 2016


On Feb 12, 2016, at 04:05, Victor Stinner <victor.stinner at gmail.com> wrote:
> 
> Hi,
> 
> 2016-02-12 4:58 GMT+01:00 Andrew Barnert via Python-ideas
> <python-ideas at python.org>:
>> tl;dr: We should turn dis.Bytecode into a builtin mutable structure similar to byteplay.Code, to make PEP 511 bytecode transformers implementable.
> 
> Hum, it looks like your email is highly coupled to PEP 511.

Very much so. The basic idea could be done without PEP 511, but there wouldn't be any urgency in doing it if the only client is the peephole optimizer...

> First of all, I really want to support bytecode transformers because I
> would like to be able to disable the peephole optimizer. Having the
> peephole optimizer registered as a code transformer, like the AST
> transformers, makes the whole PEP more consistent. There is no longer
> a special case for the peephole optimizer.

OK. But if the peephole optimizer will almost always be the only bytecode processor present, with everything else being AST processors, then it already is special, and making it more generic in a way that still doesn't encompass anything else doesn't change that; it just adds complexity.

Personally, I think there probably are useful optimizations that can only be done by bytecode. But the fact that nobody has come up with any compelling examples (while there are lots of compelling examples for AST processors) might imply that we should put it off until someone needs it.

>> Why?
>> 
>> ----
>> 
>> The first problem is that the API for bytecode transformers is incomplete. You don't get, e.g., varnames, so you can't even write a transformer that looks up or modifies locals. But that part is easy to fix, so I won't dwell on it.
> 
> I agree that we can enhance the Python stdlib to ease manipulation of
> bytecode, but I disagree that it's a requirement. It's ok to use an
> external library (like byteplay) for that.

You're either answering out of order, or you missed the point of this paragraph. Your API passes only the bytecode string, consts, lnotab, and global names. There is no way to write a bytecode processor that looks up locals with that interface, because you didn't pass the locals.
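
To make it concrete, here's a minimal sketch (the parameter list is my paraphrase of the draft, not the exact signature):

    # Sketch only: a transformer handed just these four pieces.
    def transformer(bytecode, consts, names, lnotab):
        # LOAD_FAST/STORE_FAST args are indices into co_varnames, but
        # co_varnames was never passed in, so there is no way to resolve,
        # rename, or add a local from inside this function.
        return bytecode, consts, names, lnotab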

As I said, that's easy to fix--we could pass all 12 or so necessary parameters, or wrap them (along with the 5 or so unnecessary ones) in a code object one step earlier in the process and pass that. But that just gets us to the second problem, which is harder to fix.

>> The second problem is that bytecode is just painful to work with. The peephole optimizer deals with this by just punting and returning the original code whenever it sees anything remotely complicated (which I don't think we want to encourage for all bytecode transformers), and it's _still_ pretty hairy code. And it's the kind of code that can easily harbor hard-to-spot and harder-to-debug bugs (line numbers occasionally off on tracebacks, segfaults on about 1/255 programs that do something uncommon, that kind of fun stuff).
> 
> Sorry, I don't understand. Are you writing that the CPython peephole
> optimizer produces invalid code?

No. As I said, the peephole optimizer deals with this by punting and returning the original code whenever things get remotely complicated (e.g., long functions).

I don't think we want third-party optimizers to similarly punt on anything complicated. So, they _will_ have bugs like this, unless there's some way of making it tractable for them.

>> The compiler is already doing this work. There's no reason every bytecode processor should have to repeat all of it. I'm not so worried about performance here--technically, fixup is worst-case quadratic, but practically I doubt we spend much time on it--but simplicity. Why should everyone have to repeat dozens of lines of complicated code to write a simple 10-line transformer?
> 
> Hum, are you talking about the API proposed in PEP 511? I understand that you are saying the API only takes a whole code
> object as input and produces a code object as output. An optimizer
> usually needs a different structure to be able to modify the code. If
> you have multiple bytecode transformers, you have to repeat these
> "disassemble" and "assemble" steps, right?

Well, the API currently doesn't even take a whole code object--but other than that, yes, that's exactly what I'm saying.

> I don't think that we will have plenty of bytecode optimizers in the
> wild. Even if two or three major bytecode optimizers become popular,
> are you sure that we will want to combine them? I expect that a single
> optimizer implements *all* optimizations. I don't see the point of
> running multiple optimizers to implement multiple optimization steps.
> For example, my fatoptimizer AST optimizer implements multiple steps,
> but *internally*. It is only called once on the AST.

You seem to be focusing on the performance issue here, while I think that's a minor consideration at best. The problem (probably) isn't that it's too costly for the CPU to run the disassembly and assembly code multiple times, but that it's too costly for the Python community to write and maintain disassembly and assembly code multiple times.

I suppose if you're expecting that everyone experimenting with a new optimizer will do so by writing it as a patch to one of the three major optimizer projects rather than a standalone project, that might still solve much of the real problem. But do you really think (a) that's the way things should go, and (b) that's the way things probably _will_ go?

> I don't think that performance of importing modules really matters. My
> PEP 511 is mostly written for compilation ahead of time, to support
> complex and expensive optimizers.
> 
> The real problem is to run a script: "python script.py" always has to
> execute all code transformers. For scripts, I hesitate to simply
> disable expensive optimizers, or maybe even disable all optimizers.
> For example, a script can run in less than 50 ms: is it worth spending
> 10 ms to optimize it for a speedup of 1 ms? (no)

I didn't think the performance of optimizers would matter _at all_, which is why I was most focused on the simplicity and DRY argument. If it really is an issue for startup time for small scripts, then repeating the same disassemble/assemble steps multiple times may make that issue worse, but I don't think we'd know that without profiling. So I still think this is probably the least important of my arguments for the idea.

> The problem of using a specific format for bytecode rather than a code
> object is that we will have to maintain it.

We're already maintaining such a format in the stdlib today, and have been for years: the dis module.

The problem is that it's an immutable format, and has only a disassembler but no assembler, and no way to pass code into the assembler we already have under the covers. My proposal is essentially just to fix that.

Everything else is just gravy. For example, for very little extra work, we could make the mutable dis module replace _all_ need for byteplay, so I think we should do that very little extra work--but if we don't, the core idea is still useful on its own.
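
Concretely, the workflow I'm picturing is something like this (a rough sketch--everything up to the assemble step already works today, and the assemble call is the hypothetical missing piece):

    import dis

    def optimize(code):
        instructions = list(dis.get_instructions(code))  # works today
        # ... mutate the instruction list here ...
        # return dis.assemble(instructions, code)   # hypothetical--no
        return code                                 # public assembler yet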

> I'm not sure that all
> bytecode optimizers want the same internal structures. For some kinds
> of optimizations, a sequential list of instructions is enough. For
> some other optimizations, you need to split blocks of code to have a
> representation of the exact "control flow".

Even then, it's far easier to build a tree of blocks from the dis structure than from the bytecode-and-lnotab-and-arrays structure--and even more so to do the assembly and fixups if you're just assembling the dis structure rather than the bytecode-and-etc. structure. (And, if optimizer performance is really an issue--again, I doubt it is, but just in case--it should also be much less expensive to chain your third-party optimizer and the peephole optimizer if you don't need to do the fixups in between.)
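
To make that concrete, here's roughly all it would take to split a function into basic blocks on top of today's dis module (a sketch--a real CFG builder would also want to handle exception handlers, but the shape is the same):

    import dis

    def basic_blocks(code):
        # A new block starts at every jump target; every jump or
        # return ends the current block.
        enders = set(dis.hasjabs) | set(dis.hasjrel) | {dis.opmap['RETURN_VALUE']}
        blocks, current = [], []
        for instr in dis.get_instructions(code):
            if instr.is_jump_target and current:
                blocks.append(current)
                current = []
            current.append(instr)
            if instr.opcode in enders:
                blocks.append(current)
                current = []
        if current:
            blocks.append(current)
        return blocks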

> Last point, PEP 511 has to take into account the existing peephole
> optimizer implemented in C. If you really want to use a different
> structure, you will have to reimplement the peephole optimizer with
> your new API.

Yes. But that's more part of the motivation than a problem. From my exploratory steps, rewriting the peephole optimizer around the dis API is less work than almost any nontrivial change to the peephole optimizer as-is (far less work than changing it to understand wordcode, which I was playing with last weekend). 

Also, I'm curious if there would be any measurable performance benefit from the peephole optimizer not punting on large functions. 

No huge benefits here, but it's worth doing as long as someone's willing to do the work--and I am willing, unless Serhiy (who definitely knows CPython better than me, and seems just as motivated here) wants to do it first.

> Since my target is AST, I'm not really interested by
> that :-)

And that brings us back to the alternative idea of just not dealing with bytecode in PEP 511.

>> I played around with a few possibilities from fixed-width bytecode and uncompressed lnotab to a public version of the internal assembler structs, but I think the best one is a flat sequence of instructions, with pseudo-instructions for labels and line numbers, and jump targets just references to those label instructions.
> 
> Could you try to reimplement the whole peephole optimizer to see if it
> benefits from your design?

Yes, if I work on this over the weekend, reimplementing the peephole optimizer will definitely be part of the PoC. After all, we don't have a large code base of other bytecode optimizers used in the wild to play with yet. :)
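
To give a flavor of the format I'm describing (all names invented for illustration--think byteplay, not a final API):

    # Labels and line numbers are pseudo-instructions sitting in the
    # same flat list; jumps refer to label objects rather than byte
    # offsets, so there are no offsets to fix up after editing.
    class Label: pass

    else_branch = Label()
    instructions = [
        ('SetLineno', 1),
        ('LOAD_FAST', 'x'),
        ('POP_JUMP_IF_FALSE', else_branch),  # a reference, not an offset
        ('LOAD_CONST', 42),
        ('RETURN_VALUE', None),
        (else_branch, None),                 # the label is just another entry
        ('LOAD_CONST', None),
        ('RETURN_VALUE', None),
    ]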

> I played with bytecode in the past. In the end, I started to implement
> optimizations which can be implemented more simply at the AST level.
> 
> Why do you prefer bytecode over AST?

I don't prefer bytecode over AST. As I've said multiple times, including in the same email you're responding to, I think 90% of the optimizations you could do in bytecode could instead be done, usually more simply (even with something like byteplay), at the AST level. The only question is whether that last 10% matters.

> Your example of converting
> globals to constants became trivial to implement using my new
> ast.Constant node.

Imagine you hadn't thought of that optimization (not likely, since it's the first example everyone uses, but you know what I mean), and we'd released Python 3.6 with PEP 511 and without ast.Constant, and now I thought of it. How would I implement it in a third-party optimizer? I can't patch the compiler to add a new AST node. (Nobody's going to install my optimizer if the steps are "get the CPython source, apply this patch, rebuild with the same config as your existing compiler, ...") And there's no way to implement it at the AST level with only the existing nodes. So I'd have to write it as a bytecode transformer.
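
For what it's worth, the analysis half of that transformer is already easy (a sketch, with 'len' standing in for whatever globals you decide are safe to freeze); it's the rewriting--renumbering offsets, patching jump targets, rebuilding the lnotab--that's painful without an assembler, which is exactly my point:

    import dis

    def freezable_globals(code, safe=frozenset(['len'])):
        # Find the LOAD_GLOBALs that could become LOAD_CONSTs; actually
        # rewriting them is the part that needs an assembler.
        return [instr for instr in dis.get_instructions(code)
                if instr.opname == 'LOAD_GLOBAL' and instr.argval in safe]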

Of course once I released that to PyPI and everyone loved it and we had benchmarks to back it up, then I could come back and say, "We should add an ast.Constant node in Python 3.7, because that would let me rewrite my global optimizer as a simpler AST processor." And a few years later Python 3.6 would stop being relevant enough to keep maintaining the bytecode version.

Meanwhile, here's a completely different example: the Python compiler emits a lot of jumps to jump instructions. The peephole optimizer takes care of some, but not all, of these. Maybe with an ast.Goto node, or extra fields in the various for/try/etc. nodes that don't get generated by the parser but can be added by an AST processor, etc., they could all be handled at the AST level--but I'm not sure about that, and I think it would be significantly more complicated than doing it at the bytecode level. So, it's still one of the 10% cases.
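
Detecting those chains is easy with today's dis module--here's a sketch (unconditional jumps only, for brevity); threading them is, once again, the reassembly problem:

    import dis

    def jump_to_jump_chains(code):
        # For jump instructions, dis resolves argval to the absolute
        # target offset, so chains of jumps are easy to spot.
        jumps = ('JUMP_ABSOLUTE', 'JUMP_FORWARD')
        by_offset = {i.offset: i for i in dis.get_instructions(code)}
        return [(i.offset, i.argval)
                for i in by_offset.values()
                if i.opname in jumps and by_offset[i.argval].opname in jumps]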

Again, I think it's possible that the Python community could live without that 10%. If the globals idea had to wait a year and a half for a new AST node to be added, and the double-jump fixes in the existing peephole optimizer were all we ever had, I think CPython would do just fine. (In fact, I think the whole mania for micro-optimizing CPython is mostly misdirected energy in the first place... But I could be wrong about that, and if other people think it's useful, I think it's fun.:)) Which is why I think maybe removing bytecode processors from PEP 511 is a real alternative. It's only if we think we _do_ need bytecode processors that we need a way to write them.

>> * It needs a C API, and probably a C implementation.
> 
> I don't like extending the Python C API, it is already very large, we
> have too many functions :-p
> 
> A C API is more expensive to maintain than a Python API.

Well, if PEP 511 only supports optimizers written in Python (or written in C, but using the Python API), then the only real need for the C API is the peephole optimizer. So (following my own advice about not overgeneralizing a special case, which I should have thought of before...), I think you're right here. Leave the dis API as pure Python, and keep anything that has to be done in C private to the compiler (maybe shared with the peephole optimizer if necessary, but not exposed as a public and documented C API).

So, scratch that line--which makes the proposed change a lot smaller.

> I would prefer to continue to play with an external module (hosted on
> PyPI) to not pay the price of maintenance!

The dis module is already in the stdlib. The byteplay module is 8 years old in its current form, even older in its original form, and it's become abundantly clear that the only problem with maintaining byteplay is syncing up with changes in dis and the rest of CPython. (The dis module, on the other hand, actually _does_ change, far more often and more substantially than byteplay, and it fits into the CPython release schedule pretty nicely.)

> By the way, what is the final goal? Do you plan to implement a new
> ultra optimized bytecode optimizer? If yes, do you plan to integrate
> it into CPython? If no, I don't think that we have to pay the price of
> maintenance for such a "toy" project.

I don't think it's a "toy" project any more than the existing dis module is.

At any rate, the end goals are, in rapidly-descending order of importance:

 1. Make it feasible for people to write bytecode transformers.
 2. Encourage experimentation by allowing people to move mostly the same code between decorators, PEP 511 optimizers, and even code built into CPython.
 3. Make the compiler a little simpler, and the peephole optimizer a lot simpler.
 4. Possibly improve the performance of the optimization step, and possibly improve the performance benefits of the peephole optimizer (by having it not punt on complicated functions).
 5. Enable experimentation on low-level parts of CPython (e.g., the only hard part of the wordcode patch in 3.5 is the peephole optimizer--if that went away, we could have more experiments in that vein).
 6. Remove the maintenance headaches for byteplay and modules that depend on it.

As I said, the importance decreases rapidly, and if we decide that bytecode optimizers aren't that important, and leave them out of PEP 511, that pretty much knocks out the top 2, which I think is more than enough to kill my proposal.

>> Anyway, I realize this API is still a little vague, (...)
> 
> It doesn't fit the requirements to put something into the Python
> stdlib. Usually, we experiment with stuff on PyPI, wait until it
> becomes mature, and then propose to integrate it.
> 
> It looks like you are talking about creating a new API and directly
> putting it into the stdlib, right?

I'm talking about making the smallest possible changes to the _existing_ dis API that's _already_ in the stdlib, and basing those changes on a very mature third-party library that's long since defeated all of its competition.

Meanwhile, "just put it on PyPI" isn't an option. A third-party module could monkeypatch dis, but it can't change the order in which the builtin compiler does the assembly, optimizer, and fixup steps, or change the interface to PEP 511 processors and the peephole optimizer.

As I said, there _is_ the alternative of making a smaller change (like handing complete code objects to processors and then recommending byteplay), which solves not all but some of the problems for less work, and there's also the alternative of just dropping the bytecode processor API, which means most of the problems don't need to be solved, for no work.

>> We could just pass code objects (or all the separate pieces, instead of some of them), and then the docs could suggest using byteplay for non-trivial bytecode transformers, and then everyone will just end up using byteplay.
> 
> Again, you need to elaborate your rationale. What are your use cases?
> Which kind of optimizations do you want to implement?

Any kind of optimizations that need to work on bytecode. My rationale for that is that PEP 511 has a rationale for such optimizations. If that rationale isn't good enough, PEP 511 should only handle AST processors, in which case most of the problem I'm trying to solve won't exist in the first place.

>> So, what's wrong with that? The biggest problem is that, after each new Python release, anyone using a bytecode transformer will have to wait until byteplay is updated before they can update Python.
> 
> Why not contribute to byteplay to support the next CPython release?

I've done that before. So have other people. The problem is that someone has to notice that something about CPython bytecode and/or the dis module has changed, and figure out how byteplay needs to change to handle that, and then figure out how to make the change backward-compatibly. If it were integrated in the dis module, then every change that anyone makes would automatically be handled, because you already have to update the dis module (which is almost always very simple--and would continue to be so). And it would be the person who best understands the change making the update, instead of someone who just discovered that 3.7 makes his code raise a SystemError if run through a bytecode processor that he's been using for the last two years.

Essentially, integrating this into the existing dis module means we track changes for free, and it's hard to beat free. 

> I don't understand your problem here.

I think that's because you've never tried to write a non-trivial bytecode processor.

>> Or we could just remove bytecode transformers from PEP 511. PEP 511 still seems worth doing to me, even if it only has AST transformers, especially since all or nearly all of the examples anyone's come up with for it are implementable (and easier to implement) at the AST level.
> 
> I want to plug the existing peephole optimizer into my PEP 511 since
> it is an obvious *code* transformer. Even if its changes are minor,
> some users want to disable it because it really changes the code. It
> would help code coverage, for example.

If it really is a special case that needs to be handled, don't overgeneralize it to cover other cases that you're never going to need.

I think we might want those other cases. If so, we need to make them writable. If not, we shouldn't enable them half-way.

