compiler optimizations: collecting ideas
[This message has been posted to python-dev as well. Sorry for the duplication to those who read both.]

Hello everybody,

As part of an advanced compiler design course at our university (ETH Zurich), we have to implement an optimization (or various thereof). I've spoken with a couple of people who are, like me, very fascinated by Python. So I would just like to ask if somebody has any ideas on what we could do?

Our constraints:
- We are 4 people, each willing to potentially spend around 4-6 full days of work.
- We've been working on generating Control Flow Graphs, generating Static Single Assignment Form (inserting phi-nodes) and removing it again. We've also implemented techniques like constant folding, copy propagation, Common Subexpression Elimination, etc.

We're looking for:
- Hints as to which Python compiler would be best to work on. (The official Python compiler package seems very nice to work with, actually... [and so does PyPy.])
- Any requests in the area of optimization that we could have a look at.

Any feedback would be highly appreciated as well as pointers to specific bugs in the tracker and other relevant discussions.

Best regards,
Daniel
Hi Daniel, sorry for the late response. Daniel Furrer wrote:
As part of an advanced compiler design course at our university (ETH Zurich), we have to implement an optimization (or various thereof). I've spoken with a couple of people who are, like me, very fascinated by Python. So I would just like to ask if somebody has any ideas on what we could do?
Our constraints:
- We are 4 people, each willing to potentially spend around 4-6 full days of work.
- We've been working on generating Control Flow Graphs, generating Static Single Assignment Form (inserting phi-nodes) and removing it again. We've also implemented techniques like constant folding, copy propagation, Common Subexpression Elimination, etc.

We're looking for:
- Hints as to which Python compiler would be best to work on. (The official Python compiler package seems very nice to work with, actually... [and so does PyPy.])
- Any requests in the area of optimization that we could have a look at.
Any feedback would be highly appreciated as well as pointers to specific bugs in the tracker and other relevant discussions.
I guess the problem with "classical" compiler optimizations applied to a compiler producing Python bytecode is that most of them are potentially unsafe. E.g. you cannot do CSE in Python, because any expression can have arbitrary side-effects. Therefore the set of optimizations that can be applied at all is very limited.

PyPy's translation toolchain works in SSI as an intermediate representation (which is a sub-set of SSA). However, many of the straightforward optimizations (constant-folding, copy-propagation, inlining, a form of escape analysis) have already been implemented. Some things are not done yet, like common subexpression elimination. We didn't bother to implement them yet, because they are not as useful for PyPy since we target C, and most C compilers can do these things for us.

Cheers,

Carl Friedrich
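As a minimal illustration of the side-effect problem (a contrived example with an invented `Counter` class, purely for the sake of the argument): even a plain attribute read can run arbitrary user code, so reusing the result of the first evaluation would change the program's observable behaviour.

```python
class Counter:
    """Attribute access on this object has a visible side effect."""
    def __init__(self):
        self.calls = 0

    @property
    def value(self):
        self.calls += 1           # observable side effect on every read
        return self.calls

c = Counter()
# A classical CSE pass would rewrite "c.value + c.value" to evaluate the
# subexpression once and reuse it, but the two reads are not equivalent:
print(c.value + c.value)          # prints 3 (1 + 2), not 2 (1 + 1)
```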
Hi, first I'd like to state that I'm not part of the Zurich group, so what follows is just my 2 cents. On Mon, Nov 17, 2008 at 14:26, Carl Friedrich Bolz <cfbolz@gmx.de> wrote:
Hi Daniel,
Daniel Furrer wrote:
We're looking for:
- Hints as to which Python compiler would be best to work on. (The official Python compiler package seems very nice to work with, actually... [and so does PyPy.])
I guess the problem with "classical" compiler optimizations applied to a compiler producing Python bytecode is that most of them are potentially unsafe. E.g. you cannot do CSE in Python, because any expression can have arbitrary side-effects. Therefore the set of optimizations that can be applied at all is very limited.
I guess that is because an object can define a custom '+' operator, for instance, right? But aren't those optimizations possible anyway after specialization and inlining? And anyway, who said that CSE should be applied to the Python bytecode? In fact, I think both Java and .NET do it inside the VM.

Then, the next point is that you do not yet have a JIT. But still, I think it is possible to specialize bytecode to eliminate dynamic lookups and enable optimizations such as CSE. The benefit of specialized bytecode can be significant, I guess, only if the interpreter is really fast (either a threaded one, or a code-copying one). Is the PyPy interpreter threaded?
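A rough sketch of what "specializing bytecode to eliminate dynamic lookups" could mean; the opcode names and the structure are invented for illustration and do not correspond to CPython's or PyPy's actual bytecode.

```python
# Hypothetical quickening pass: once type feedback shows that a BINARY_ADD
# site always saw two ints, rewrite it to a specialized opcode whose fast
# path skips the generic __add__/__radd__ protocol, guarded by a cheap
# type check that de-specializes the site if it ever fails.

def quicken(code, observed_types):
    """Rewrite generic opcodes in place using recorded type feedback."""
    for pc, op in enumerate(code):
        if op == "BINARY_ADD" and observed_types.get(pc) == (int, int):
            code[pc] = "BINARY_ADD_INT"
    return code

def execute_add(code, pc, left, right):
    if code[pc] == "BINARY_ADD_INT":
        if type(left) is int and type(right) is int:   # guard
            return left + right                        # fast path
        code[pc] = "BINARY_ADD"                        # guard failed
    return left + right        # generic path: full dynamic dispatch

code = ["BINARY_ADD"]
quicken(code, {0: (int, int)})
print(execute_add(code, 0, 2, 3))   # 5, via the specialized opcode
```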
PyPy's translation toolchain works in SSI as an intermediate representation (which is a sub-set of SSA). However, many of the straightforward optimizations (constant-folding, copy-propagation, inlining, a form of escape analysis) have already been implemented. Some things are not done yet, like common subexpression elimination. We didn't bother to implement them yet, because they are not as useful for PyPy since we target C, and most C compilers can do these things for us.
-- Paolo Giarrusso
Paolo Giarrusso wrote:
specialized bytecode can be significant, I guess, only if the interpreter is really fast (either a threaded one, or a code-copying one). Is the PyPy interpreter threaded?
Some time ago I tried to measure if/how much we can gain with a threaded interpreter. I manually modified the produced C code to make the main loop threaded, but we didn't gain anything; I think there are three possible reasons:

1) in Python a lot of opcodes are quite complex and time-consuming, so the time spent to dispatch to them is a small percentage of the total time spent on the execution;

2) due to Python's semantics, it's not possible to just jump from one opcode to the next, as we need to do a lot of bookkeeping, like remembering what was the last line executed, etc. This means that the trampolines at the end of each opcode contain a lot of code duplication, leading to a bigger main loop, with possibly bad effects on the cache (I didn't measure this, though);

3) it's possible that I did something wrong, so in that case my measurements are completely useless :-). If anyone wants to try again, it cannot hurt.

ciao,
Anto
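For reference, here is a toy dispatch loop in Python (invented for illustration; PyPy's real main loop is quite different) showing where the central dispatch branch and the per-opcode bookkeeping that Antonio mentions would sit.

```python
def interpret(code, consts):
    """Toy stack-based interpreter: one central, hard-to-predict dispatch
    branch per opcode.  A threaded interpreter would instead place the jump
    to the next handler at the end of every handler."""
    stack, pc, last_line = [], 0, None
    while pc < len(code):
        op, arg = code[pc]
        last_line = pc      # stand-in for "remember the last executed line"
        if op == "LOAD_CONST":
            stack.append(consts[arg])
        elif op == "BINARY_ADD":
            right, left = stack.pop(), stack.pop()
            stack.append(left + right)   # may invoke arbitrary __add__ code
        elif op == "RETURN_VALUE":
            return stack.pop()
        pc += 1

program = [("LOAD_CONST", 0), ("LOAD_CONST", 1),
           ("BINARY_ADD", None), ("RETURN_VALUE", None)]
print(interpret(program, [2, 3]))   # 5
```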
On Mon, Nov 17, 2008 at 15:05, Antonio Cuni <anto.cuni@gmail.com> wrote:
Paolo Giarrusso wrote:
specialized bytecode can be significant, I guess, only if the interpreter is really fast (either a threaded one, or a code-copying one). Is the PyPy interpreter threaded?
Some time ago I tried to measure if/how much we can gain with a threaded interpreter. I manually modified the produced C code to make the main loop threaded, but we didn't gain anything; I think there are three possible reasons:
1) in Python a lot of opcodes are quite complex and time-consuming, so the time spent to dispatch to them is a small percentage of the total time spent on the execution

That's something difficult to believe, I think. Well, it is possible to profile the execution and count mispredictions, to avoid having to guess.
Well, since arithmetic ops may involve a method lookup, I understand what you mean. OTOH, a method lookup means just one dictionary lookup when first seeing the bytecode (actually, I would save the hash code inline in the bytecode after the first execution, to speed that up), and an unpredictable indirect branch; making the dispatch branch more predictable by threading should still have a significant impact.
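A sketch of the kind of per-call-site caching hinted at here (names and structure invented): the first execution pays for the full lookup, later executions reuse the cached result as long as the receiver's type is the one seen before.

```python
class CallSiteCache:
    """Monomorphic inline cache for one attribute-lookup site."""
    def __init__(self, name):
        self.name = name
        self.cached_type = None
        self.cached_attr = None

    def lookup(self, obj):
        if type(obj) is self.cached_type:          # cheap guard
            return self.cached_attr                # hit: no dict lookup
        attr = getattr(type(obj), self.name)       # miss: full lookup
        self.cached_type, self.cached_attr = type(obj), attr
        return attr

site = CallSiteCache("upper")
for s in ["a", "b", "c"]:
    print(site.lookup(s)(s))    # only the first iteration does a real lookup
```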
2) due to Python's semantics, it's not possible to just jump from one opcode to the next, as we need to do a lot of bookkeeping, like remembering what was the last line executed, etc.
This is a more likely culprit. But... all languages that I know of (Java, compiled languages) just keep a reverse map from bytecode positions to the original line information, to be used when and if it is needed (i.e. as debug info, or to unwind the stack when an exception is thrown, I guess). Isn't that possible for Python? In any
This means that the trampolines at the end of each opcode contain a lot of code duplication, leading to a bigger main loop, with possibly bad effects on the cache (I didn't measure this, though)
3) it's possible that I did something wrong, so in that case my measurements are completely useless :-). If anyone wants to try again, it cannot hurt.
Do you have the original code somewhere? I know one would have to redo it anyway, since that's generated code, but guidance would still be helpful. -- Paolo Giarrusso
Sorry, I sent the unfinished mail by mistake.
2) due to Python's semantics, it's not possible to just jump from one opcode to the next, as we need to do a lot of bookkeeping, like remembering what was the last line executed, etc.
This is a more likely culprit. But... all languages that I know of (Java, compiled languages) just keep a reverse map from bytecode positions to the original line information, to be used when and if it is needed (i.e. as debug info, or to unwind the stack when an exception is thrown, I guess). Isn't that possible for Python?
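A sketch of such a reverse map (the table layout is invented, not CPython's actual format): line information lives in a side table that is only consulted when a traceback or a debugger needs it, so the main loop never pays for it.

```python
import bisect

# (bytecode_offset, source_line) pairs, one entry per offset where a new
# source line starts; built once by the compiler, ignored during execution.
LINE_TABLE = [(0, 10), (6, 11), (14, 13)]

def line_for_offset(offset, table=LINE_TABLE):
    starts = [off for off, _line in table]
    return table[bisect.bisect_right(starts, offset) - 1][1]

assert line_for_offset(0) == 10
assert line_for_offset(9) == 11    # offsets 6..13 belong to line 11
assert line_for_offset(20) == 13
```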
Additionally, could you give some more examples?
In any... [this was the unfinished sentence] In any case, have you read "The Structure and Performance of Efficient Interpreters" by M. Anton Ertl and David Gregg? It's a paper that really matters in practice for fast interpreters (and if you don't believe me, believe http://webkit.org/blog/189/announcing-squirrelfish/).
Basically, what it says is that in efficient interpreters, most of the runtime cost is caused by indirect branches (they go as far as taking that as the definition of an "efficient interpreter"). And that applies also to Scheme which, I think, is no less dynamic than Python: any arithmetic instruction must still check the argument types and choose between the fast path and throwing an error.

My first impression is that the problem is that the interpreter is not yet optimized enough, and that comes from various hints:

- You do bookkeeping for the last executed line (unless tables prove impossible to apply).
- We discussed removing dictionary lookups for attributes a little while ago, and one of the PyPy developers said "it doesn't really matter, an interpreter is so slow anyway". This is the original quote from Armin Rigo (http://codespeak.net/pipermail/pypy-dev/2008q4/004862.html): "I don't know how much speed-ups this would give; in fact it is rather unclear to me why people think that dictionary lookups are seriously bad. Of course they are more expensive than just reading an item out of a list, but I don't think it's worse than roughly 3 times slower; this number looks big in the context of machine code, but insignificant in the context of a complicated bytecode interpreter. I'm ready to be proven wrong, of course."

And anyway, Armin himself suggested there is room for optimizing the interpreter.

-- Paolo Giarrusso
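For what it's worth, Armin's "roughly 3 times slower" estimate is easy to check; a quick and dirty measurement sketch (results will obviously vary between CPython and PyPy versions and hardware):

```python
import timeit

setup = "d = {'attr%d' % i: i for i in range(16)}; lst = list(range(16))"
dict_time = timeit.timeit("d['attr7']", setup=setup, number=10**6)
list_time = timeit.timeit("lst[7]", setup=setup, number=10**6)
print("dict lookup / list indexing: %.2fx" % (dict_time / list_time))
```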
Thanks for your answers. On Mon, Nov 17, 2008 at 2:26 PM, Carl Friedrich Bolz <cfbolz@gmx.de> wrote:
I guess the problem with "classical" compiler optimizations applied to a compiler producing Python bytecode is that most of them are potentially unsafe. E.g. you cannot do CSE in Python, because any expression can have arbitrary side-effects. Therefore the set of optimizations that can be applied at all is very limited.
Well, the side-effects are a problem, so we cannot do any method invocations (unless we could check that the methods are pure). Thinking about this: it's not even easily possible to optimize the subset of numerical operations, because we don't have any static type information. Whoops...
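A tiny example of what the missing type information means in practice (a made-up function, purely illustrative): the very same expression can mean integer arithmetic, string building, or anything a user class decides, so there is nothing for the compiler to fold.

```python
def f(x):
    return x * 2 + x * 2      # looks numeric, but isn't necessarily

print(f(3))        # 12            (int arithmetic)
print(f("ab"))     # 'abababab'    (repetition and concatenation)
print(f([0]))      # [0, 0, 0, 0]
# A user-defined class can make * and + do anything at all, so without type
# information the compiler cannot even assume the result's type.
```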
PyPy's translation toolchain works in SSI as an intermediate representation (which is a sub-set of SSA). However, many of the straightforward optimizations (constant-folding, copy-propagation, inlining, a form of escape analysis) have already been implemented. Some things are not done yet, like common subexpression elimination. We didn't bother to implement them yet, because they are not as useful for PyPy since we target C, and most C compilers can do these things for us.
That's true. But I would assume they do constant folding and copy propagation as well then, don't they? (How about PRE?) So just out of interest: In which area would you start optimizing PyPy? cheers, Daniel
On Mon, Nov 17, 2008 at 16:04, Daniel Furrer <daniel.furrer@gmail.com> wrote:
Thanks for your answers.
On Mon, Nov 17, 2008 at 2:26 PM, Carl Friedrich Bolz <cfbolz@gmx.de> wrote:
I guess the problem with "classical" compiler optimizations applied to a compiler producing Python bytecode is that most of them are potentially unsafe. E.g. you cannot do CSE in Python, because any expression can have arbitrary side-effects. Therefore the set of optimizations that can be applied at all is very limited.
Well, the side-effects are a problem, so we cannot do any method invocations (unless we could check that the methods are pure). Thinking about this: it's not even easily possible to optimize the subset of numerical operations, because we don't have any static type information. Whoops...
Indeed, existing work solved this problem in the VM by means of runtime specialization: you generate a version of the method for the commonly used argument types, so that within each specialized version you have static type information. There is a lot of literature from the late '80s and early '90s about this, mainly for Smalltalk, but also for Self. Some recommended papers on this are:

* "A Survey of Adaptive Optimization in Virtual Machines", Matthew Arnold, Stephen J. Fink, David Grove, Michael Hind, and Peter F. Sweeney, IEEE.
* "An Efficient Implementation of SELF, a Dynamically-Typed Object-Oriented Language Based on Prototypes", Craig Chambers, David Ungar, and Elgin Lee, 1991 (a specific discussion of code specialization, relevant for languages such as JavaScript and Python).

Regards
-- Paolo Giarrusso
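A toy sketch of the runtime-specialization idea from those papers (all names invented; a real VM would generate the specialized code from type feedback rather than pick a canned variant): one optimized version per observed argument-type tuple, selected by a cheap type check, with a generic fallback.

```python
specialized_versions = {}

def generic_mul_add(x, y):
    return x * y + y              # full dynamic protocol, works for anything

def int_mul_add(x, y):
    return x * y + y              # stands in for an int-only fast path

def call_specialized(x, y):
    key = (type(x), type(y))
    fn = specialized_versions.get(key)
    if fn is None:
        # where a real VM would compile a specialized version for `key`
        fn = int_mul_add if key == (int, int) else generic_mul_add
        specialized_versions[key] = fn
    return fn(x, y)

print(call_specialized(3, 4))       # 16, via the int "fast path"
print(call_specialized(1.5, 2.0))   # 5.0, via the generic fallback
```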
participants (4)

- Antonio Cuni
- Carl Friedrich Bolz
- Daniel Furrer
- Paolo Giarrusso