[pypy-dev] Threaded interpretation (was: Re: compiler optimizations: collecting ideas)
Carl Friedrich Bolz
cfbolz at gmx.de
Sun Jan 4 11:19:13 CET 2009
Happy new year!

I am getting back to one of the points that Anto made and which you
dismissed. I still think it is the central one for explaining why we
don't think that threaded interpretation is the panacea you think it is.

Paolo Giarrusso wrote:
> On Fri, Jan 2, 2009 at 14:23, Armin Rigo <arigo at tunes.org> wrote:
>> On Thu, Dec 25, 2008 at 12:42:18AM +0100, Paolo Giarrusso wrote:
>>> If I want to try something without refcounting, I guess I'd
>>> turn to PyPy, but don't hold your breath for that. The facts that
>>> indirect threading didn't work, that you're 1.5-2x slower than
>>> CPython, and that you store locals in frame objects all
>>> show that the abstraction overhead of the interpreter is too high.
>> True, but a 1.5x slowdown is not a big deal for many applications;
>> the blocker is mostly elsewhere.
> Let's say 1.5x * 2x = 3x, since CPython is not as fast as it could
> be, because of refcounting for instance. The 2x is taken from PyVM
> performance reports (see http://www.python.org/dev/implementations/).
> And for PyPy a slower interpreter means slower JIT output, doesn't
> it? See below.
> Also, according to the CS literature, interpreter performance makes
> more difference than a JIT. According to the paper about efficient
> interpreters that I keep mentioning, an inefficient interpreter
> is 1000x slower than C, while an efficient one is only 10x slower.
> Python is not that slow, but what you wrote about Psyco seems to
> imply that there is still lots of room for improvement.
I finally looked at the paper. I think that one of the basic premises of
the paper does not apply to Python. This is the premise that a VM
instruction itself is a simple operation. For example the paper states:
"Most VM instructions are quite simple (e.g., add the top two numbers on
the stack), and can be implemented in a few native instructions plus
dispatch code. Complex VM instructions occur relatively rarely, and the
setup for them typically takes many simple VM instructions; [...]"
This is just nonsense in Python, and indeed in many other dynamic
languages as well. Take a simple fibonacci function:
>>> def fibo(n):
...     if n <= 2:
...         return 1
...     return fibo(n - 1) + fibo(n - 2)
...
>>> import dis
>>> dis.dis(fibo)
  2           0 LOAD_FAST                0 (n)
              3 LOAD_CONST               1 (2)
              6 COMPARE_OP               1 (<=)
              9 JUMP_IF_FALSE            8 (to 20)
             12 POP_TOP

  3          13 LOAD_CONST               2 (1)
             16 RETURN_VALUE
             17 JUMP_FORWARD             1 (to 21)
        >>   20 POP_TOP

  4     >>   21 LOAD_GLOBAL              0 (fibo)
             24 LOAD_FAST                0 (n)
             27 LOAD_CONST               2 (1)
             30 BINARY_SUBTRACT
             31 CALL_FUNCTION            1
             34 LOAD_GLOBAL              0 (fibo)
             37 LOAD_FAST                0 (n)
             40 LOAD_CONST               1 (2)
             43 BINARY_SUBTRACT
             44 CALL_FUNCTION            1
             47 BINARY_ADD
             48 RETURN_VALUE
Of the opcodes present here, some are indeed simple (LOAD_CONST,
JUMP_FORWARD, etc.). However, most of them perform potentially complex
operations, and a disproportionate amount of time is spent in them. E.g.
COMPARE_OP, BINARY_SUBTRACT, BINARY_ADD and CALL_FUNCTION
are implemented using special methods, so they are essentially method
calls, which need a full method lookup (which can take a lot of time,
given complex class hierarchies). The method can then itself be
implemented in Python, which leads to even more overhead (yes, one could
try to speed up common cases like binary operations on integers, but
those turn out not to be all that common in normal Python code).
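As a rough sketch of what I mean, this is approximately what BINARY_ADD has to do (simplified: it ignores __radd__, subclass priority and the C-level slots; the helper name binary_add is mine, not CPython's):

```python
class Vec:
    def __init__(self, x, y):
        self.x, self.y = x, y

    def __add__(self, other):
        # BINARY_ADD on two Vec objects ends up executing this
        # Python-level method; bytecode dispatch is the least of it.
        return Vec(self.x + other.x, self.y + other.y)


def binary_add(a, b):
    # Simplified BINARY_ADD: look up __add__ on the *type*,
    # walking the MRO (real CPython also tries __radd__ and
    # goes through C-level type slots).
    for cls in type(a).__mro__:
        if "__add__" in cls.__dict__:
            return cls.__dict__["__add__"](a, b)
    raise TypeError("unsupported operand types")
```

Even this simplified version does a type lookup and a possibly-Python-level call per opcode, which dwarfs the cost of the dispatch itself.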
Even a seemingly harmless opcode like LOAD_GLOBAL is rather complex to
implement: first you need to check the global dictionary, and then
potentially the builtins dictionary. I guess that to be really
conclusive one would have to measure which bytecodes take the most time
at runtime. However, there is some evidence from PyPy that what I said
is right: implementing a method cache gave PyPy much larger speedups
than any of the experiments we ever did for speeding up dispatching.
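The LOAD_GLOBAL semantics I mean are roughly the following two-level lookup (the helper name load_global is mine; the real opcode works on the frame's globals and builtins dictionaries):

```python
_MISSING = object()  # sentinel so that stored None values still count as hits

def load_global(name, globals_dict, builtins_dict):
    # First check the module's global dictionary...
    value = globals_dict.get(name, _MISSING)
    if value is not _MISSING:
        return value
    # ...then, potentially, the builtins dictionary.
    value = builtins_dict.get(name, _MISSING)
    if value is not _MISSING:
        return value
    raise NameError("name %r is not defined" % name)
```

So every LOAD_GLOBAL pays for up to two dictionary lookups, independently of how the opcode was dispatched.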
For these reasons I argue that the paper you keep citing vastly
exaggerates the effect of bytecode dispatch for Python, and I would
guess also for other dynamic languages. I am entering pure speculation
here, but I could guess that the reason the Perl interpreter appears so
inefficient in the paper is because it has similar characteristics.
Being a bit mean, the paper seems to be a bit of a self-fulfilling
prophecy: "Efficient interpreters are those that have a large number of
indirect branches. Therefore, efficient interpreters will be helped by
our techniques. If our techniques don't help, well, then the interpreter
just wasn't efficient enough in the first place."
Note that I am not saying that techniques like a threaded interpreter
loop cannot lead to a faster Python interpreter. I am pretty sure that
they can help. My main point is that they might not help as much as you
seem to hope, and that one needs additional things to get really great
performance results (and by "great" I mean basically anything that is
more than an order of magnitude better).