performance problems with Krakatau

Hi, I have been trying to use Pypy to speed up the Krakatau decompiler (https://github.com/Storyyeller/Krakatau). It is a large, pure Python application with several compute intensive parts, so I thought it would work well. Unfortunately, there is no clear speedup, and Pypy requires several times as much memory as well, making it unusable for larger inputs.
For example, decompiling a quarter of ASM, I got the following results (execution time, memory usage):
cpython 64 - 62.5s, 102.6kb
cpython 32 - 69.2s, 54.5kb
pypy 2.1.0 - 106.5s, 277.8kb
pypy 2.2.1 - 109.2s, 194.6kb
Sometimes, 2.2.1 is faster than 2.1.0, but they're both clearly much worse than CPython.
These tests were performed on Windows 7 64bit using the prebuilt 32bit binaries of Pypy. I tested the 32bit version of CPython too, to see if the problem was a lack of 64bit support. However, CPython 32bit also vastly outperformed Pypy.
Execution time was measured using time.time(). Memory usage was measured by watching the Windows Resource Manager and recording the peak Private value. Similar patterns were seen in Working Set, etc.
I thought the increased memory usage at least might be explained by constant overhead from compiled code or from it not running long enough to trigger full garbage collection. However, Pypy continues to use several times as much memory on much larger examples.
Does anyone know what could be going on here? Pypy isn't normally slower than CPython. Is there a way for me to tell what the problem is?

On Tue, Jan 14, 2014 at 11:38 PM, Robert Grosse <n210241048576@gmail.com> wrote:
Does anyone know what could be going on here? Pypy isn't normally slower than CPython. Is there a way for me to tell what the problem is?
Hi.
It depends on your workload a lot. If you want us to have a look into it, you need to provide a clear and reproducible way to run a benchmark.
Cheers, fijal

What would be the best way to provide this?
On Wed, Jan 15, 2014 at 8:39 AM, Maciej Fijalkowski <fijall@gmail.com> wrote:
It depends on your workload a lot. If you want us to have a look into it, you need to provide a clear and reproducible way to run a benchmark.
Cheers, fijal

On Wed, Jan 15, 2014 at 2:56 PM, Robert Grosse <n210241048576@gmail.com> wrote:
What would be the best way to provide this?
A link to a download would work. If you don't have hosting space, just send it to me privately by email and I'll put it somewhere (or use dropbox or whatever). *Clear* instructions are crucial though; don't assume we know how to install or use a random package.
Cheers, fijal

I've uploaded it to Mediafire. If that doesn't work, I can find some other place. https://www.mediafire.com/?eylpn794d9peu19
Included is the jar I used for the tests, a diff against the current head and instructions. The diff just removes the deleteUnused call, which I thought might skew results, and changes it to only decompile 1/4 of the files so it doesn't take as long.
Instructions:
Make sure python 2.7 is installed
clone and checkout commit b0929533ffa0bb2b6b5bb55fc4f38da2ab85a870 from https://github.com/Storyyeller/Krakatau.git
apply diff (this should just change two lines in decompile.py)
create a directory named temp
Assuming you cloned Krakatau to the directory Krakatau and the asm jar is in the current directory, run
python -i Krakatau\decompile.py -out temp asm-debug-all-4.1.jar
For pypy, it would of course instead be
pypy -i Krakatau\decompile.py -out temp asm-debug-all-4.1.jar
On Wed, Jan 15, 2014 at 9:02 AM, Maciej Fijalkowski <fijall@gmail.com> wrote:
link to download would work. If you don't have a hosting space, just send it to me privately in email, I'll put it somewhere (or use dropbox or whatever). *Clear* instructions are crucial though, don't assume we know how to install or use a random package.
Cheers, fijal

On Wed, Jan 15, 2014 at 3:21 PM, Robert Grosse <n210241048576@gmail.com> wrote:
Assuming you cloned Krakatau to the directory Krakatau and the asm jar is in the current directory, run python -i Krakatau\decompile.py -out temp asm-debug-all-4.1.jar
For pypy, it would of course instead be pypy -i Krakatau\decompile.py -out temp asm-debug-all-4.1.jar
hey, I get ClassNotFoundException: java/lang/Object, full traceback http://paste.pound-python.org/show/n5D7PDoyegfohmJFKOnu/

Oh sorry, I forgot about that.
You need to find the rt.jar from your Java installation and pass the path on the command line. For example, if it's located in C:\Program Files\Java\jre7\lib, you could do
python -i Krakatau\decompile.py -out temp asm-debug-all-4.1.jar -path "C:\Program Files\Java\jre7\lib\rt.jar"
Obviously on Linux it will be somewhere else. It shouldn't really matter which version of Java you have since the standard library is pretty stable.
On Wed, Jan 15, 2014 at 11:34 AM, Maciej Fijalkowski <fijall@gmail.com> wrote:
hey, I get ClassNotFoundException: java/lang/Object, full traceback http://paste.pound-python.org/show/n5D7PDoyegfohmJFKOnu/
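The rt.jar lookup Robert describes can be sketched in Python as follows. This is only an illustration: find_rt_jar is a hypothetical helper, not part of Krakatau, and the two candidate locations assume a Java 7-era install where the JRE keeps the file under lib and a JDK keeps it under jre/lib.

```python
import os


def find_rt_jar(java_home):
    """Return the path to rt.jar under java_home, or None if it is not found.

    Hypothetical helper; the directory layouts below are assumptions
    about typical Java 7-era installations.
    """
    candidates = [
        os.path.join(java_home, "lib", "rt.jar"),         # plain JRE layout
        os.path.join(java_home, "jre", "lib", "rt.jar"),  # JDK with bundled JRE
    ]
    for path in candidates:
        if os.path.isfile(path):
            return path
    return None
```

The returned path could then be passed to decompile.py via the -path option shown in the message above.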

On Wed, Jan 15, 2014 at 7:20 PM, Robert Grosse <n210241048576@gmail.com> wrote:
Oh sorry, I forgot about that.
Thanks, I'm looking into it. Would you mind if we add Krakatau as a benchmark for our nightlies?

Hi Robert.
This is going to be a long mail, so bear with me :)
The first takeaway is that pypy warmup is atrocious (that's unimpressive, but you might be delighted to hear I'm working on it right now, except I'm writing this mail). It's quite a bit of work, so it might or might not make it to the next pypy release. We also don't know how well it'll work.
The runs that I have now, when running 3 times in the same process, look like this (this includes other improvements mentioned later):
46s 32s 29s (cpython always takes 29s)
Now, this is far from ideal and we're working on making it better (in fact it's a very useful benchmark), but I can pinpoint some stuff that we will fix and some stuff we won't fix in the near future. One thing that I've already fixed today is loops over tuples when doing x in tuple (so tuple.__contains__).
One of the problems with this code is that I don't think it's very efficient. While that's not a good reason to be slower than cpython, it gives you an upper bound on what can be optimized away. Example (from java/structuring.py):
new = new if old is None else tuple(x for x in old if x in new)
Note that this has a complexity of O(n^2), because you iterate over all of one tuple and, for each element, over all of the elements of the other tuple.
Another example:
return [x for x in zip(*map(self._doms.get, nodes)) if len(set(x))==1][-1][0]
This creates quite a few lists, while all it wants to do is grab the last one.
Those tiny loops are found a bit everywhere. I think more consistent data structures will make it a lot faster on both CPython and PyPy.
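As a rough sketch of how the two quoted snippets might be restructured (this is not Krakatau's actual code, and ordered_intersection and last_common_entry are hypothetical names): the first function assumes the tuple elements are hashable so a set can replace the inner scan, and the second walks the columns once without building intermediate lists. Unlike the original expression, the second returns None rather than raising IndexError when no column matches.

```python
try:
    # Python 2: use the lazy zip so no intermediate list is built.
    from itertools import izip as zip
except ImportError:
    # Python 3: the built-in zip is already lazy.
    pass


def ordered_intersection(old, new):
    """Keep the items of `old` that also occur in `new`, preserving order.

    Mirrors `new if old is None else tuple(x for x in old if x in new)`,
    but tests membership against a set (assumes hashable items).
    """
    if old is None:
        return new
    members = set(new)  # built once; average O(1) lookups afterwards
    return tuple(x for x in old if x in members)


def last_common_entry(paths):
    """Return the value of the last position at which all paths agree.

    Returns None if no position matches (the original one-liner would
    raise IndexError in that case).
    """
    result = None
    for column in zip(*paths):
        first = column[0]
        if all(item == first for item in column):
            result = first
    return result
```

With `paths` taken as the sequences that `map(self._doms.get, nodes)` would produce, `last_common_entry(paths)` plays the role of the original return expression.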
From our side, we'll improve generator iterators today and warmup some time in the not-so-near future.
Speaking of which - memory consumption is absolutely atrocious. It's a combination of the JIT using too much memory, generator iterators not being cleaned up correctly *and* some bug that prevents JIT loops from being freed. We'll deal with all of it, give us some time (that said, the memory consumption *will* be bigger than cpython, but hopefully not by that much).
I'm sorry I can't help you as much as I wanted.
Cheers, fijal
On Thu, Jan 16, 2014 at 10:50 AM, Maciej Fijalkowski <fijall@gmail.com> wrote:
Thanks, I'm looking into it. Would you mind if we add Krakatau as a benchmark for our nightlies?

Hi, thanks for looking into it! Feel free to use it as a benchmark. I'll also look into the problems you mentioned to see if I can make future versions of Krakatau faster.
On Thu, Jan 16, 2014 at 8:51 AM, Maciej Fijalkowski <fijall@gmail.com> wrote:
Those tiny loops are found a bit everywhere. I think more consistent data structures will make it a lot faster on both CPython and PyPy.

Hi again,
I recently updated Pypy to (pypy-c-jit-70483-2d8eaa5f5079-win32), and Pypy's performance is much better now. I also addressed the previously mentioned issues in Krakatau so it is faster on both CPython and Pypy. However, I have noticed that there are still some cases in which CPython outperforms Pypy.
I created a benchmark using the one class I noticed with the biggest discrepancy:
https://github.com/Storyyeller/Krakatau.git commit 88a5a24deb3a8e6d0d92aca2052ea1db6a7274a0
You can run it via
python Krakatau\benchmark.py -path whatever\rt.jar
where you pass the path to your JRE's rt.jar as appropriate.
This benchmark is based on decompiling a single class, sun/text/normalizer/Utility from the JRE. The benchmark decompiles the class 40 times beforehand to warm up the JIT and then measures the time taken to decompile it 200 times using time.time(). I recorded memory usage manually via the Windows Task Manager using Peak Working Set. I used the Java 7u51 JRE, but I expect any version to be the same as I doubt the class changed much.
CPython: 202.8 seconds, 47.5mb
Pypy: 284.3 seconds, 229.2mb
The memory usage isn't too concerning to me, since I imagine that a JIT has higher fixed overhead, but I find it strange that CPython also executes faster for this class, since it is all pure Python CPU bound computation.
On Thu, Jan 16, 2014 at 5:51 AM, Maciej Fijalkowski <fijall@gmail.com> wrote:
From our side, we'll improve generator iterators today and warmup some time in the not-so-near future.
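The measurement procedure Robert describes for his April benchmark (40 untimed warm-up runs, then 200 runs timed with time.time()) could look roughly like the sketch below. It is not the contents of Krakatau's benchmark.py; the benchmark function and its func argument are assumptions standing in for "decompile the class once".

```python
import time


def benchmark(func, warmup=40, runs=200):
    """Call func() `warmup` times untimed, then time `runs` further calls.

    Returns the elapsed wall-clock seconds for the timed calls only.
    `func` is a placeholder for the workload being measured.
    """
    for _ in range(warmup):
        func()  # lets a JIT such as PyPy's compile the hot paths first
    start = time.time()
    for _ in range(runs):
        func()
    return time.time() - start
```

Keeping the warm-up runs outside the timed region means the reported figure reflects steady-state speed rather than JIT compilation time.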

It seems to be a major issue with our specialization, where we compile tons and tons of small bridges that bring no value. I committed a few improvements, but it's clearly not done yet. I also made small adjustments to use small instances instead of dicts in cases where you have a small set of well-known keys: http://paste.pound-python.org/show/Rh7p0uP3Is8C8VA6LyNN/
I'll look into it some more, thanks for a useful benchmark.
On Sun, Apr 13, 2014 at 5:10 AM, Robert Grosse <n210241048576@gmail.com> wrote:
The memory usage isn't too concerning to me, since I imagine that a JIT has higher fixed overhead, but I find it strange that CPython also executes faster for this class, since it is all pure Python CPU bound computation.
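fijal's suggestion of using small instances instead of dicts when the set of keys is small and fixed might look like this in outline. The linked paste is not reproduced here, so the names make_entry_dict, Entry, kind and index are purely hypothetical and do not come from Krakatau.

```python
def make_entry_dict(kind, index):
    """Dict-based record: fields are looked up by string key."""
    return {"kind": kind, "index": index}


class Entry(object):
    """Instance-based record with the same two fixed fields.

    Hypothetical example class; every instance has the same attributes,
    which is the 'small set of well-known keys' case.
    """

    def __init__(self, kind, index):
        self.kind = kind
        self.index = index


# Both forms carry the same data; only the access syntax differs.
as_dict = make_entry_dict("int", 3)
as_instance = Entry("int", 3)
```

Call sites change from as_dict["kind"] to as_instance.kind, which is the kind of small adjustment the message refers to.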
participants (2)
- Maciej Fijalkowski
- Robert Grosse