Helping with STM at the PyCon 2013 (Santa Clara) sprints
From a recent email of Armin's to the list: The STM project has progressed slowly over the last few months. The status right now is:
* Most importantly, major Garbage Collection cycles are missing, which means pypy-stm slowly but constantly leaks memory.
* The JIT integration is not finished; so far pypy-stm can only be compiled without the JIT.
* There are also other places where the performance can be improved, probably a lot.
* Finally, there are a number of usability concerns that we (or mostly Remi) worked on recently. The main issues revolve around the idea that, as a user of pypy-stm, you should have a way to get feedback on the process. For example, right now, transactions that abort are completely transparent --- to the point that you have no way to know one occurred, apart from "it runs too slowly" if it happens a lot. You should have a way to get Python tracebacks of aborts if you want to. A similar issue is "inevitable" transactions.
I'm interested in helping with STM: 1) I think STM is really interesting, particularly Armin's take on it; 2) I need a "Personal Development Goal" for %(dayjob)s. Last year it was just "Contribute to PyPy", which I did at the sprints (a bit). This year, I'd like to try something a bit more ambitious. ;)

From the list above, are there any particular areas (tickets?) that would be a good starting place for me to look at? I expect that to get the most out of the sprints, I should do a bit of pre-work (reading at least, if not poking).

Thanks!

-- taa /*eof*/
Hi Taavi,

On Wed, Feb 13, 2013 at 4:56 PM, Taavi Burns <taavi.burns@gmail.com> wrote:
From the list above, are there any particular areas (tickets?) that would be a good starting place for me to look at?
I can't just give you a specific task to do, but you can try to understand what is there so far. Look at the branch "stm-thread-2" in the pypy repository; e.g. try to translate with "rpython -O2 --stm targetpypystandalone". This gives you a kind-of-GIL-less PyPy. Try to use the transaction module ("import transaction") on some demo programs. Then I suppose you should dive into the mess that is multithreaded programming by looking in depth at lib_pypy/transaction.py. And this is all before diving into the PyPy sources themselves...

You may also look at the work done by Remi Meier in his own separate repository (https://bitbucket.org/Raemi/pypy-stm-logging). It contains mostly experiments with various ideas that haven't been integrated back (or not yet).

A bientôt,

Armin.
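A minimal demo program along the lines Armin suggests might look like the following sketch. It assumes only the transaction-module interface exercised later in this thread (set_num_threads(), add(), and run() from lib_pypy/transaction.py), and it is meant to run under the translated pypy-c, not CPython:

    # Sketch of a demo program for pypy-stm's transaction module.
    import transaction

    transaction.set_num_threads(2)

    results = [0, 0]

    def make_worker(i):
        def work():
            # Each callback writes only to its own slot, so the
            # transactions never conflict and can commit in parallel.
            total = 0
            for y in range(100000):
                total += y
            results[i] = total
        return work

    transaction.add(make_worker(0))
    transaction.add(make_worker(1))
    transaction.run()

    print results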
That sounds like a reasonable place to start, thanks!

I tried running the translation, and immediately hit what looks like a failure from merging in default, due to the pypy/rpython move. I've got a patch currently pushing to bitbucket, but it'll be a few minutes (pushing ~10 months of pypy dev effort). It'll be at https://bitbucket.org/taavi_burns/pypy/commits/0378c78cc316 when the push finishes. :)

The translation still eventually fails, though:

    [translation:ERROR] File "../../rpython/translator/stm/jitdriver.py", line 86, in check_jitdriver
    [translation:ERROR]   assert not jitdriver.autoreds # XXX
    [translation:ERROR] AssertionError

Full stack and software versions: https://gist.github.com/taavi/4949322

Any ideas? Thanks!
-- taa /*eof*/
Hi Taavi,

I finally fixed pypy-stm with signals. Now I'm again getting results that scale with the number of processors.

Note that it stops scaling at some point, around 4 or 6 threads, on the machines I tried it on. I suspect it's related to the fact that physical processors have 4 or 6 cores internally, but the results are still a bit inconsistent. Using the "taskset" command to force the threads to run on a particular physical socket seems to help a little with some numbers. Fwiw, I got the maximum throughput on a 24-core machine by really running 24 threads, but that seems wasteful, as it is only 25% better than running 6 threads on one physical socket.

The next step will be trying to reduce the overhead, which is currently considerable (about 10x slower than CPython --- too much to ever have any net benefit). Also high on the list is fixing the constant memory leak (i.e. implementing major garbage collection steps).

A bientôt,

Armin.
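For anyone reproducing these numbers, pinning pypy-c to a single physical socket with taskset looks roughly like the command below; which core numbers share a socket depends on the machine's topology, so the 0-5 range (and the demo.py file name) is an assumption:

    taskset -c 0-5 ./pypy-c demo.py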
That's great, thanks! I did get it to work when you wrote earlier, but it's definitely faster now.

I tried a ridiculously simple and no-conflict parallel program and came up with this, which gave me some questionable performance numbers from a build of 65ec96e15463:

    taavi@pypy:~/pypy/pypy/goal$ ./pypy-c -m timeit -s 'import transaction; transaction.set_num_threads(1)' '
    def foo():
        x = 0
        for y in range(100000):
            x += y
    transaction.add(foo)
    transaction.add(foo)
    transaction.run()'
    10 loops, best of 3: 198 msec per loop

    taavi@pypy:~/pypy/pypy/goal$ ./pypy-c -m timeit -s 'import transaction; transaction.set_num_threads(2)' '
    def foo():
        x = 0
        for y in range(100000):
            x += y
    transaction.add(foo)
    transaction.add(foo)
    transaction.run()'
    10 loops, best of 3: 415 msec per loop

It's entirely possible that this is an effect of running inside a VMware guest (set to use 2 cores) on my Core2Duo laptop. If this is the case, I'll refrain from trying to do anything remotely like benchmarking in this environment in the future. :)

Would it be more helpful (if I want to contribute to STM) to use something like a high-CPU EC2 instance, or should I look at obtaining something like an 8-real-core AMD X8? (My venerable X2 has started to disagree with its RAM, so it's ripe for retirement.)

Thanks!
-- taa /*eof*/
I got frustrated with my (actually dying now) local box and signed up for AWS, using an m1.medium instance to build pypy (~100 minutes) and then upgrading it to a c1.xlarge (which claims to be 8 virtual cores of 2.5 ECU each). With the same sample program, I see the expected kinds of speedups! :D

So using VMware is right out. Hopefully that info is useful for someone else in the future. :)
-- taa /*eof*/
participants (2)
- Armin Rigo
- Taavi Burns