trying out STM for some numbers on more cores
Hi, last week I inherited a dual 8-core E5-2650 box, so 16 physical cores. For fun I wanted to redo the STM numbers from the blog post on more cores. I killed the printout coming from rpython, as, besides forcing unnecessary syncing, it also causes crashes beyond 12 threads or so. Here are my numbers:

Reference CPython, 1 thread:  271.17
Reference PyPy-2.1, 1 thread:   9.63

(Note that, for whatever reason, my CPython run above is slower than in the blog post, even though the pypy run is a lot faster.)

Freshly-translated (1385fb758727+ (stmgc-c4)): richards.py, 100 iterations each, printout from rpython removed:

Threads   avg. time   speedup
      1      358.48       1.0
      2      244.86       1.5
      4      162.87       2.2
      8      141.48       2.5
     16      127.43       2.8
     32      146.57       2.4

Note that the speedup numbers above are less than what was posted on the blog, and scaling with hyperthreads isn't working well. Has any attempt been made to pin threads?

Going beyond 32 threads crashes in:

#1  0x0000000000d29026 in pypy_g_start_new_thread ()
#2  0x0000000000498347 in pypy_g_BuiltinActivation_UwS_ObjSpace_W_Root_W_Root_W_R ()
#3  0x000000000085a8c1 in pypy_g_BuiltinCode_funcrun_obj ()
#4  0x0000000000858255 in pypy_g_funccall_valuestack__AccessDirect_None ()
#5  0x0000000000d73d86 in pypy_g_CALL_METHOD__AccessDirect_star_1 ()
#6  0x000000000089d3a3 in pypy_g_dispatch_bytecode__AccessDirect_None ()
#7  0x00000000008a34b5 in pypy_g_handle_bytecode__AccessDirect_None ()
#8  0x0000000000cfe311 in pypy_g_dispatch__AccessDirect_None_stm ()
#9  0x000000000136f974 in pypy_g__stm_callback_4 ()
#10 0x00000000014ea1cf in stm_perform_transaction ()
#11 0x000000000050e62e in pypy_g_ll_invoke_stm__pypy_objspace_std_frame_StdObjSpa ()
#12 0x00000000011ec935 in pypy_g_ll_portal_runner__Unsigned_Bool_pypy_interpreter ()
#13 0x0000000000874bda in pypy_g_PyFrame_run ()

Still trying to see whether I can get PyPy to run on the MIC. :)

Best regards,
           Wim
--
WLavrijsen@lbl.gov -- +1 (510) 486 6411 -- www.lavrijsen.net
On 10/28/2013 05:58 PM, wlavrijsen@lbl.gov wrote:
Has any attempt been made to pin threads?
I don't know. But I do know that processor/thread binding (if that is what you mean by "pin") is *extremely* important on Sandy Bridge, even more than on previous archs. And, oddly enough, in my experience it is more difficult to get right than on previous archs. Unfortunately, the runtime I have most experience with is very different from the standard gcc-pthreads-fork which I believe would be relevant here, so my experience is useless in this context. Good luck finding the way to do this right, and keep us posted! Dav
Hi Davide, On Tue, Oct 29, 2013 at 6:21 PM, Davide Del Vento <ddvento@ucar.edu> wrote:
I don't know. But I do know that processor/thread binding (if that is what you mean by "pin") is *extremely* important on Sandy Bridge, even more than on previous archs. And, oddly enough, in my experience it is more difficult to get right than on previous archs.
Can you point to more information about this? A bientôt, Armin.
Hi Armin, On 10/29/2013 03:54 PM, Armin Rigo wrote:
Can you point to more information about this?
Well, as I said: "in my experience". I can talk about it right here, and about some runs I did, if you so desire, but I can't point to anybody saying so "officially". I can point you to the following, which also shows that the runtime we use is pretty different (MPI and OpenMP, often from Intel or other proprietary compilers): https://www2.cisl.ucar.edu/resources/yellowstone/using_resources/runningjobs... Ciao, Davide
Davide,
I don't know. But I do know that processor/thread binding (if that is what you mean by "pin")
is what I meant. :) But a quick-and-dirty implementation does not seem to make much difference other than for 8 and 16 threads, where it helps a bit (a minimal sketch of such pinning follows below). Running some more, I noticed that there are plenty of other overheads, and the 'avg. time' doesn't get anywhere near stable until the number of iterations is in the 1000s (I used 100 before).

iterations   16 threads   32 threads   PyPy-2.1
       100       127.43       146.57       9.63
       200        77.59        86.37       7.80
       500        46.92        49.12       6.82
      1000        36.51        33.80       6.29
      2000        32.18        28.69       6.40

The numbers are closer together, and HT now helps (note that the "slowdown" at 2000 iterations for PyPy-2.1 is not significant; I should run this multiple times and average, but this is just for fun). It is obvious, though, that overheads are larger for STM at the moment, and therefore matter for longer. The differences at larger numbers of iterations are much smaller for smaller numbers of threads (and zero for 1 thread). Intuitively that makes sense. It also says that 16 threads can give an 11x speedup if there's enough work to do.

Best regards,
           Wim
--
WLavrijsen@lbl.gov -- +1 (510) 486 6411 -- www.lavrijsen.net
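A minimal sketch of such quick-and-dirty pinning, assuming Linux and Python 3.3+ for os.sched_setaffinity (with pid 0 the call applies to the calling thread); this illustrates the idea only and is not the code used for the numbers above:

    # Quick-and-dirty per-thread pinning sketch (assumes Linux and
    # Python 3.3+ for os.sched_setaffinity). Illustration only, not
    # the code used for the measurements in this thread.
    import os
    import threading

    NCPUS = len(os.sched_getaffinity(0))  # logical CPUs available to this process

    def pinned(target, cpu):
        def worker(*args):
            # pid 0 means "the calling task", i.e. this thread:
            # restrict it to a single logical CPU.
            os.sched_setaffinity(0, {cpu % NCPUS})
            target(*args)
        return worker

    def run_pinned(target, nthreads, *args):
        threads = [threading.Thread(target=pinned(target, i), args=args)
                   for i in range(nthreads)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()

Whether round-robin over the OS's logical CPU numbering is the right mapping depends on the topology: on a dual-socket, hyperthreaded box the numbering may interleave sockets and hyperthread siblings, which is one possible reason a naive pinning only helps at some thread counts.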
Hi Wim, Thanks for posting your numbers. I think they are interesting, and the 11x speedup for 16 threads is not bad; however, the overhead of STM is still too high compared to PyPy. Maybe you also need a larger dataset, besides a longer running time?
I should run this multiple times and average, but this is just for fun
I think for this purpose you need the best timing, not the average, especially if you are using a desktop/laptop. The best timing is something that happened and therefore "can happen". The average is affected by a variety of other things which may be running on your machine, which are better left out for this purpose (it's interesting to study them in order to see if one can eliminate them in a production environment, but that's a completely different job). Cheers, Davide
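The best-of-N idiom described here is directly supported by Python's timeit module; a minimal sketch, where bench() is an invented stand-in for the workload (not the richards harness):

    # Best-of-N timing sketch: keep the minimum of several runs, since
    # the best run is the one least disturbed by other activity.
    import timeit

    def bench():  # invented stand-in workload
        return sum(i * i for i in range(100000))

    best = min(timeit.repeat(bench, repeat=5, number=1))
    print("best of 5 runs: %.4fs" % best)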
Also, the process should perform 1000 iterations before you start the timings. The JIT needs a lot of iterations to warm up correctly. -- Amaury Forgeot d'Arc
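For illustration, a driver that follows this advice might look like the sketch below; run_iteration() is an invented placeholder for one richards-style iteration, not the actual harness used in the thread:

    # Warm-up-then-measure sketch: discard the first batch of iterations
    # so the JIT has traced and compiled the hot loops before timing starts.
    import time

    def run_iteration():  # invented placeholder workload
        sum(i * i for i in range(10000))

    WARMUP = 1000    # untimed: lets the JIT warm up
    MEASURED = 1000  # only these iterations are timed

    for _ in range(WARMUP):
        run_iteration()

    start = time.time()
    for _ in range(MEASURED):
        run_iteration()
    print("avg per iteration: %.6fs" % ((time.time() - start) / MEASURED))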
Unless I missed something (possible!) the JIT and STM are mutually exclusive (until implemented). -- taa /*eof*/
2013/10/30 Taavi Burns <taavi.burns@gmail.com>
Unless I missed something (possible!) the JIT and STM are mutually exclusive (until implemented).
In the last post about STM: http://morepypy.blogspot.ch/2013/10/update-on-stm.html "For comparison, disabling the JIT gives 492ms on PyPy-2.1 and 538ms on PyPy-STM." The JIT on PyPy-STM gives only a slight improvement so far, but it will definitely slow things down a lot during the tracing phase. -- Amaury Forgeot d'Arc
Best news (new to me anyway!) I've read all day. :) Thanks!
-- taa /*eof*/
Amaury, On Wed, 30 Oct 2013, Amaury Forgeot d'Arc wrote:
Also, the process should perform 1000 iterations before you start the timings. The JIT needs a lot of iterations to warm up correctly.
so, each 'iteration' that I had in the table contains an inner loop that is itself JIT-ed (not verified, but just compare the PyPy-2.1 vs. CPython numbers to see how well that works; likewise, if I switch off the JIT, the result is an xx slowdown compared to CPython). Thus, even with '100 iterations', the warm-up hurt should only be in the first iteration.

Best regards,
           Wim
--
WLavrijsen@lbl.gov -- +1 (510) 486 6411 -- www.lavrijsen.net
Davide,
Thanks for posting your numbers. I think they are interesting, and the 11x speedup for 16 threads is not bad; however, the overhead of STM is still too high compared to PyPy.
well, yes and no: richards.py runs 30x faster on PyPy than on CPython. The more typical speedup of PyPy is 5x, so if I can get an 11x speedup instead, I'm already pretty happy. In particular, my main interest is, as always, legacy C++. Remi's thesis shows how one can build higher-level constructs to control commits. In other words, I can offer the Python user thread-safe access to non-thread-safe C++ without forcing a use pattern on him (a hypothetical sketch of the idea follows below). Now, if the bulk of the time spent is not in Python in the first place, the overhead may very well not be "too high". That needs to be proven, and of course the lower the overhead the better, but I'm rather optimistic. :)
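A hypothetical sketch of that idea: the atomic context manager is assumed to be the one pypy-stm exposes as __pypy__.thread.atomic (with a plain lock standing in elsewhere so the sketch still runs), and CppCounter is an invented placeholder for a cppyy-bound, non-thread-safe C++ class:

    # Hypothetical sketch: wrapping a non-thread-safe C++ binding so
    # Python threads see consistent state without a special use pattern.
    import threading

    try:
        from __pypy__.thread import atomic   # assumed pypy-stm API
    except ImportError:
        atomic = threading.RLock()           # fallback on other interpreters

    class CppCounter(object):
        """Invented stand-in for a cppyy-bound, non-thread-safe C++ class."""
        def __init__(self):
            self.value = 0
        def increment(self):
            self.value += 1

    def safe_increment(counter):
        # Inside the atomic block the mutations commit as one transaction
        # (or, with the lock fallback, run under mutual exclusion).
        with atomic:
            counter.increment()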
I think for this purpose you need the best timing, not the average, especially if you are using a desktop/laptop.
Yes, I know: with Intel's version of OpenMP, for example, affinity is set when threads start, not when they are created. Short benchmarks tend to be consistently off in individual runs when the assignment ends up unbalanced. The point was more that the last couple of digits in the timing, although printed, are largely noise, and thus that PyPy-2.1 didn't really start slowing down at 2000 iterations, as the numbers might otherwise suggest.

Best regards,
           Wim
--
WLavrijsen@lbl.gov -- +1 (510) 486 6411 -- www.lavrijsen.net
participants (5)
- Amaury Forgeot d'Arc
- Armin Rigo
- Davide Del Vento
- Taavi Burns
- wlavrijsen@lbl.gov