I was talking with Russel Winder at PyCon UK.
He says, currently, PyPy's threading does not scale properly. More below. Maybe we want to use his benchmark?

Laura

------- Forwarded Message

Subject: PyPy and multiprocessing
Date: Thu Sep 29 13:53:50 2011
From: Russel Winder <russel@russel.org.uk>
To: Laura Creighton <lac@openend.se>

Laura,

I have a collection of various versions (using various features of various languages) of the embarrassingly parallel problem of calculating Pi using quadrature. It is a micro-benchmark and so suffers from all the issues micro-benchmarks suffer from (especially on the JVM). The code is a Bazaar branch: http://www.russel.org.uk/Bazaar/Pi_Quadrature

I am writing as there appears to be an interesting feature when using PyPy and the multiprocessing package in pool mode.

This is a twin-Xeon machine so has 8 cores -- a 32 thread run should only go as fast as an 8 thread run. Scaling should be linear in the number of cores.

Using CPython 2.7, I get:

|> python2.7 pi_python2_multiprocessing_pool.py
==== Python Multiprocessing Pool pi = 3.14159265359
==== Python Multiprocessing Pool iteration count = 10000000
==== Python Multiprocessing Pool elapse = 3.5378549099
==== Python Multiprocessing Pool process count = 1
==== Python Multiprocessing Pool processor count = 8

==== Python Multiprocessing Pool pi = 3.14159265359
==== Python Multiprocessing Pool iteration count = 10000000
==== Python Multiprocessing Pool elapse = 1.97133994102
==== Python Multiprocessing Pool process count = 2
==== Python Multiprocessing Pool processor count = 8

==== Python Multiprocessing Pool pi = 3.14159265359
==== Python Multiprocessing Pool iteration count = 10000000
==== Python Multiprocessing Pool elapse = 0.515691041946
==== Python Multiprocessing Pool process count = 8
==== Python Multiprocessing Pool processor count = 8

==== Python Multiprocessing Pool pi = 3.14159265359
==== Python Multiprocessing Pool iteration count = 10000000
==== Python Multiprocessing Pool elapse = 0.521239995956
==== Python Multiprocessing Pool process count = 32
==== Python Multiprocessing Pool processor count = 8

Using PyPy 1.6 I get:

|> pypy pi_python2_multiprocessing_pool.py
==== Python Multiprocessing Pool pi = 3.14159265359
==== Python Multiprocessing Pool iteration count = 10000000
==== Python Multiprocessing Pool elapse = 0.249331951141
==== Python Multiprocessing Pool process count = 1
==== Python Multiprocessing Pool processor count = 8

==== Python Multiprocessing Pool pi = 3.14159265359
==== Python Multiprocessing Pool iteration count = 10000000
==== Python Multiprocessing Pool elapse = 0.104065895081
==== Python Multiprocessing Pool process count = 2
==== Python Multiprocessing Pool processor count = 8

==== Python Multiprocessing Pool pi = 3.14159265359
==== Python Multiprocessing Pool iteration count = 10000000
==== Python Multiprocessing Pool elapse = 0.0764398574829
==== Python Multiprocessing Pool process count = 8
==== Python Multiprocessing Pool processor count = 8

==== Python Multiprocessing Pool pi = 3.14159265359
==== Python Multiprocessing Pool iteration count = 10000000
==== Python Multiprocessing Pool elapse = 0.124751091003
==== Python Multiprocessing Pool process count = 32
==== Python Multiprocessing Pool processor count = 8

There is no statistical significance to these one-off numbers, but I am fairly confident that there are no large variations should a proper collection of data be taken.

The point here is that whereas CPython shows the expected scaling, PyPy does not give the expected scaling for larger numbers of cores. Indeed, having more threads than cores is detrimental to PyPy but not to CPython.

Hopefully we will soon be seeing PyPy be Python 3.2 compliant!

--
Russel.

------- End of Forwarded Message
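[The benchmark script itself is not reproduced in the thread. For readers without access to the Bazaar branch, here is a minimal sketch of what a pool-mode pi-by-quadrature benchmark of this shape might look like, assuming midpoint-rule quadrature of 4/(1+x^2) over [0,1]; the actual pi_python2_multiprocessing_pool.py may differ in detail.]

import multiprocessing
import time

def process_slice(args):
    # Hypothetical worker: sums one slice of the midpoint-rule quadrature
    # of 4/(1+x^2) over [0,1], whose integral is pi.
    slice_id, slice_size, delta = args
    total = 0.0
    for i in range(1 + slice_id * slice_size, (slice_id + 1) * slice_size + 1):
        x = (i - 0.5) * delta
        total += 1.0 / (1.0 + x * x)
    return total

def execute(process_count, n=10000000):
    delta = 1.0 / n
    slice_size = n // process_count
    start_time = time.time()
    # A fresh pool per run, so pool creation cost is inside the timing.
    pool = multiprocessing.Pool(processes=process_count)
    slices = [(i, slice_size, delta) for i in range(process_count)]
    pi = 4.0 * delta * sum(pool.map(process_slice, slices))
    pool.close()
    elapse_time = time.time() - start_time
    print('==== Python Multiprocessing Pool pi = ' + str(pi))
    print('==== Python Multiprocessing Pool iteration count = ' + str(n))
    print('==== Python Multiprocessing Pool elapse = ' + str(elapse_time))
    print('==== Python Multiprocessing Pool process count = ' + str(process_count))
    print('==== Python Multiprocessing Pool processor count = ' + str(multiprocessing.cpu_count()))

if __name__ == '__main__':
    for count in (1, 2, 8, 32):
        execute(count)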
Hi,

David Beazley noticed that PyPy's GIL isn't very good compared to CPython's:

https://twitter.com/#!/dabeaz/status/118889721358327808
https://twitter.com/#!/dabeaz/status/118888789136523264
https://twitter.com/#!/dabeaz/status/118864260175634433

IMO it's the same issue.

Cheers,
Romain
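[For context, a minimal sketch in the spirit of Beazley's well-known GIL measurements: compare a CPU-bound function run twice sequentially against the same work split across two threads. This is an illustrative reconstruction, not the exact code behind those tweets.]

import threading
import time

def count(n):
    # Pure CPU-bound loop; under a GIL, two threads running this cannot
    # execute in parallel, and a poorly behaved GIL can make the threaded
    # version slower than the sequential one.
    while n > 0:
        n -= 1

if __name__ == '__main__':
    N = 10000000
    start = time.time()
    count(N)
    count(N)
    print('sequential: ' + str(time.time() - start))

    t1 = threading.Thread(target=count, args=(N,))
    t2 = threading.Thread(target=count, args=(N,))
    start = time.time()
    t1.start()
    t2.start()
    t1.join()
    t2.join()
    print('two threads: ' + str(time.time() - start))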
I think the slowdown you're seeing is due to the time it takes to create new processes. This seems to be quite a bit slower in PyPy than in CPython. However, once the process pool is created and has been used once, the execution time vs. process count behaves as expected.

I attached a modified version of your code to demonstrate the behavior. It calculates Pi once without using multiprocessing as a baseline for comparison. Then a multiprocessing.Pool object is created with 8 processes, and the same pool is used multiple times. On my machine, creating the 8 new processes takes 0.60 seconds in PyPy and only 0.20 seconds in CPython.

The pool is first used two times in a row with only a single process active. For some reason, the second run is a factor of 2 faster than the first. Is this just warmup of the JIT, or some other behavior?

Next, it repeats using 2, 4, and 8 processes. This was run on a 4-core machine, and as expected there was an improvement in run time with 2 and 4 processes. Using 8 processes gives approximately the same run time as 4.

The output is pasted below. I also pasted the modified code here in case the attached file doesn't come through: http://pastie.org/2614751. For reference, I'm running PyPy 1.6 on Windows 7.

Sincerely,
Josh

C:\Users\jayers\Documents\SVN\randomStuff\pypy_comparisons>pypy-c pi_python2_multiprocessing_pool.py
3.14159265359
non parallel execution time: 1.52899980545
pool creation time: 0.559000015259

==== Python Multiprocessing Pool pi = 3.14159265359
==== Python Multiprocessing Pool iteration count = 100000000
==== Python Multiprocessing Pool elapse = 3.1930000782
==== Python Multiprocessing Pool process count = 1
==== Python Multiprocessing Pool processor count = 4

==== Python Multiprocessing Pool pi = 3.14159265359
==== Python Multiprocessing Pool iteration count = 100000000
==== Python Multiprocessing Pool elapse = 1.53900003433
==== Python Multiprocessing Pool process count = 1
==== Python Multiprocessing Pool processor count = 4

==== Python Multiprocessing Pool pi = 3.14159265359
==== Python Multiprocessing Pool iteration count = 100000000
==== Python Multiprocessing Pool elapse = 0.802000045776
==== Python Multiprocessing Pool process count = 2
==== Python Multiprocessing Pool processor count = 4

==== Python Multiprocessing Pool pi = 3.14159265359
==== Python Multiprocessing Pool iteration count = 100000000
==== Python Multiprocessing Pool elapse = 0.441999912262
==== Python Multiprocessing Pool process count = 4
==== Python Multiprocessing Pool processor count = 4

==== Python Multiprocessing Pool pi = 3.14159265359
==== Python Multiprocessing Pool iteration count = 100000000
==== Python Multiprocessing Pool elapse = 0.457000017166
==== Python Multiprocessing Pool process count = 8
==== Python Multiprocessing Pool processor count = 4
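[The attachment is not preserved in the archive. A minimal sketch of the reuse-the-pool measurement described above; the work function here is a hypothetical CPU-bound stand-in, not the original quadrature slice.]

import multiprocessing
import time

def work(n):
    # CPU-bound stand-in for one quadrature slice.
    total = 0.0
    for i in range(1, n + 1):
        total += 1.0 / i
    return total

if __name__ == '__main__':
    n = 10000000

    start = time.time()
    work(n)
    print('non parallel execution time: ' + str(time.time() - start))

    start = time.time()
    pool = multiprocessing.Pool(processes=8)
    print('pool creation time: ' + str(time.time() - start))

    # Reuse the same pool throughout; the count of 1 is repeated to
    # expose the first-use overhead.
    for count in (1, 1, 2, 4, 8):
        start = time.time()
        pool.map(work, [n // count] * count)
        print(str(count) + ' tasks, elapse = ' + str(time.time() - start))
    pool.close()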
Here's a further modified version. In this case, when using the pool for the first time, it uses an n of 10 instead of 100 million. Even with such a low precision, the first execution takes 1.3 seconds. It seems some significant warm-up time is needed the first time a multiprocessing.Pool object is used. See the attachment or this link for the code: http://pastie.org/2614925
Hi,

Is the conclusion just the fact that, again, the JIT's warm-up time is important, which we know very well? Or is there some other effect that cannot be explained just by that?

(BTW, Laura, it's unrelated to multithreading if it's based on the multiprocessing module.)

A bientôt,

Armin.
On Fri, Sep 30, 2011 at 10:20 AM, Armin Rigo <arigo@tunes.org> wrote:
Is the conclusion just the fact that, again, the JIT's warm-up time is important, which we know very well? Or is there some other effect that cannot be explained just by that?
I guess what people didn't realize is that if you spawn a new process, you have to warm up the JIT *again* for each of the workers (at least in the worst case scenario).
I don't think it's due to the warmup of the JIT. Here's a simpler example.

import time
import multiprocessing

def do_nothing():
    pass

if __name__ == '__main__':
    time1 = time.time()
    do_nothing()
    time2 = time.time()

    pool = multiprocessing.Pool(processes=1)
    time3 = time.time()

    result = pool.apply_async(do_nothing)
    result.get()
    time4 = time.time()

    result = pool.apply_async(do_nothing)
    result.get()
    time5 = time.time()

    pool.close()

    print('not multiprocessing: ' + str(time2 - time1))
    print('create pool: ' + str(time3 - time2))
    print('run first time: ' + str(time4 - time3))
    print('run second time: ' + str(time5 - time4))

Here are the results in PyPy. The first call to do_nothing() using multiprocessing.Pool takes 0.57 seconds.

not multiprocessing: 0.0
create pool: 0.30999994278
run first time: 0.575999975204
run second time: 0.00100016593933

Here are the results in CPython. It also appears to have some overhead the first time the pool is used, but it's less severe than in PyPy.

not multiprocessing: 0.0
create pool: 0.00500011444092
run first time: 0.134000062943
run second time: 0.0

On Fri, Sep 30, 2011 at 6:25 AM, Maciej Fijalkowski <fijall@gmail.com> wrote:
I guess what people didn't realize is that if you spawn a new process, you have to warm up the JIT *again* for each of the workers (at least in the worst case scenario).
Hi,

On Fri, Sep 30, 2011 at 17:54, Josh Ayers <josh.ayers@gmail.com> wrote:
I don't think it's due to the warmup of the JIT. Here's a simpler example.
I think that your example is perfectly compatible with the JIT warm-up time theory. This is kind of obvious by comparing the CPython and the PyPy timings:

- something that takes less than 1ms on CPython is going to be just as fast on PyPy (or at least, less than 2ms) because there is no JITting at all involved;

- something that runs for several seconds *in the same process* in CPython would be likely to be faster on PyPy;

- everything shorter is at risk: I'd say that 0.1 to 0.5 seconds in CPython looks like the worst case for PyPy, because it needs to run the JIT but the process terminates before it's really useful. That's just what your example shows.

On non-Windows I would recommend priming the JIT by calling the function a few times, so that a fork() can inherit already-JITted code. Of course it doesn't work on Windows.

You're left with the usual remark: PyPy's JIT does have a long warm-up time for every process that is started anew, so make sure to use the multiprocessing module carefully (e.g. don't stop and restart processes all the time).

A bientôt,

Armin.
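[A minimal sketch of that priming trick, under the assumption that multiprocessing workers are created with fork() (so not on Windows); the function, repeat count, and workload are illustrative, not tuned.]

import multiprocessing
import time

def hot(n):
    # The function the workers will execute; CPU-bound, so JIT warm-up
    # dominates its first executions.
    total = 0.0
    for i in range(1, n + 1):
        total += 1.0 / i
    return total

if __name__ == '__main__':
    # Prime the JIT in the parent process first, before any worker exists.
    for _ in range(5):
        hot(1000000)

    # Workers forked after the warm-up inherit the already-JITted code,
    # so they skip their own warm-up; this does not help on Windows.
    pool = multiprocessing.Pool(processes=4)
    start = time.time()
    pool.map(hot, [1000000] * 4)
    print('elapse: ' + str(time.time() - start))
    pool.close()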
participants (5)

- Armin Rigo
- Josh Ayers
- Laura Creighton
- Maciej Fijalkowski
- Romain Guillebert