[Python-Dev] Reworking the GIL

Collin Winter collinw at gmail.com
Mon Oct 26 21:01:34 CET 2009


On Sun, Oct 25, 2009 at 1:22 PM, Antoine Pitrou <solipsis at pitrou.net> wrote:
> Having other people test it would be fine. Even better if you have an
> actual multi-threaded py3k application. But ccbench results for other
> OSes would be nice too :-)

My results for a 2.4 GHz Intel Core 2 Duo MacBook Pro (OS X 10.5.8):

Control (py3k @ r75723)

--- Throughput ---

Pi calculation (Python)

threads=1: 633 iterations/s.
threads=2: 468 ( 74 %)
threads=3: 443 ( 70 %)
threads=4: 442 ( 69 %)

regular expression (C)

threads=1: 281 iterations/s.
threads=2: 282 ( 100 %)
threads=3: 282 ( 100 %)
threads=4: 282 ( 100 %)

bz2 compression (C)

threads=1: 379 iterations/s.
threads=2: 735 ( 193 %)
threads=3: 733 ( 193 %)
threads=4: 724 ( 190 %)

--- Latency ---

Background CPU task: Pi calculation (Python)

CPU threads=0: 0 ms. (std dev: 0 ms.)
CPU threads=1: 1 ms. (std dev: 1 ms.)
CPU threads=2: 1 ms. (std dev: 2 ms.)
CPU threads=3: 3 ms. (std dev: 6 ms.)
CPU threads=4: 2 ms. (std dev: 3 ms.)

Background CPU task: regular expression (C)

CPU threads=0: 0 ms. (std dev: 0 ms.)
CPU threads=1: 975 ms. (std dev: 577 ms.)
CPU threads=2: 1035 ms. (std dev: 571 ms.)
CPU threads=3: 1098 ms. (std dev: 556 ms.)
CPU threads=4: 1195 ms. (std dev: 557 ms.)

Background CPU task: bz2 compression (C)

CPU threads=0: 0 ms. (std dev: 0 ms.)
CPU threads=1: 0 ms. (std dev: 2 ms.)
CPU threads=2: 4 ms. (std dev: 5 ms.)
CPU threads=3: 0 ms. (std dev: 0 ms.)
CPU threads=4: 1 ms. (std dev: 4 ms.)



Experiment (newgil branch @ r75723)

--- Throughput ---

Pi calculation (Python)

threads=1: 651 iterations/s.
threads=2: 643 ( 98 %)
threads=3: 637 ( 97 %)
threads=4: 625 ( 95 %)

regular expression (C)

threads=1: 298 iterations/s.
threads=2: 296 ( 99 %)
threads=3: 288 ( 96 %)
threads=4: 287 ( 96 %)

bz2 compression (C)

threads=1: 378 iterations/s.
threads=2: 720 ( 190 %)
threads=3: 724 ( 191 %)
threads=4: 718 ( 189 %)

--- Latency ---

Background CPU task: Pi calculation (Python)

CPU threads=0: 0 ms. (std dev: 0 ms.)
CPU threads=1: 0 ms. (std dev: 1 ms.)
CPU threads=2: 0 ms. (std dev: 1 ms.)
CPU threads=3: 0 ms. (std dev: 0 ms.)
CPU threads=4: 1 ms. (std dev: 5 ms.)

Background CPU task: regular expression (C)

CPU threads=0: 0 ms. (std dev: 0 ms.)
CPU threads=1: 1 ms. (std dev: 0 ms.)
CPU threads=2: 2 ms. (std dev: 1 ms.)
CPU threads=3: 2 ms. (std dev: 2 ms.)
CPU threads=4: 2 ms. (std dev: 1 ms.)

Background CPU task: bz2 compression (C)

CPU threads=0: 0 ms. (std dev: 0 ms.)
CPU threads=1: 0 ms. (std dev: 0 ms.)
CPU threads=2: 2 ms. (std dev: 3 ms.)
CPU threads=3: 0 ms. (std dev: 1 ms.)
CPU threads=4: 0 ms. (std dev: 0 ms.)


I also ran this through Unladen Swallow's threading microbenchmark,
which is a straight copy of what David Beazley was experimenting with
(simply iterating over 1000000 ints in pure Python) [1].
"iterative_count" runs the loops one after the other;
"threaded_count" runs the same loops in parallel using threads.

The results below use py3k as the control and newgil as the
experiment; "x% faster" measures newgil's performance relative to
py3k's.

With two threads:

iterative_count:
Min: 0.336573 -> 0.387782: 13.21% slower
(I've run this configuration multiple times and gotten the same slowdown.)
Avg: 0.338473 -> 0.418559: 19.13% slower
Significant (t=-38.434785, a=0.95)

threaded_count:
Min: 0.529859 -> 0.397134: 33.42% faster
Avg: 0.581786 -> 0.429933: 35.32% faster
Significant (t=70.100445, a=0.95)


With four threads:

iterative_count:
Min: 0.766617 -> 0.734354: 4.39% faster
Avg: 0.771954 -> 0.751374: 2.74% faster
Significant (t=22.164103, a=0.95)
Stddev: 0.00262 -> 0.00891: 70.53% larger

threaded_count:
Min: 1.175750 -> 0.829181: 41.80% faster
Avg: 1.224157 -> 0.867506: 41.11% faster
Significant (t=161.715477, a=0.95)
Stddev: 0.01900 -> 0.01120: 69.65% smaller


With eight threads:

iterative_count:
Min: 1.527794 -> 1.447421: 5.55% faster
Avg: 1.536911 -> 1.479940: 3.85% faster
Significant (t=35.559595, a=0.95)
Stddev: 0.00394 -> 0.01553: 74.61% larger

threaded_count:
Min: 2.424553 -> 1.677180: 44.56% faster
Avg: 2.484922 -> 1.723093: 44.21% faster
Significant (t=184.766131, a=0.95)
Stddev: 0.02874 -> 0.02956: 2.78% larger


I'd be interested in multithreaded benchmarks with less homogeneous workloads.

Collin Winter

[1] - http://code.google.com/p/unladen-swallow/source/browse/tests/performance/bm_threading.py
