Re: [Python-Dev] [Python-ideas] Remove GIL with CAS instructions?
-----Original Message-----
From: python-ideas-bounces+kristjan=ccpgames.com@python.org [mailto:python-ideas-bounces+kristjan=ccpgames.com@python.org] On Behalf Of Sturla Molden
Sent: 20. október 2009 22:13
To: python-ideas@python.org
Subject: Re: [Python-ideas] Remove GIL with CAS instructions?
- The GIL has consequences on multicore CPUs that are often overlooked: thread switches are usually missed at check intervals. This could be fixed without removing the GIL: for example, there could be a wait queue for the GIL; a thread that requests the GIL puts itself at the back.
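The wait-queue idea above can be sketched at the Python level. This is a hypothetical illustration only (CPython's GIL lives in C and is not implemented this way): a FIFO "ticket" lock in which release() hands the lock directly to the oldest waiter, so a thread that releases and immediately re-acquires cannot starve the others.

```python
import threading
from collections import deque

class FairLock:
    """A FIFO ("ticket") lock sketch: waiters are served in arrival
    order, so a thread that releases and immediately re-acquires the
    lock goes to the back of the queue instead of starving others."""

    def __init__(self):
        self._mutex = threading.Lock()   # protects internal state
        self._waiters = deque()          # one Event per blocked thread
        self._locked = False

    def acquire(self):
        self._mutex.acquire()
        if not self._locked and not self._waiters:
            # Fast path: lock is free and nobody is queued.
            self._locked = True
            self._mutex.release()
            return
        # Slow path: enqueue a private event and wait for a hand-off.
        ev = threading.Event()
        self._waiters.append(ev)
        self._mutex.release()
        ev.wait()   # release() hands us the lock directly

    def release(self):
        with self._mutex:
            if self._waiters:
                # Hand the lock straight to the oldest waiter;
                # _locked stays True because ownership transfers.
                self._waiters.popleft().set()
            else:
                self._locked = False
```

The direct hand-off in release() is what provides the fairness; a conventional lock simply marks itself free and lets the scheduler decide, which is exactly where the unfairness described in this thread creeps in.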
This depends entirely on the platform and the primitives used to implement the GIL. I'm interested in Windows. There, I found this article: http://fonp.blogspot.com/2007/10/fairness-in-win32-lock-objects.html

So, you may be on to something. Perhaps a simple C test is in order, then? I did that. I found, on my dual-core Vista machine, running "release" builds, that both Mutexes and CriticalSections behaved as you describe, with no "fairness". Using a Semaphore seems to retain fairness, however. Fairness was retained in debug builds too, strangely enough.

Now, Python uses none of these. On Windows, it uses an Event object coupled with an atomically updated counter. This also behaves fairly. The test application is attached.

I think that you ought to substantiate your claims better, maybe with a specific platform and using some test like the above. On the other hand, it shows that we must be careful what we use. There has been some talk of using CriticalSections for the GIL on Windows. This test ought to show the danger of that. The GIL is different from a regular lock. It is a reverse lock, really, and therefore may need to be implemented in its own special way if we want very fast mutexes for the rest of the system.

(cc to python-dev)
Cheers,
Kristján
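For readers without the attached C program, the shape of such a fairness test can be sketched in pure Python. This is an illustrative stand-in, not Kristján's actual test, and it exercises Python's threading.Lock rather than the raw Win32 primitives: one thread releases and immediately re-acquires a lock in a tight loop, and we count how often a competing thread actually gets in. With an unfair primitive the holder almost always wins; with a fair one the two counts are comparable.

```python
import threading
import time

def measure_handoffs(duration=0.5):
    """Count acquisitions by a holder thread (release/re-acquire in a
    tight loop) versus a competing thread, over `duration` seconds.
    The ratio hints at how fair the underlying lock primitive is."""
    lock = threading.Lock()
    stop = False
    competitor_acquisitions = 0

    def competitor():
        nonlocal competitor_acquisitions
        while not stop:
            with lock:
                competitor_acquisitions += 1

    t = threading.Thread(target=competitor)
    t.start()

    holder_acquisitions = 0
    end = time.time() + duration
    while time.time() < end:
        with lock:          # release happens at the end of each pass,
            holder_acquisitions += 1   # then we immediately re-acquire
    stop = True
    t.join()
    return holder_acquisitions, competitor_acquisitions
```

The absolute numbers depend on the platform's lock implementation and, in CPython, on GIL switching, which is precisely the coupling this thread is about.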
Hello Kristjan,
This depends entirely on the platform and primitives used to implement the GIL. I'm interested in windows.
Could you try ccbench (*) under Windows? The only Windows system I have here is a qemu virtual machine and it wouldn't be very realistic to do concurrency measurements on it.

(*) http://svn.python.org/view/sandbox/trunk/ccbench/

Thanks

Antoine.
Antoine Pitrou wrote:
Could you try ccbench (*) under Windows? The only Windows system I have here is a qemu virtual machine and it wouldn't be very realistic to do concurrency measurements on it.
I don't really know how this test works, so I won't claim to understand the results either. However, here you go:

C:\>systeminfo
OS Name: Microsoft Windows XP Professional
OS Version: 5.1.2600 Service Pack 3 Build 2600

C:\>c:\Python26\python.exe
Python 2.6.2 (r262:71605, Apr 14 2009, 22:40:02) [MSC v.1500 32 bit (Intel)] on win32

C:\>start /B /HIGH c:\Python26\python.exe c:\ccbench.py

--- Throughput ---

Pi calculation (Python)
threads=1: 377 iterations/s.
threads=2: 376 ( 99 %)
threads=3: 380 ( 100 %)
threads=4: 376 ( 99 %)

regular expression (C)
threads=1: 222 iterations/s.
threads=2: 213 ( 95 %)
threads=3: 223 ( 100 %)
threads=4: 218 ( 97 %)

bz2 compression (C)
threads=1: 324 iterations/s.
threads=2: 324 ( 99 %)
threads=3: 327 ( 100 %)
threads=4: 324 ( 100 %)

--- Latency ---

Background CPU task: Pi calculation (Python)
CPU threads=0: 0 ms. (std dev: 0 ms.)
CPU threads=1: 0 ms. (std dev: 0 ms.)
CPU threads=2: 0 ms. (std dev: 0 ms.)
CPU threads=3: 0 ms. (std dev: 0 ms.)
CPU threads=4: 0 ms. (std dev: 0 ms.)

Background CPU task: regular expression (C)
CPU threads=0: 0 ms. (std dev: 0 ms.)
CPU threads=1: 0 ms. (std dev: 0 ms.)
CPU threads=2: 0 ms. (std dev: 0 ms.)
CPU threads=3: 0 ms. (std dev: 0 ms.)
CPU threads=4: 0 ms. (std dev: 0 ms.)

Background CPU task: bz2 compression (C)
CPU threads=0: 0 ms. (std dev: 0 ms.)
CPU threads=1: 0 ms. (std dev: 0 ms.)
CPU threads=2: 0 ms. (std dev: 0 ms.)
CPU threads=3: 0 ms. (std dev: 0 ms.)
CPU threads=4: 0 ms. (std dev: 0 ms.)

-- 
Scott Dial
scott@scottdial.com
scodial@cs.indiana.edu
I don't really know how this test works, so I won't claim to understand the results either. However, here you go:
Thanks. Interesting results. I wonder what they would be like on a multi-core machine. The GIL seems to behave perfectly on your setup (no performance degradation due to concurrency, and zero latencies).

For a quick explanation of what the benchmark does:

- the "throughput" part launches N computational (CPU-bound) threads and measures the total work done per second, then compares the result to the 1-thread result. It does so with three different workloads, which have different impacts on the GIL. 100% is the most you can get on a single-core machine. On a multi-core machine, you can get more than 100% with the workload that explicitly releases the GIL before taxing the CPU (bz2 compression).

- the "latency" part launches N computational threads in the background, and the main thread listens for periodic ping messages on a UDP socket (the ping messages themselves are emitted from a separate Python process, so as to decouple it from the process under test). The latencies are the measured delays between the emission of a UDP message and the moment at which the recv() call returns in the main thread. This aims at reproducing the situation where a thread handles IO operations while one or several other threads perform heavy computations.

Regards

Antoine.
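The "throughput" half of that description can be reduced to a few lines of Python. This is a simplified sketch of the idea, not the actual ccbench code; the workload and duration are arbitrary choices for illustration:

```python
import threading
import time

def throughput(nthreads, work, duration=0.5):
    """Run `work` repeatedly in `nthreads` threads for roughly
    `duration` seconds and return total iterations per second,
    mirroring ccbench's throughput measurement in miniature."""
    counts = [0] * nthreads          # one slot per thread, no lock needed
    deadline = time.time() + duration

    def run(i):
        while time.time() < deadline:
            work()
            counts[i] += 1

    threads = [threading.Thread(target=run, args=(i,)) for i in range(nthreads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(counts) / duration

def cpu_work():
    # Pure-Python CPU-bound workload; holds the GIL throughout,
    # like ccbench's Pi calculation.
    return sum(i * i for i in range(10000))
```

On a GIL-bound interpreter, throughput(2, cpu_work) stays near (or below) throughput(1, cpu_work); only workloads that release the GIL, like bz2 compression, can exceed 100% on multiple cores.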
Antoine Pitrou wrote:
Interesting results. I wonder what they would be like on a multi-core machine. The GIL seems to behave perfectly on your setup (no performance degradation due to concurrency, and zero latencies).
You are correct, my machine is a single-core system. I don't have any multi-core systems around to test it on; I'm still in the stone age.

-- 
Scott Dial
scott@scottdial.com
scodial@cs.indiana.edu
Antoine Pitrou wrote:
I don't really know how this test works, so I won't claim to understand the results either. However, here you go:
Thanks.
Interesting results. I wonder what they would be like on a multi-core machine. The GIL seems to behave perfectly on your setup (no performance degradation due to concurrency, and zero latencies).
C:\downloads>C:\Python26\python.exe ccbench.py

--- Throughput ---

Pi calculation (Python)
threads=1: 691 iterations/s.
threads=2: 400 ( 57 %)
threads=3: 453 ( 65 %)
threads=4: 467 ( 67 %)
^- seems to have some contention

regular expression (C)
threads=1: 592 iterations/s.
threads=2: 598 ( 101 %)
threads=3: 587 ( 99 %)
threads=4: 586 ( 99 %)

bz2 compression (C)
threads=1: 536 iterations/s.
threads=2: 1056 ( 196 %)
threads=3: 1040 ( 193 %)
threads=4: 1060 ( 197 %)
^- seems to properly show that I have 2 cores here.

--- Latency ---

Background CPU task: Pi calculation (Python)
CPU threads=0: 0 ms. (std dev: 0 ms.)
CPU threads=1: 0 ms. (std dev: 0 ms.)
CPU threads=2: 0 ms. (std dev: 0 ms.)
CPU threads=3: 0 ms. (std dev: 0 ms.)
CPU threads=4: 0 ms. (std dev: 0 ms.)

Background CPU task: regular expression (C)
CPU threads=0: 0 ms. (std dev: 0 ms.)
CPU threads=1: 38 ms. (std dev: 18 ms.)
CPU threads=2: 173 ms. (std dev: 77 ms.)
CPU threads=3: 518 ms. (std dev: 264 ms.)
CPU threads=4: 661 ms. (std dev: 343 ms.)

Background CPU task: bz2 compression (C)
CPU threads=0: 0 ms. (std dev: 0 ms.)
CPU threads=1: 0 ms. (std dev: 0 ms.)
CPU threads=2: 0 ms. (std dev: 0 ms.)
CPU threads=3: 0 ms. (std dev: 0 ms.)
CPU threads=4: 0 ms. (std dev: 0 ms.)

John =;->
-----Original Message-----
From: python-dev-bounces+kristjan=ccpgames.com@python.org [mailto:python-dev-bounces+kristjan=ccpgames.com@python.org] On Behalf Of Antoine Pitrou
Sent: 21. október 2009 10:52
To: python-dev@python.org
Subject: Re: [Python-Dev] GIL behaviour under Windows
Hello Kristjan,
This depends entirely on the platform and primitives used to implement the GIL. I'm interested in windows.
Could you try ccbench (*) under Windows? The only Windows system I have here is a qemu virtual machine and it wouldn't be very realistic to do concurrency measurements on it.
Hi, I just want to stress that, according to my test, the current GIL implementation works as intended on Windows. But if we were to reimplement it, say using a CriticalSection, then yielding the GIL at the check interval (sys.setcheckinterval) wouldn't work as intended anymore. Just something to keep in mind in case anyone is thinking along those lines.
K
-----Original Message-----

Could you try ccbench (*) under Windows? The only Windows system I have here is a qemu virtual machine and it wouldn't be very realistic to do concurrency measurements on it.
I've run it twice on my dual-core machine. It hangs every time, but not in the same place:

D:\pydev\python\trunk\PCbuild>python.exe \tmp\ccbench.py
malloc 262144 gave 262144, diff 0
malloc 262144 gave 262144, diff 0
malloc 262144 gave 262144, diff 0
malloc 262144 gave 262144, diff 0
malloc 262144 gave 262144, diff 0
malloc 262144 gave 262144, diff 0
--- Throughput ---
Pi calculation (Python)
threads=1: 514 iterations/s.
threads=2: 403 ( 78 %)
threads=3: 392 ( 76 %)
threads=4: 364 ( 70 %)
regular expression (C)
threads=1: 443 iterations/s.
threads=2: 474 ( 106 %)
threads=3: 461 ( 104 %)
threads=4: 466 ( 105 %)
SHA1 hashing (C)
threads=1: 983 iterations/s.
threads=2: 1026 ( 104 %)
^C

D:\pydev\python\trunk\PCbuild>python.exe \tmp\ccbench.py
malloc 262144 gave 262144, diff 0
malloc 262144 gave 262144, diff 0
malloc 262144 gave 262144, diff 0
malloc 262144 gave 262144, diff 0
--- Throughput ---
Pi calculation (Python)
threads=1: 506 iterations/s.
threads=2: 405 ( 80 %)
I've run it twice on my dual-core machine. It hangs every time, but not in the same place:

D:\pydev\python\trunk\PCbuild>python.exe \tmp\ccbench.py
Ah, you should report a bug then. ccbench is pure Python (and not particularly evil Python), it shouldn't be able to crash the interpreter.
Antoine Pitrou skrev:
I've run it twice on my dual-core machine. It hangs every time, but not in the same place:

D:\pydev\python\trunk\PCbuild>python.exe \tmp\ccbench.py
Ah, you should report a bug then. ccbench is pure Python (and not particularly evil Python), it shouldn't be able to crash the interpreter.
It does not crash the interpreter, but it seems it can deadlock. Here is what I get on a quadcore (Vista, Python 2.6.3).

D:\>ccbench.py
--- Throughput ---
Pi calculation (Python)
threads=1: 568 iterations/s.
threads=2: 253 ( 44 %)
threads=3: 274 ( 48 %)
threads=4: 283 ( 49 %)
regular expression (C)
threads=1: 510 iterations/s.
threads=2: 508 ( 99 %)
threads=3: 503 ( 98 %)
threads=4: 502 ( 98 %)
bz2 compression (C)
threads=1: 456 iterations/s.
threads=2: 892 ( 195 %)
threads=3: 1320 ( 289 %)
threads=4: 1743 ( 382 %)
--- Latency ---
Background CPU task: Pi calculation (Python)
CPU threads=0: 0 ms. (std dev: 0 ms.)
CPU threads=1: 0 ms. (std dev: 0 ms.)
CPU threads=2: 0 ms. (std dev: 0 ms.)
CPU threads=3: 0 ms. (std dev: 0 ms.)
CPU threads=4: 0 ms. (std dev: 0 ms.)
Background CPU task: regular expression (C)
CPU threads=0: 0 ms. (std dev: 0 ms.)
CPU threads=1: 37 ms. (std dev: 21 ms.)
CPU threads=2: 379 ms. (std dev: 175 ms.)
CPU threads=3: 625 ms. (std dev: 310 ms.)
CPU threads=4: 718 ms. (std dev: 381 ms.)
Background CPU task: bz2 compression (C)
CPU threads=0: 0 ms. (std dev: 0 ms.)
CPU threads=1: 0 ms. (std dev: 0 ms.)
CPU threads=2: 0 ms. (std dev: 0 ms.)
CPU threads=3: 0 ms. (std dev: 0 ms.)
CPU threads=4: 1 ms. (std dev: 3 ms.)

D:\>ccbench.py
--- Throughput ---
Pi calculation (Python)
threads=1: 554 iterations/s.
threads=2: 400 ( 72 %)
threads=3: 273 ( 49 %)
threads=4: 231 ( 41 %)
regular expression (C)
threads=1: 508 iterations/s.
threads=2: 509 ( 100 %)
threads=3: 509 ( 100 %)
threads=4: 509 ( 100 %)
bz2 compression (C)
threads=1: 456 iterations/s.
threads=2: 897 ( 196 %)
threads=3: 1316 ( 288 %)
DEADLOCK

D:\>ccbench.py
--- Throughput ---
Pi calculation (Python)
threads=1: 559 iterations/s.
threads=2: 397 ( 71 %)
threads=3: 274 ( 49 %)
threads=4: 238 ( 42 %)
regular expression (C)
threads=1: 507 iterations/s.
threads=2: 499 ( 98 %)
threads=3: 505 ( 99 %)
threads=4: 495 ( 97 %)
bz2 compression (C)
threads=1: 455 iterations/s.
threads=2: 896 ( 196 %)
threads=3: 1320 ( 290 %)
threads=4: 1736 ( 381 %)
--- Latency ---
Background CPU task: Pi calculation (Python)
CPU threads=0: 0 ms. (std dev: 0 ms.)
CPU threads=1: 0 ms. (std dev: 0 ms.)
CPU threads=2: 0 ms. (std dev: 0 ms.)
CPU threads=3: 0 ms. (std dev: 0 ms.)
CPU threads=4: 0 ms. (std dev: 0 ms.)
Background CPU task: regular expression (C)
CPU threads=0: 0 ms. (std dev: 0 ms.)
CPU threads=1: 34 ms. (std dev: 21 ms.)
CPU threads=2: 358 ms. (std dev: 174 ms.)
CPU threads=3: 619 ms. (std dev: 312 ms.)
CPU threads=4: 742 ms. (std dev: 382 ms.)
Background CPU task: bz2 compression (C)
CPU threads=0: 0 ms. (std dev: 0 ms.)
CPU threads=1: 0 ms. (std dev: 0 ms.)
CPU threads=2: 0 ms. (std dev: 0 ms.)
CPU threads=3: 0 ms. (std dev: 0 ms.)
CPU threads=4: 6 ms. (std dev: 13 ms.)
Sturla Molden skrev:
does not crash the interpreter, but it seems it can deadlock.
Here is what I get con a quadcore (Vista, Python 2.6.3).
This is what I get with affinity set to CPU 3. There are deadlocks happening at random locations in ccbench.py. It gets worse with affinity set to one processor.

Sturla

D:\>start /AFFINITY 3 /B ccbench.py

D:\>--- Throughput ---
Pi calculation (Python)
threads=1: 554 iterations/s.
threads=2: 257 ( 46 %)
threads=3: 272 ( 49 %)
threads=4: 280 ( 50 %)
regular expression (C)
threads=1: 501 iterations/s.
threads=2: 505 ( 100 %)
threads=3: 493 ( 98 %)
threads=4: 507 ( 101 %)
bz2 compression (C)
threads=1: 455 iterations/s.
threads=2: 889 ( 195 %)
threads=3: 1309 ( 287 %)
threads=4: 1710 ( 375 %)
--- Latency ---
Background CPU task: Pi calculation (Python)
CPU threads=0: 0 ms. (std dev: 0 ms.)
CPU threads=1: 0 ms. (std dev: 0 ms.)
CPU threads=2: 0 ms. (std dev: 0 ms.)
CPU threads=3: 0 ms. (std dev: 0 ms.)
CPU threads=4: 0 ms. (std dev: 0 ms.)
Background CPU task: regular expression (C)
CPU threads=0: 0 ms. (std dev: 0 ms.)
CPU threads=1: 40 ms. (std dev: 22 ms.)
CPU threads=2: 384 ms. (std dev: 179 ms.)
CPU threads=3: 618 ms. (std dev: 314 ms.)
CPU threads=4: 713 ms. (std dev: 379 ms.)
Background CPU task: bz2 compression (C)
CPU threads=0: 0 ms. (std dev: 0 ms.)
CPU threads=1: 0 ms. (std dev: 0 ms.)
CPU threads=2: 0 ms. (std dev: 0 ms.)
CPU threads=3: 0 ms. (std dev: 3 ms.)
CPU threads=4: 0 ms. (std dev: 1 ms.)
Sturla Molden
It does not crash the interpreter, but it seems it can deadlock.
Kristján sent me a patch which I applied and which is supposed to fix this. Anyway, thanks for the numbers. The GIL does seem to fare a bit better (zero latency with the Pi calculation in the background) than under Linux, although it may be caused by the limited resolution of time.time() under Windows.

Regards

Antoine.
Antoine Pitrou wrote:
Sturla Molden writes:

It does not crash the interpreter, but it seems it can deadlock.
Kristján sent me a patch which I applied and is supposed to fix this. Anyway, thanks for the numbers. The GIL does seem to fare a bit better (zero latency with the Pi calculation in the background) than under Linux, although it may be caused by the limited resolution of time.time() under Windows.
Regards
Antoine.
You can use time.clock() instead to get <15ms resolution. Changing all instances of 'time.time' to 'time.clock' gives me this result (2-core machine, Python 2.6.2):

$ py ccbench.py
--- Throughput ---
Pi calculation (Python)
threads=1: 675 iterations/s.
threads=2: 388 ( 57 %)
threads=3: 374 ( 55 %)
threads=4: 445 ( 65 %)
regular expression (C)
threads=1: 588 iterations/s.
threads=2: 519 ( 88 %)
threads=3: 511 ( 86 %)
threads=4: 513 ( 87 %)
bz2 compression (C)
threads=1: 536 iterations/s.
threads=2: 949 ( 176 %)
threads=3: 900 ( 167 %)
threads=4: 927 ( 172 %)
--- Latency ---
Background CPU task: Pi calculation (Python)
CPU threads=0: 24727 ms. (std dev: 0 ms.)
CPU threads=1: 27930 ms. (std dev: 0 ms.)
CPU threads=2: 31029 ms. (std dev: 0 ms.)
CPU threads=3: 34170 ms. (std dev: 0 ms.)
CPU threads=4: 37292 ms. (std dev: 0 ms.)
Background CPU task: regular expression (C)
CPU threads=0: 40454 ms. (std dev: 0 ms.)
CPU threads=1: 43674 ms. (std dev: 21 ms.)
CPU threads=2: 47100 ms. (std dev: 165 ms.)
CPU threads=3: 50441 ms. (std dev: 304 ms.)
CPU threads=4: 53707 ms. (std dev: 377 ms.)
Background CPU task: bz2 compression (C)
CPU threads=0: 56138 ms. (std dev: 0 ms.)
CPU threads=1: 59332 ms. (std dev: 0 ms.)
CPU threads=2: 62436 ms. (std dev: 0 ms.)
CPU threads=3: 66130 ms. (std dev: 0 ms.)
CPU threads=4: 69859 ms. (std dev: 0 ms.)
Le mercredi 21 octobre 2009 à 12:42 -0500, John Arbash Meinel a écrit :
You can use time.clock() instead to get <15ms resolution. Changing all instances of 'time.time' to 'time.clock' gives me this result:
[snip]
--- Latency ---
Background CPU task: Pi calculation (Python)
CPU threads=0: 24727 ms. (std dev: 0 ms.) CPU threads=1: 27930 ms. (std dev: 0 ms.) CPU threads=2: 31029 ms. (std dev: 0 ms.) CPU threads=3: 34170 ms. (std dev: 0 ms.) CPU threads=4: 37292 ms. (std dev: 0 ms.)
Well, apparently time.clock() has a per-process time reference, which makes it unusable for this benchmark :-( (the numbers above are obviously incorrect)

Regards

Antoine.
Antoine Pitrou wrote:
Le mercredi 21 octobre 2009 à 12:42 -0500, John Arbash Meinel a écrit :
You can use time.clock() instead to get <15ms resolution. Changing all instances of 'time.time' to 'time.clock' gives me this result: [snip] --- Latency ---
Background CPU task: Pi calculation (Python)
CPU threads=0: 24727 ms. (std dev: 0 ms.) CPU threads=1: 27930 ms. (std dev: 0 ms.) CPU threads=2: 31029 ms. (std dev: 0 ms.) CPU threads=3: 34170 ms. (std dev: 0 ms.) CPU threads=4: 37292 ms. (std dev: 0 ms.)
Well apparently time.clock() has a per-process time reference, which makes it unusable for this benchmark :-( (the numbers above are obviously incorrect)
Regards
Antoine.
I believe that 'time.clock()' is measured as seconds since the start of the process. So yeah, I think spawning a background process will reset this counter back to 0.

John =:->
You are right, on Windows time.clock() is based relative to its first call in the process. There is no such promise made on Unix. QueryPerformanceCounter() (what time.clock() uses) is a robust high-resolution timer that is processor/core independent. It should be possible to use it across different processes too, if this annoying rebasing weren't there. I wonder if we should consider this a bug? If so, I see three remedies:

1) Simply use the absolute value and stop creating this arbitrary zero point. This should be okay since the same is done on Unix, but it would be a break from the documented behavior. Nevertheless, the absolute value of this timer is irrelevant; it is the deltas that matter.

2) Add a flag to time.clock() for it to return the absolute value.

3) Create yet another API: either something like time.rclock() returning the absolute value, or something like time.clockbase() returning the base of the zeroed clock timer.

If you just want to patch locally for your timing pleasure, change line 184 of timemodule.c to:

    diff = (double)(now.QuadPart);

K
-----Original Message-----
From: python-dev-bounces+kristjan=ccpgames.com@python.org [mailto:python-dev-bounces+kristjan=ccpgames.com@python.org] On Behalf Of Antoine Pitrou
---
Background CPU task: Pi calculation (Python)
CPU threads=0: 24727 ms. (std dev: 0 ms.) CPU threads=1: 27930 ms. (std dev: 0 ms.) CPU threads=2: 31029 ms. (std dev: 0 ms.) CPU threads=3: 34170 ms. (std dev: 0 ms.) CPU threads=4: 37292 ms. (std dev: 0 ms.)
Well apparently time.clock() has a per-process time reference, which makes it unusable for this benchmark :-( (the numbers above are obviously incorrect)
Kristján Valur Jónsson writes:

You are right, on Windows time.clock() is based relative to its first call in the process. There is no such promise made on Unix. QueryPerformanceCounter() (what time.clock() uses) is a robust high-resolution timer that is processor/core independent. It should be possible to use it across different processes too, if this annoying rebasing weren't there.
Well, could we simply have a high-resolution time.time()? Or is Windows just too limited to provide this? Regards Antoine.
On Wed, Oct 21, 2009 at 1:51 PM, Antoine Pitrou wrote:

Kristján Valur Jónsson writes:

You are right, on Windows time.clock() is based relative to its first call in the process. There is no such promise made on Unix. QueryPerformanceCounter() (what time.clock() uses) is a robust high-resolution timer that is processor/core independent. It should be possible to use it across different processes too, if this annoying rebasing weren't there.
Well, could we simply have a high-resolution time.time()? Or is Windows just too limited to provide this?
Presumably you could fake something like this by combining output from an initial time(), an initial QueryPerformanceCounter() and the current QueryPerformanceCounter(). But it makes more sense to understand why someone chose to implement time.clock() on Windows the way they did -- this seems very broken to me, and I think it should be changed. Of course, there are no doubt people relying on the broken behavior...

-- 
Curt Hagenlocher
curt@hagenlocher.org
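Curt's "fake it by combining the two" suggestion can be sketched in a few portable lines. As an assumption for illustration, time.perf_counter() (added to Python much later than this thread) stands in for QueryPerformanceCounter(), and time.time() for the low-resolution system clock:

```python
import time

# Capture the offset once: anchor the high-resolution relative counter
# (time.perf_counter, standing in for QueryPerformanceCounter) to the
# absolute wall clock. The epoch error is frozen at this moment, which
# is exactly the limitation discussed in this thread.
_T0 = time.time() - time.perf_counter()

def hires_time():
    """Absolute wall-clock seconds with the resolution of the
    performance counter (sketch of Curt's combination idea)."""
    return _T0 + time.perf_counter()
```

Two processes doing this independently will capture slightly different offsets, so this gives each process a high-resolution clock but not a shared cross-process time reference, as the follow-up messages point out.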
Curt Hagenlocher wrote:
But it makes more sense to understand why someone chose to implement time.clock() on Windows the way they did -- this seems very broken to me, and I think it should be changed.
Some SVN detective work takes this all the way back to r7713 (1997-04-02). The original implementation was checked in by Guido and attributed to Mark Hammond. So, we should ask Mark why he did that. Can anyone honestly use it, as it is, without already having normalized it across platforms themselves?

I don't know how much of an impact it is, but the current implementation of clock() does not require argument parsing, so the proposal to add an "absolute" boolean-flag argument is perhaps bad. This is generally a function used for performance timing, and that proposal adds some amount of latency to the query. The proposal to add a clockbase() function is perhaps better because of this: you need only call it once, and you can cache the result for the life of your process.

-- 
Scott Dial
scott@scottdial.com
scodial@cs.indiana.edu
On 22/10/2009 8:52 AM, Scott Dial wrote:
Curt Hagenlocher wrote:
But it makes more sense to understand why someone chose to implement time.clock() on Windows the way they did -- this seems very broken to me, and I think it should be changed.
Some SVN detective work takes this to all the way back to r7713 (1997-04-02). The original implementation checked by Guido and attributed to Mark Hammond. So, we should ask Mark why he did that.
The thread seems to be at http://groups.google.com/group/comp.lang.python/browse_frm/thread/be32478a4b... (although I do seem to recall more discussion of the patch which I currently can't find). I'd be very surprised if any applications rely on the fact that each process starts counting at zero, so if someone can come up with a high-res counter which avoids that artifact I'd expect it could be used. Cheers, Mark
On Thu, 2009-10-22 at 15:21 +1100, Mark Hammond wrote:
I'd be very surprised if any applications rely on the fact that each process starts counting at zero, so if someone can come up with a high-res counter which avoids that artifact I'd expect it could be used.
Could you offset it by the system time on the first call? -Rob
Robert Collins
Could you offset it by the system time on the first call?
Two problems:

- the two measurements are not done simultaneously, so the result *still* does not guarantee you have the same time reference in all processes (but gives you the illusion you do, which is perhaps worse)

- adding a precise measure to an imprecise measure doesn't make the result precise, but imprecise (or, rather, precise but inexact); in other words, if the system time only gives a 0.01 second resolution, adding a high-resolution timer only gives you an illusion of accuracy

Therefore it seems a very bad solution. The only way, AFAICT, to do this right is for Windows to provide a high-resolution system time. It sounds astonishing that this doesn't exist.

Regards

Antoine.
-----Original Message-----
From: python-dev-bounces+kristjan=ccpgames.com@python.org [mailto:python-dev-bounces+kristjan=ccpgames.com@python.org] On Behalf Of Robert Collins
I'd be very surprised if any applications rely on the fact that each process starts counting at zero, so if someone can come up with a high-res counter which avoids that artifact I'd expect it could be used.
Could you offset it by the system time on the first call?
Since system time has low granularity, it would negate our attempt at having time.clock() reflect the same time between processes. In my opinion, the simplest way is to simply stop choosing the first call as a zero base, and use whatever arbitrary time the system has chosen for us. The documentation could then read:

"On Windows, this function returns wall-clock seconds elapsed since an arbitrary, system-wide epoch, as a floating point number, based on the Win32 function QueryPerformanceCounter(). The resolution is typically better than one microsecond."

K
On 22/10/2009 3:45 PM, Robert Collins wrote:
On Thu, 2009-10-22 at 15:21 +1100, Mark Hammond wrote:
I'd be very surprised if any applications rely on the fact that each process starts counting at zero, so if someone can come up with a high-res counter which avoids that artifact I'd expect it could be used.
Could you offset it by the system time on the first call?
Off the top of my head, I guess that depends on the actual accuracy required (ie, how many clock ticks elapse between querying the time and the high-resolution timer). Starting at 0 works fine for profiling in a single process, the predominant use-case when this was done; I guess it depends on the specific requirements and time-intervals being dealt with in the cross-process case which determines how suitable that might be? Cheers, Mark
-----Original Message-----
From: python-dev-bounces+kristjan=ccpgames.com@python.org [mailto:python-dev-bounces+kristjan=ccpgames.com@python.org] On Behalf Of Mark Hammond

The thread seems to be at http://groups.google.com/group/comp.lang.python/browse_frm/thread/be32478a4b8e77b6/816d6228119a3474 (although I do seem to recall more discussion of the patch which I currently can't find). I'd be very surprised if any applications rely on the fact that each process starts counting at zero, so if someone can come up with a high-res counter which avoids that artifact I'd expect it could be used.
The point in question seems to be this (from the thread):

* Need some sort of static "start value", which is set when the process starts, so I can return to Python in seconds. An easy hack is to set this the first time clock() is called, but then it won't reflect any sort of real time - but would be useful for relative times...

But the argumentation is flawed. There is an implicit "start" value (technically, CPU power-up). The point concedes that no sort of real time is returned, and so the particular "start" time chosen is immaterial.

K
On 22/10/2009 8:47 PM, Kristján Valur Jónsson wrote:
The point in question seems to be this this (from the thread): * Need some sort of static "start value", which is set when the process starts, so I can return to Python in seconds. An easy hack is to set this the first time clock() is called, but then it wont reflect any sort of real time - but would be useful for relative times...
But the argumentation is flawed.
It was made in the context of the APIs available to implement this. The code is short and sweet in timemodule.c, so please do go ahead and fix my flawed reasoning. For reference:

#if defined(MS_WINDOWS) && !defined(__BORLANDC__)
/* Due to Mark Hammond and Tim Peters */
static PyObject *
time_clock(PyObject *self, PyObject *unused)
{
	static LARGE_INTEGER ctrStart;
	static double divisor = 0.0;
	LARGE_INTEGER now;
	double diff;

	if (divisor == 0.0) {
		LARGE_INTEGER freq;
		QueryPerformanceCounter(&ctrStart);
		if (!QueryPerformanceFrequency(&freq) || freq.QuadPart == 0) {
			/* Unlikely to happen - this works on all intel
			   machines at least!  Revert to clock() */
			return PyFloat_FromDouble(((double)clock()) /
			                          CLOCKS_PER_SEC);
		}
		divisor = (double)freq.QuadPart;
	}
	QueryPerformanceCounter(&now);
	diff = (double)(now.QuadPart - ctrStart.QuadPart);
	return PyFloat_FromDouble(diff / divisor);
}

Cheers,
Mark.
-----Original Message-----
From: Mark Hammond [mailto:mhammond@skippinet.com.au]
Sent: 22. október 2009 10:58
To: Kristján Valur Jónsson
Cc: Scott Dial; python-dev@python.org

It was made in the context of the APIs available to implement this. The code is short and sweet in timemodule.c, so please do go ahead and fix my flawed reasoning.
... I'm sorry, I don't want to start a flame war here, it just seems that if you need a zero point, and arbitrarily choose the first call to time.clock(), you could just as well use the implicit zero point already provided by the system.
-----Original Message-----
From: python-dev-bounces+kristjan=ccpgames.com@python.org [mailto:python-dev-bounces+kristjan=ccpgames.com@python.org] On Behalf Of M.-A. Lemburg

I'm not sure I understand why time.clock() should be considered broken.
Ah, well, not broken, but it could be even more useful: if it used the implicit system-wide epoch rather than the one based on the first call within each process, it could be useful for cross-process high-resolution timings. Anyway, it is simple enough to patch it on Windows using ctypes if one needs that kind of behaviour:

# nuclock.py
import ctypes
import time

counter = ctypes.c_uint64()
pcounter = ctypes.byref(counter)
ctypes.windll.kernel32.QueryPerformanceFrequency(pcounter)
frequency = float(counter.value)
QPC = ctypes.windll.kernel32.QueryPerformanceCounter

def nuclock():
    QPC(pcounter)
    return float(counter.value) / frequency

time.clock = nuclock

Cheers,
Kristjan
Presumably you could fake something like this by combining output from an initial time(), an initial QueryPerformanceCounter() and the current QueryPerformanceCounter(). But it makes more sense to understand why someone chose to implement time.clock() on Windows the way they did -- this seems very broken to me, and I think it should be changed.
Yes. The problem with QPC is that although it has very high resolution, it is not precise in the long term. And GetSystemTimeAsFileTime() is high precision in the long term but only updated every 20ms or so. In EVE Online we use a combination of the two for high resolution and long-term precision. But I'm not happy with the way I'm doing it. It needs some sort of smoothing, of course. I've even played with using Kalman filtering to do it... The idea is to use the low-frequency timer to apply correction coefficients to the high-frequency timer, yet keep the flow of time smooth (no backwards jumps because of corrections). An optimal solution has so far eluded me.

Cheers,
K
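A rough illustration of the combination Kristján describes — this is not the EVE Online code, just a sketch using time.perf_counter() and time.time() as cross-platform stand-ins for QueryPerformanceCounter() and GetSystemTimeAsFileTime(), with a simple proportional slew in place of a Kalman filter (the class name and the slew constant are invented for illustration):

```python
import time

class CombinedClock:
    """Blend a fast, drifting counter with a slow, accurate clock.

    The high-resolution counter supplies sub-millisecond deltas; the
    wall clock supplies the long-term reference.  Instead of jumping
    straight to the wall clock (which could move time backwards), a
    small fraction of the observed error is absorbed on each call.
    """
    def __init__(self, slew=0.05):
        self.slew = slew                    # fraction of error corrected per call
        self._qpc0 = time.perf_counter()    # high-resolution anchor
        self._offset = time.time()          # wall-clock epoch at the anchor
        self._last = self._offset

    def now(self):
        hires = self._offset + (time.perf_counter() - self._qpc0)
        error = time.time() - hires         # long-term drift estimate
        correction = self.slew * error
        self._offset += correction          # absorb drift gradually
        hires += correction
        if hires < self._last:              # never step backwards
            hires = self._last
        self._last = hires
        return hires

clock = CombinedClock()
a = clock.now()
b = clock.now()
assert b >= a                               # time never runs backwards
```

A real implementation would tune the slew rate against the update granularity of the coarse clock; this only shows the "correct gently, never jump" idea.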
Kristján Valur Jónsson wrote:
Yes. The problem with QPC is that although it has very high resolution, it is not precise in the long term. And GetSystemTimeAsFileTime() is high precision in the long term but only updated every 20ms or so. In EVE Online we use a combination of the two for high resolution and long-term precision. But I'm not happy with the way I'm doing it. It needs some sort of smoothing, of course. I've even played with using Kalman filtering to do it... The idea is to use the low-frequency timer to apply correction coefficients to the high-frequency timer, yet keep the flow of time smooth (no backwards jumps because of corrections). An optimal solution has so far eluded me.
That sounds very similar to the problem spec for system time corrections in Network Time Protocol client implementations. Perhaps the time drifting algorithms in the NTP specs are relevant? Or are they too slow to correct discrepancies? Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia ---------------------------------------------------------------
Thanks, I'll take a look in that direction.
-----Original Message----- From: Nick Coghlan [mailto:ncoghlan@gmail.com] I've even played with using Kalman filtering to do it... The idea is
to use the low frequency timer to apply correction coefficients to the high frequency timer, yet keep the flow of time smooth (no backwards jumps because of corrections.). An optimal solution has so far eluded me.
That sounds very similar to the problem spec for system time corrections in Network Time Protocol client implementations. Perhaps the time drifting algorithms in the NTP specs are relevant? Or are they too slow to correct discrepancies?
K
Kristján Valur Jónsson skrev:
Thanks, I'll take a look in that direction.
I have a suggestion, forgive me if I am totally ignorant. :-)
Sturla Molden
[C code snippet lost in the archive]
Sturla Molden skrev:
I have a suggestion, forgive me if I am totally ignorant. :-)
Ah, damn... Since there is a GIL, we don't need any of that crappy synchronization. And my code does not correct for the 20 ms time jitter in GetSystemTimeAsFileTime. Sorry! S.M.
Curt Hagenlocher wrote:
On Wed, Oct 21, 2009 at 1:51 PM, Antoine Pitrou
wrote: Kristján Valur Jónsson
writes: You are right: on Windows, time.clock() is relative to its first call in the process. There is no such promise made on Unix. QueryPerformanceCounter() (what time.clock() uses) is a robust high-resolution timer that is processor/core independent. It should be possible to use it across different processes too, if this annoying rebasing weren't there.
Well, could we simply have a high-resolution time.time()? Or is Windows just too limited to provide this?
Presumably you could fake something like this by combining output from an initial time(), an initial QueryPerformanceCounter() and the current QueryPerformanceCounter(). But it makes more sense to understand why someone chose to implement time.clock() on Windows the way they did -- this seems very broken to me, and I think it should be changed.
Of course, there are no doubt people relying on the broken behavior...
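The combination Curt describes can be written down in a few lines; a hedged sketch using the modern time.perf_counter() as a cross-platform stand-in for QueryPerformanceCounter() (the name hires_time is invented for illustration):

```python
import time

# Anchor once, at import: a wall-clock epoch plus a high-resolution counter.
_T0 = time.time()            # coarse but absolute (system epoch)
_Q0 = time.perf_counter()    # fine but relative (arbitrary zero point)

def hires_time():
    """Absolute time with the resolution of the performance counter.

    Implements the suggestion above: initial time() plus elapsed
    performance-counter time.  Drift between the two clocks is NOT
    corrected here, so readings slowly diverge from time.time().
    """
    return _T0 + (time.perf_counter() - _Q0)
```

This fakes a high-resolution time.time() for one process; it does not solve the cross-process comparability problem, since each process anchors its own epoch.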
I'm not sure I understand why time.clock() should be considered broken.

time.clock() is used for measuring process CPU time and is usually only used in a relative way, ie. you remember the start value, do something, then subtract the end value from the start value.

For absolute CPU time values associated with a process, it's usually better to rely on other APIs such as getrusage() on Unix.

You might want to have a look at the systimes module that comes with pybench for some alternative timers: Tools/pybench/systimes.py

This module tries to provide a cross-platform API for process and system time:

"""
systimes() user and system timer implementations for use by pybench.

This module implements various different strategies for measuring performance timings. It tries to choose the best available method based on the platform and available tools.

On Windows, it is recommended to have the Mark Hammond win32 package installed. Alternatively, the Thomas Heller ctypes package can also be used.

On Unix systems, the standard resource module provides the highest resolution timings. Unfortunately, it is not available on all Unix platforms.

If no supported timing methods based on process time can be found, the module reverts to the highest resolution wall-clock timer instead. The system time part will then always be 0.0.

The module exports one public API:

def systimes():

    Return the current timer values for measuring user and system time as a tuple of seconds (user_time, system_time).

Copyright (c) 2006, Marc-Andre Lemburg (mal@egenix.com). See the documentation for further information on copyrights, or contact the author. All Rights Reserved.
"""

--
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Source  (#1, Oct 22 2009)
Python/Zope Consulting and Support ... http://www.egenix.com/ mxODBC.Zope.Database.Adapter ... http://zope.egenix.com/ mxODBC, mxDateTime, mxTextTools ... http://python.egenix.com/
::: Try our new mxODBC.Connect Python Database Interface for free ! :::: eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48 D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg Registered at Amtsgericht Duesseldorf: HRB 46611 http://www.egenix.com/company/contact/
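As a concrete illustration of the getrusage() route mentioned above, a minimal Unix-only sketch (the wrapper name cpu_times is mine, not part of systimes.py):

```python
import resource

def cpu_times():
    """Return (user_time, system_time) for the current process, in seconds.

    Uses getrusage(RUSAGE_SELF) from the standard resource module, as
    suggested above; on most Unix platforms this gives finer-grained
    process CPU time than the C clock() behind time.clock().
    Not available on Windows.
    """
    ru = resource.getrusage(resource.RUSAGE_SELF)
    return ru.ru_utime, ru.ru_stime

user, system = cpu_times()
```

Unlike a wall-clock timer, these values only advance while the process actually consumes CPU, which is what makes them suitable for relative before/after measurements.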
M.-A. Lemburg
I'm not sure I understand why time.clock() should be considered broken.
Because in some cases you want comparable high-resolution numbers from distinct processes.
time.clock() is used for measuring process CPU time
According to the docs, under Windows it measures wall-clock time rather than CPU time.

Regards
Antoine.
Antoine Pitrou skrev:
Kristján sent me a patch which I applied and is supposed to fix this. Anyway, thanks for the numbers. The GIL does seem to fare a bit better (zero latency with the Pi calculation in the background) than under Linux, although it may be caused by the limited resolution of time.time() under Windows.
My criticism of the GIL on python-ideas was partly motivated by this: http://blip.tv/file/2232410

However, David Beazley is not talking about Windows. Since the GIL is apparently not a mutex on Windows, it could behave differently. So I wrote a small script that constructs a GIL battle, and records how often a check interval results in a thread-switch or not. For monitoring check intervals, I used a small C extension to read _Py_Ticker from ceval.c. It is not declared static, so I could easily hack into it.

With two threads and a check interval of 100, only 61 of 100000 check intervals failed to produce a thread-switch in the interpreter. I'd call that rather fair. :-)

And in case someone asks, the nthreads=1 case is just for verification.

S.M.

D:\>test.py
check interval = 1
nthreads=1, swiched=0, missed=100000
nthreads=2, swiched=57809, missed=42191
nthreads=3, swiched=91535, missed=8465
nthreads=4, swiched=99751, missed=249
nthreads=5, swiched=95839, missed=4161
nthreads=6, swiched=100000, missed=0

D:\>test.py
check interval = 10
nthreads=1, swiched=0, missed=100000
nthreads=2, swiched=99858, missed=142
nthreads=3, swiched=99992, missed=8
nthreads=4, swiched=100000, missed=0
nthreads=5, swiched=100000, missed=0
nthreads=6, swiched=100000, missed=0

D:\>test.py
check interval = 100
nthreads=1, swiched=0, missed=100000
nthreads=2, swiched=99939, missed=61
nthreads=3, swiched=100000, missed=0
nthreads=4, swiched=100000, missed=0
nthreads=5, swiched=100000, missed=0
nthreads=6, swiched=100000, missed=0

D:\>test.py
check interval = 1000
nthreads=1, swiched=0, missed=100000
nthreads=2, swiched=99999, missed=1
nthreads=3, swiched=100000, missed=0
nthreads=4, swiched=100000, missed=0
nthreads=5, swiched=100000, missed=0
nthreads=6, swiched=100000, missed=0
Sturla Molden skrev:
However, David Beazley is not talking about Windows. Since the GIL is apparently not a mutex on Windows, it could behave differently. So I wrote a small script that constructs a GIL battle, and records how often a check interval results in a thread-switch or not. For monitoring check intervals, I used a small C extension to read _Py_Ticker from ceval.c. It is not declared static, so I could easily hack into it.
Anyway, if anyone wants to run a GIL battle, here is the code I used. If it turns out the GIL is far worse with pthreads, as it is implemented with a mutex, it might be a good idea to reimplement it with an event object as it is on Windows.

Sturla Molden

----------------

In python:

from giltest import *
from time import clock
import threading
import sys

def thread(rank, battle, start):
    while not start.isSet():
        if rank == 0:
            start.set()
    try:
        while 1:
            battle.record(rank)
    except:
        pass

if __name__ == '__main__':
    sys.setcheckinterval(1000)
    print "check interval = %d" % sys.getcheckinterval()
    for nthreads in range(1,7):
        start = threading.Event()
        battle = GIL_Battle(100000)
        threads = [threading.Thread(target=thread, args=(i,battle,start))
                   for i in range(1,nthreads)]
        for t in threads:
            t.setDaemon(True)
            t.start()
        thread(0, battle, start)
        for t in threads:
            t.join()
        s,m = battle.report()
        print "nthreads=%d, swiched=%d, missed=%d" % (nthreads, s, m)

In Cython or Pyrex:

from exceptions import Exception

cdef extern from *:
    ctypedef int vint "volatile int"
    vint _Py_Ticker

class StopBattle(Exception):
    pass

cdef class GIL_Battle:
    """ tests the fairness of the GIL """

    cdef vint prev_tick, prev_rank, switched, missed
    cdef int trials

    def __cinit__(GIL_Battle self, int trials=100000):
        self.prev_tick = _Py_Ticker
        self.prev_rank = -1
        self.missed = 0
        self.switched = 0
        self.trials = trials

    def record(GIL_Battle self, int rank):
        if self.trials == self.switched + self.missed:
            raise StopBattle
        if self.prev_rank == -1:
            self.prev_tick = _Py_Ticker
            self.prev_rank = rank
        else:
            if _Py_Ticker > self.prev_tick:
                if self.prev_rank == rank:
                    self.missed += 1
                else:
                    self.switched += 1
                self.prev_tick = _Py_Ticker
                self.prev_rank = rank
            else:
                self.prev_tick = _Py_Ticker

    def report(GIL_Battle self):
        return int(self.switched), int(self.missed)
Sturla Molden
With two threads and a check interval of 100, only 61 of 100000 check intervals failed to produce a thread-switch in the interpreter. I'd call that rather fair.
This number lacks the elapsed time. 61 switches in one second is probably enough, the same amount of switches in 10 or 20 seconds is too small (at least for threads needing good responsivity, e.g. I/O threads). Also, "fair" has to take into account the average latency and its relative stability, which is why I wrote ccbench. Antoine.
Antoine Pitrou wrote:
Sturla Molden
writes: With two threads and a check interval of 100, only 61 of 100000 check intervals failed to produce a thread-switch in the interpreter. I'd call that rather fair.
This number lacks the elapsed time. 61 switches in one second is probably enough, the same amount of switches in 10 or 20 seconds is too small (at least for threads needing good responsivity, e.g. I/O threads).
I read Sturla as saying there were 99,939 switches out of a possible 100,000, with sys.checkinterval set to 100.

Tres.
--
===================================================================
Tres Seaver          +1 540-429-0999          tseaver@palladion.com
Palladion Software   "Excellence by Design"    http://palladion.com
Tres Seaver
I read Sturla as saying there were 99,939 switches out of a possible 100,000, with sys.checkinterval set to 100.
Oops, you're right. But depending on the elapsed time (again :-)), it may be too high, because too many switches per second will add a lot of overhead and decrease performance. Regards Antoine.
I know I already posted some relevant threads to the other discussion,
but I wanted to point out a couple of specific comments on GIL
fairness from the discussion:
http://mail.python.org/pipermail/python-dev/2009-May/089752.html
http://mail.python.org/pipermail/python-dev/2009-May/089755.html
- Phillip
On Thu, Oct 22, 2009 at 10:16 AM, Antoine Pitrou
Tres Seaver
writes: I read Sturla as saying there were 99,939 switches out of a possible 100,000, with sys.checkinterval set to 100.
Oops, you're right. But depending on the elapsed time (again :-)), it may be too high, because too many switches per second will add a lot of overhead and decrease performance.
Regards
Antoine.
Antoine Pitrou skrev:
This number lacks the elapsed time. 61 switches in one second is probably enough, the same amount of switches in 10 or 20 seconds is too small (at least for threads needing good responsivity, e.g. I/O threads).
Also, "fair" has to take into account the average latency and its relative stability, which is why I wrote ccbench.
Since I am a scientist and statistics interests me, let's do this properly :-) Here is a suggestion:

_Py_Ticker is a circular variable. Thus, it can be transformed to an angle measured in radians, using:

a = 2 * pi * _Py_Ticker / _Py_CheckInterval

With simultaneous measurements of a, check interval count x, and time y (µs), we can fit the multiple regression:

y = b0 + b1*cos(a) + b2*sin(a) + b3*x + err

using some non-linear least squares solver. We can then extract all the statistics we need on interpreter latencies for "ticks" with and without periodic checks.

On a Python setup with many missed thread switches (pthreads, according to D. Beazley), we could just extend the model to take into account successful and unsuccessful check intervals:

y = b0 + b1*cos(a) + b2*sin(a) + b3*x1 + b4*x2 + err

with x1 being successful thread switches and x2 being missed thread switches. But at least on Windows we can use the simpler model.

The reason why multiple regression is needed is that the record method of my GIL_Battle class is not called on every interpreter tick. I thus cannot measure each latency precisely, which I could have done with a direct hook into ceval.c. So statistics to the rescue. But on the bright side, it reduces the overhead of the profiler.

Would that help?

Sturla Molden
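Incidentally, the model above is linear in the coefficients b0..b3 — the nonlinearity sits only in the transformed regressors cos(a) and sin(a) — so an ordinary least-squares fit suffices, no non-linear solver needed. A minimal sketch with NumPy on synthetic data; the coefficient values and noise level here are invented for illustration, not measured:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
check_interval = 100

# Simulated measurements: tick position as an angle, check count, latency.
ticker = rng.integers(0, check_interval, size=n)       # stands in for _Py_Ticker
a = 2 * np.pi * ticker / check_interval                # circular transform
x = rng.integers(0, 5, size=n)                         # check intervals elapsed
b_true = (50.0, 3.0, -2.0, 120.0)                      # b0..b3, made up
y = (b_true[0] + b_true[1] * np.cos(a) + b_true[2] * np.sin(a)
     + b_true[3] * x + rng.normal(0.0, 0.5, size=n))   # latency in µs + noise

# Design matrix for  y = b0 + b1*cos(a) + b2*sin(a) + b3*x + err
X = np.column_stack([np.ones(n), np.cos(a), np.sin(a), x])
b_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.round(b_hat, 1))   # estimated b0, b1, b2, b3
```

The residuals of this fit would then give the latency statistics Sturla is after; the extended model with x1 and x2 just adds one more column to the design matrix.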
Kristján Valur Jónsson wrote: ...
This depends entirely on the platform and primitives used to implement the GIL. I'm interested in windows. There, I found this article: http://fonp.blogspot.com/2007/10/fairness-in-win32-lock-objects.html So, you may be on to something. Perhaps a simple C test is in order then?
I did that. I found, on my dual-core vista machine, that running "release", that both Mutexes and CriticalSections behaved as you describe, with no "fairness". Using a "semaphore" seems to retain fairness, however. "fairness" was retained in debug builds too, strangely enough.
Now, Python uses none of these. On windows, it uses an "Event" object coupled with an atomically updated counter. This also behaves fairly.
The test application is attached.
I think that you ought to substantiate your claims better, maybe with a specific platform and using some test like the above.
On the other hand, it shows that we must be careful what we use. There has been some talk of using CriticalSections for the GIL on windows. This test ought to show the danger of that. The GIL is different than a regular lock. It is a reverse-lock, really, and therefore may need to be implemented in its own special way, if we want very fast mutexes for the rest of the system (cc to python-dev)
Cheers,
Kristján
I can compile and run this, but I'm not sure I know how to interpret the results. If I understand it correctly, then everything but "Critical Sections" is fair on my Windows Vista machine.

To run, I changed the line "#define EVENT" to EVENT, MUTEX, SEMAPHORE and CRIT. I then built and ran in the "Release" configuration (using VS 2008 Express).

For all but CRIT, I saw things like:

thread 5532 reclaims GIL
thread 5532 working 51234 units
thread 5532 worked 51234 units: 1312435761
thread 5532 flashing GIL
thread 5876 reclaims GIL
thread 5876 working 51234 units
thread 5876 worked 51234 units: 1312435761
thread 5876 flashing GIL

where there would be 4 lines for one thread, then 4 lines for the other thread.

For CRIT, I saw something more like 50 lines for one thread, and then 50 lines for the other thread.

This is Vista Home Basic, and VS 2008 Express Edition, with a 2-core machine.

John
=:->
I'd just like to point out some previous discussion about implementing
the GIL as a critical section or semaphore on Windows since it's come
up here (although not quite the center of the OP's proposal AFAICS):
http://bugs.python.org/issue6132
http://mail.python.org/pipermail/python-dev/2009-May/089746.html
Some of this is more low-level. I did see higher performance when
using non-Event objects, although I have not had time to follow up and
do a deeper analysis. The GIL flashing "problem" with critical
sections can very likely be rectified with a call to Sleep(0) or
YieldProcessor() for those who are worried about it. On the subject of
fairness, I tested various forms of the GIL on my multi-threaded ISAPI
extension, where every millisecond counts when under high concurrency,
and fairness wasn't an issue for single- or multi-core systems. It may
be anecdotal, but it also may be that the issue is somewhat
over-blown.
It seems like these discussions come up in one form or another a few
times a year and don't really get anywhere - probably because many
people find that it's easier to just run one instance of Python on
each core/processor. IPC is cheap (cPickle rocks!), and Python's
memory footprint is acceptable by today's standards. Still, it is an
interesting topic to many, myself included.
Also, many people keep talking about inefficiencies due to threads
waking up to a locked GIL. I'd like to see examples of this- most of
the time, the OS should know that the thread is contending on the lock
object and it is skipped over. Granted, a thread may wake up just to
release the GIL shortly thereafter, but that's why
sys.setcheckinterval() is there for us to tinker with.
Anyway, enough of my $0.02.
- Phillip
Phillip Sitbon skrev:
Some of this is more low-level. I did see higher performance when using non-Event objects, although I have not had time to follow up and do a deeper analysis. The GIL flashing "problem" with critical sections can very likely be rectified with a call to Sleep(0) or YieldProcessor() for those who are worried about it.

For those who don't know what Sleep(0) on Windows does: it returns the remainder of the current time-slice back to the system if a thread with equal or higher priority is ready to run. Otherwise it does nothing.
GIL flashing is a serious issue if it happens often; with the current event-based GIL on Windows, it never happens (61 cases of GIL flash in 100,000 periodic checks is as good as never). S.M.
participants (14)
-
Antoine Pitrou
-
Curt Hagenlocher
-
John Arbash Meinel
-
John Arbash Meinel
-
Kristján Valur Jónsson
-
M.-A. Lemburg
-
Mark Hammond
-
Mark Hammond
-
Nick Coghlan
-
Phillip Sitbon
-
Robert Collins
-
Scott Dial
-
Sturla Molden
-
Tres Seaver