Hello,

After reading through recent Python mail regarding dictionaries and exceptions, I wondered: what is the current state of the art in Python benchmarks? I've tried before to find a definitive set of Python benchmarks, but failed. There doesn't seem to be an up-to-date reference, though if there is one and I didn't find it, I would be very happy to be proven wrong. In any case, I would appreciate advice from the experts regarding what's available and what it tests.

thank you,
Niko Matsakis
(This is more appropriate for comp.lang.python/python-list@python.org.)

Niko> After reading through recent Python mail regarding dictionaries
Niko> and exceptions, I wondered, what is the current state of the art
Niko> in Python benchmarks?

Pybench was recently added to the repository and will be in 2.5. It works with Python as far back as 1.5.2, though. As a result of last week's NeedForSpeed sprint, some questions were raised about the efficacy of its string/unicode tests; however, it seems to be the best available tool at the moment.

Skip
skip@pobox.com wrote:
(This is more appropriate for comp.lang.python/python-list@python.org.)
Niko> After reading through recent Python mail regarding dictionaries
Niko> and exceptions, I wondered, what is the current state of the art
Niko> in Python benchmarks?
Pybench was recently added to the repository and will be in 2.5. It works with Python as far back as 1.5.2, though. As a result of last week's NeedForSpeed sprint, some questions were raised about the efficacy of its string/unicode tests; however, it seems to be the best available tool at the moment.
Could you please forward such questions to me? Steve Holden has added lots of good features to pybench during the sprint, and I'm working on having it use more accurate timers (see pybench/systimes.py).

Thanks,
-- 
Marc-Andre Lemburg
MAL> Could you please forward such questions to me?

I suppose, though what question were you referring to?

I was referring to Fredrik's thread about stringbench vs pybench for string/unicode tests, which I thought was posted to python-dev. I assumed you were aware of the issue.

Skip
skip@pobox.com wrote:
MAL> Could you please forward such questions to me ?
I suppose, though what question were you referring to?
Not sure - I thought you knew ;-)
I was referring to Fredrik's thread about stringbench vs pybench for string/unicode tests, which I thought was posted to python-dev. I assumed you were aware of the issue.
I'm aware of that thread, but Fredrik only posted some vague comment to the checkins list, saying that they couldn't use pybench. I asked for some more details, but he didn't get back to me.

AFAIK, there were no real issues with pybench, only with the fact that time.clock() (the timer used by pybench) is wall-time on Windows, and thus an MP3 player running in the background will cause some serious noise in the measurements ;-)

-- 
Marc-Andre Lemburg
MAL> I'm aware of that thread, but Fredrik only posted some vague
MAL> comment to the checkins list, saying that they couldn't use
MAL> pybench. I asked for some more details, but he didn't get back to
MAL> me.

I'm pretty sure I saw him (or maybe Andrew Dalke) post some timing comparisons between pybench and stringbench. Something about a change not impacting performance showing a 60% slowdown on pybench but no change using stringbench. Maybe Fredrik had his iTunes volume cranked up too high... ;-)

Skip
skip@pobox.com wrote:
MAL> I'm aware of that thread, but Fredrik only posted some vague
MAL> comment to the checkins list, saying that they couldn't use
MAL> pybench. I asked for some more details, but he didn't get back to
MAL> me.
I'm pretty sure I saw him (or maybe Andrew Dalke) post some timing comparisons between pybench and stringbench. Something about a change not impacting performance showing a 60% slowdown on pybench but no change using stringbench. Maybe Fredrik had his iTunes volume cranked up too high... ;-)
This is possible, since pybench uses a long-run strategy, whereas stringbench uses the timeit.py short-run method.

-- 
Marc-Andre Lemburg
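A minimal sketch of the two strategies being contrasted here - the workload and iteration counts are made up, and this is not either tool's actual code:

    import time, timeit

    # timeit-style "short run" (stringbench): several short measurements,
    # keep the best (least-disturbed) one
    best = min(timeit.Timer("'-'.join([str(i) for i in range(100)])")
               .repeat(repeat=5, number=10000)) / 10000.0

    # pybench-style "long run": one long timed loop, averaged afterwards
    t0 = time.clock()
    for i in range(500000):
        '-'.join([str(j) for j in range(100)])
    per_op = (time.clock() - t0) / 500000.0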
M.-A. Lemburg wrote:
AFAIK, there were no real issues with pybench, only with the fact that time.clock() (the timer used by pybench) is wall-time on Windows and thus an MP3-player running in the background will cause some serious noise in the measurements
oh, please; as I mentioned back then, PyBench reported massive slowdowns and huge speedups in code that wasn't touched, gave unrepeatable results on most platforms, and caused us to waste quite some time investigating potential regressions from 2.4 that simply didn't exist.

of about a dozen claimed slowdowns when comparing 2.4 to 2.5a2 on several platforms, only *one* slowdown could be independently confirmed with other tools. and when we fixed that, and ended up with an implementation that was *faster* than in 2.4, PyBench didn't even notice the speedup.

the fact is that the results for individual tests in PyBench are 100% unreliable. I have no idea why. the accumulated result may be somewhat useful (at least after the "use minimum time instead of average" changes), but I wouldn't use it for any serious purpose.

at least PyStone is unusable in a well-defined way ;-)

</F>
Fredrik Lundh wrote:
M.-A. Lemburg wrote:
AFAIK, there were no real issues with pybench, only with the fact that time.clock() (the timer used by pybench) is wall-time on Windows and thus an MP3-player running in the background will cause some serious noise in the measurements
oh, please; as I mentioned back then, PyBench reported massive slowdowns and huge speedups in code that wasn't touched, gave unrepeatable results on most platforms, and caused us to waste quite some time investigating potential regressions from 2.4 that simply didn't exist.
It would be nice if you could post examples of these results and also the details of the platforms on which you tested.
of about a dozen claimed slowdowns when comparing 2.4 to 2.5a2 on several platforms, only *one* slowdown could be independently confirmed with other tools.
and when we fixed that, and ended up with an implementation that was *faster* than in 2.4, PyBench didn't even notice the speedup.
Which one was that? With which parameters did you run pybench?
the fact is that the results for individual tests in PyBench are 100% unreliable. I have no idea why.
If they are 100% unreliable, then perhaps we should simply reverse the claims pybench makes ;-) (this would be like a 100% wrong weather report).

Seriously, I've been using and running pybench for years, and even though tweaks to the interpreter do sometimes result in speedups or slowdowns where you wouldn't expect them (due to the interpreter using the Python objects), they are reproducible, and often enough they have uncovered that optimizations in one area may well result in slowdowns in other areas.

Often enough the results are related to low-level features of the architecture you're using to run the code, such as cache size, cache lines, number of registers in the CPU or on the FPU stack, etc.

E.g. some years ago, I played around with the ceval loop, ordered the switch statements in different ways, broke the switch in two parts, etc. The results were impressive - but mostly due to the switch statement being optimized for the platform I was running (AMD at the time). Moving a switch option sometimes had a huge effect on the timings, most of which was due to the code being more suitably aligned for the CPU cache.
the accumulated result may be somewhat useful (at least after the "use minimum time instead of average" changes), but I wouldn't use it for any serious purpose. at least PyStone is unusable in a well-defined way ;-)
I think that most of your experience is related to the way timing is done on Windows. Like I said: I'm working on better timers for use in pybench. Hopefully, this will make your experience a better one next time you try.

-- 
Marc-Andre Lemburg
M.-A. Lemburg wrote:
Seriously, I've been using and running pybench for years and even though tweaks to the interpreter do sometimes result in speedups or slow-downs where you wouldn't expect them (due to the interpreter using the Python objects), they are reproducible and often enough have uncovered that optimizations in one area may well result in slow-downs in other areas.
Often enough the results are related to low-level features of the architecture you're using to run the code such as cache size, cache lines, number of registers in the CPU or on the FPU stack, etc. etc.
and that observation has never made you stop and think about whether there might be some problem with the benchmarking approach you're using? after all, if a change to e.g. the try/except code slows things down or speeds things up, is it really reasonable to expect that the time it takes to convert Unicode strings to uppercase should suddenly change due to cache effects or a changing number of registers in the CPU? real hardware doesn't work that way...

is PyBench perhaps using the following approach:

    T = set of tests
    for N in range(number of test runs):
        for t in T:
            t0 = get_process_time()
            t()
            t1 = get_process_time()
            assign t1 - t0 to test t
    print assigned time

where t1 - t0 is very short?

that's not a very good idea, given how get_process_time tends to be implemented on current-era systems (google for "jiffies")... but it definitely explains the bogus subtest results I'm seeing, and the "magic hardware" behaviour you're seeing.

</F>
Fredrik Lundh wrote:
M.-A. Lemburg wrote:
Seriously, I've been using and running pybench for years and even though tweaks to the interpreter do sometimes result in speedups or slow-downs where you wouldn't expect them (due to the interpreter using the Python objects), they are reproducible and often enough have uncovered that optimizations in one area may well result in slow-downs in other areas.
Often enough the results are related to low-level features of the architecture you're using to run the code such as cache size, cache lines, number of registers in the CPU or on the FPU stack, etc. etc.
and that observation has never made you stop and think about whether there might be some problem with the benchmarking approach you're using?
The approach pybench is using is as follows:

* Run a calibration step which does the same as the actual test without the operation being tested (i.e. call the function running the test, set up the for-loop, constant variables, etc.). The calibration step is run multiple times and is used to calculate an average test overhead time.

* Run the actual test which runs the operation multiple times. The test is then adjusted to make sure that the test overhead / test run ratio remains within reasonable bounds. If needed, the operation code is repeated verbatim in the for-loop, to decrease the ratio.

* Repeat the above for each test in the suite.

* Repeat the suite N number of rounds.

* Calculate the average run time of all test runs in all rounds.
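A minimal sketch of the calibrate-then-measure strategy described above (hypothetical names and toy iteration counts - not pybench's actual code):

    import time

    def calibrate(repeat=20):
        # time the empty harness (function call plus bare loop) to
        # estimate the per-run overhead; pybench originally averaged
        # these samples, post-sprint versions use the minimum
        samples = []
        for i in range(repeat):
            t0 = time.clock()
            for j in range(10000):
                pass
            samples.append(time.clock() - t0)
        return min(samples)

    def run_test(op, rounds=10):
        overhead = calibrate()
        results = []
        for i in range(rounds):
            t0 = time.clock()
            for j in range(10000):
                op()
            results.append(time.clock() - t0 - overhead)
        return results   # aggregated across all rounds afterwards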
after all, if a change to e.g. the try/except code slows things down or speeds things up, is it really reasonable to expect that the time it takes to convert Unicode strings to uppercase should suddenly change due to cache effects or a changing number of registers in the CPU? real hardware doesn't work that way...
Of course, but then changes to try-except logic can interfere with the performance of setting up method calls. This is what pybench then uncovers.

The only problem I see in the above approach is the way calibration is done: the run-time of the calibration code may be too small relative to the resolution of the timers used.

Again, please provide the parameters you've used to run the test case, and the output. Things like warp factor, overhead, etc. could hint at the problem you're seeing.
is PyBench perhaps using the following approach:
    T = set of tests
    for N in range(number of test runs):
        for t in T:
            t0 = get_process_time()
            t()
            t1 = get_process_time()
            assign t1 - t0 to test t
    print assigned time
where t1 - t0 is very short?
See above (or the code in pybench.py). t1-t0 is usually around 20-50 seconds:

"""
The tests must set .rounds to a value high enough to let the test run between 20-50 seconds. This is needed because clock()-timing only gives rather inaccurate values (on Linux, for example, it is accurate to a few hundredths of a second). If you don't want to wait that long, use a warp factor larger than 1.
"""
that's not a very good idea, given how get_process_time tends to be implemented on current-era systems (google for "jiffies")... but it definitely explains the bogus subtest results I'm seeing, and the "magic hardware" behaviour you're seeing.
That's exactly the reason why tests run for a relatively long time - to minimize these effects.

Of course, using wall time makes this approach vulnerable to other effects, such as the current load of the system, other processes having a higher priority interfering with the timed process, etc.

For this reason, I'm currently looking for ways to measure the process time on Windows.

-- 
Marc-Andre Lemburg
M.-A. Lemburg wrote:
Of course, but then changes to try-except logic can interfere with the performance of setting up method calls. This is what pybench then uncovers.
I think the only thing PyBench has uncovered is that you're convinced that it's always right, and everybody else is always wrong, including people who've spent decades measuring performance, and the hardware in your own computer.
See above (or the code in pybench.py). t1-t0 is usually around 20-50 seconds:
what machines are you using? using the default parameters, the entire run takes about 50 seconds on the slowest machine I could find...
that's not a very good idea, given how get_process_time tends to be implemented on current-era systems (google for "jiffies")... but it definitely explains the bogus subtest results I'm seeing, and the "magic hardware" behaviour you're seeing.
That's exactly the reason why tests run for a relatively long time - to minimize these effects. Of course, using wall time makes this approach vulnerable to other effects such as current load of the system, other processes having a higher priority interfering with the timed process, etc.
since process time is *sampled*, not measured, process time isn't exactly invulnerable either. it's not hard to imagine scenarios where you end up being assigned only a small part of the process time you're actually using, or cases where you're assigned more time than you've had a chance to use.

afaik, if you want true performance counters on Linux, you need to patch the operating system (unless something's changed in very recent versions).

I don't think that sampling errors can explain all the anomalies we've been seeing, but I wouldn't be surprised if a high-resolution wall-time clock on a lightly loaded multiprocess system was, in practice, *more* reliable than sampled process time on an equally loaded system.

</F>
Fredrik Lundh wrote:
M.-A. Lemburg wrote:
Of course, but then changes to try-except logic can interfere with the performance of setting up method calls. This is what pybench then uncovers.
I think the only thing PyBench has uncovered is that you're convinced that it's always right, and everybody else is always wrong, including people who've spent decades measuring performance, and the hardware in your own computer.
Oh, come on. You know that's not true, and I'm trying to understand what is causing your findings, but this is difficult, since you're not providing enough details. E.g. the output of pybench showing the timing results would help a lot.

I would also like to reproduce your findings. Do you have two revision numbers in svn which I could use for this?
See above (or the code in pybench.py). t1-t0 is usually around 20-50 seconds:
what machines are you using? using the default parameters, the entire run takes about 50 seconds on the slowest machine I could find...
If the whole suite runs in 50 seconds, the per-test run-times are far too small to be accurate. I usually adjust the warp factor so that each *round* takes 50 seconds.

Looks like I have to revisit the default parameters and update the doc-strings. I'll do that when I add the new timers.

Could you check whether you still see the same results when running with "pybench.py -w 1"?
that's not a very good idea, given how get_process_time tends to be implemented on current-era systems (google for "jiffies")... but it definitely explains the bogus subtest results I'm seeing, and the "magic hardware" behaviour you're seeing.

That's exactly the reason why tests run for a relatively long time - to minimize these effects. Of course, using wall time makes this approach vulnerable to other effects such as current load of the system, other processes having a higher priority interfering with the timed process, etc.
since process time is *sampled*, not measured, process time isn't exactly invulnerable either. it's not hard to imagine scenarios where you end up being assigned only a small part of the process time you're actually using, or cases where you're assigned more time than you've had a chance to use.
afaik, if you want true performance counters on Linux, you need to patch the operating system (unless something's changed in very recent versions).
I don't think that sampling errors can explain all the anomalies we've been seeing, but I wouldn't be surprised if a high-resolution wall-time clock on a lightly loaded multiprocess system was, in practice, *more* reliable than sampled process time on an equally loaded system.
That's why the timers being used by pybench will become a parameter that you can then select, to adapt pybench to the OS you're running it on.

Note that time.clock, the current default timer in pybench, is a high-accuracy wall-clock timer on Windows, so it should demonstrate similar behavior to timeit.py - even more so since you're using warp 20, and thus a similar timing strategy to that of timeit.py.

I suspect that the calibration step is causing problems. Steve added a parameter to change the number of calibration runs done per test: -C n. The default is 20.

-- 
Marc-Andre Lemburg
M.-A. Lemburg:
The approach pybench is using is as follows: ... The calibration step is run multiple times and is used to calculate an average test overhead time.
One of the changes that occurred during the sprint was to change this algorithm to use the best time rather than the average. Using the average assumes a Gaussian distribution. Timing results are not. There is an absolute best, but that's rarely reached due to background noise. It's more like a gamma distribution plus the minimum time.

To show the distribution is non-Gaussian, I ran the following:

import time

def compute():
    x = 0
    for i in range(1000):
        for j in range(1000):
            x += 1

def bench():
    t1 = time.time()
    compute()
    t2 = time.time()
    return t2 - t1

times = []
for i in range(1000):
    times.append(bench())

print times

The full distribution is attached as 'plot1.png' and the close up (range 0.45-0.65) as 'plot2.png'. Not a clean gamma function, but that's a closer match than an exponential.

The gamma distribution looks more like an exponential function when the shape parameter is large. This corresponds to a large amount of noise in the system, so the run time is not close to the best time. This means the average approach works better when there is a lot of random background activity - which is not the usual case when I try to benchmark. When averaging a gamma distribution you'll end up with a bit of a skew, and I think the skew depends on the number of samples, reaching a limit point.

Using the minimum time should be more precise, because there is a definite lower bound and the machine should be stable. In my test above the first few results are

0.472838878632
0.473038911819
0.473326921463
0.473494052887
0.473829984665

I'm pretty certain the best time is 0.4725, or very close to that. But the average time is 0.58330151391 because of the long tail. Here are the last 6 results in my population of 1000:

1.76353311539
1.79937505722
1.82750201225
2.01710510254
2.44861507416
2.90868496895

Granted, I hit a couple of web pages while doing this, and my spam filter processed my mailbox in the background...

There's probably some Markov modeling which would look at the number and distribution of samples so far and, assuming a gamma distribution, determine how many more samples are needed to get a good estimate of the absolute minimum time. But min(large enough samples) should work fine.
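The effect of the long tail is easy to see from the collected samples (continuing the times list from the script above):

    times.sort()
    print 'best: ', times[0]                  # ~0.4725, the stable lower bound
    print 'mean: ', sum(times) / len(times)   # ~0.5833, inflated by the tail
    print 'worst:', times[-1]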
If the whole suite runs in 50 seconds, the per-test run-times are far too small to be accurate. I usually adjust the warp factor so that each *round* takes 50 seconds.
The stringbench.py I wrote uses the timeit algorithm, which dynamically adjusts the test to run between 0.2 and 2 seconds.
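Roughly how that auto-ranging works - a sketch of the approach, not stringbench's actual code:

    import timeit

    t = timeit.Timer("'-'.join([str(i) for i in range(100)])")
    number = 1
    while t.timeit(number) < 0.2:
        # the run is too short to trust the timer; scale it up
        number *= 10
    best = min(t.repeat(repeat=3, number=number)) / float(number)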
That's why the timers being used by pybench will become a parameter that you can then select to adapt pybench to the OS you're running it on.
Wasn't that decision a consequence of the problems found during the sprint?

Andrew
dalke@dalkescientific.com
Andrew Dalke wrote:
M.-A. Lemburg:
The approach pybench is using is as follows: ... The calibration step is run multiple times and is used to calculate an average test overhead time.
One of the changes that occurred during the sprint was to change this algorithm to use the best time rather than the average. Using the average assumes a Gaussian distribution. Timing results are not. There is an absolute best but that's rarely reached due to background noise. It's more like a gamma distribution plus the minimum time.
To show the distribution is non-Gaussian I ran the following
def compute():
    x = 0
    for i in range(1000):
        for j in range(1000):
            x += 1

def bench():
    t1 = time.time()
    compute()
    t2 = time.time()
    return t2 - t1

times = []
for i in range(1000):
    times.append(bench())
print times
The full distribution is attached as 'plot1.png' and the close up (range 0.45-0.65) as 'plot2.png'. Not a clean gamma function, but that's a closer match than an exponential.
The gamma distribution looks more like an exponential function when the shape parameter is large. This corresponds to a large amount of noise in the system, so the run time is not close to the best time. This means the average approach works better when there is a lot of random background activity, which is not the usual case when I try to benchmark.
When averaging a gamma distribution you'll end up with a bit of a skew, and I think the skew depends on the number of samples, reaching a limit point.
Using the minimum time should be more precise because there is a definite lower bound and the machine should be stable. In my test above the first few results are
0.472838878632
0.473038911819
0.473326921463
0.473494052887
0.473829984665
I'm pretty certain the best time is 0.4725, or very close to that. But the average time is 0.58330151391 because of the long tail. Here are the last 6 results in my population of 1000
1.76353311539
1.79937505722
1.82750201225
2.01710510254
2.44861507416
2.90868496895
Granted, I hit a couple of web pages while doing this and my spam filter processed my mailbox in the background...
There's probably some Markov modeling which would look at the number and distribution of samples so far and, assuming a gamma distribution, determine how many more samples are needed to get a good estimate of the absolute minimum time. But min(large enough samples) should work fine.
Thanks for the great analysis!

Using the minimum looks like the way to go for calibration. I wonder whether the same is true for the actual tests; since you're looking for the expected run-time, the minimum may not necessarily be the right choice. Then again, in both cases you are only looking at a small number of samples (20 for the calibration, 10 for the number of rounds), so this may be irrelevant.

BTW, did you run this test on Windows or a Unix machine?

There's also an interesting second peak at around 0.53. What could be causing this?
If the whole suite runs in 50 seconds, the per-test run-times are far too small to be accurate. I usually adjust the warp factor so that each *round* takes 50 seconds.
The stringbench.py I wrote uses the timeit algorithm which dynamically adjusts the test to run between 0.2 and 2 seconds.
That's why the timers being used by pybench will become a parameter that you can then select to adapt pybench to the OS you're running it on.
Wasn't that decision a consequence of the problems found during the sprint?
It's a consequence of a discussion I had with Steve Holden and Tim Peters:

I believe that using wall-clock timers for benchmarking is not a good approach, due to the high noise level. Process-time timers typically have a lower resolution, but give a better picture of the actual run-time of your code, and also don't exhibit as much noise as the wall-clock approach. Of course, you have to run the tests somewhat longer to get reasonable accuracy in the timings.

Tim thinks that it's better to use short-running tests and an accurate timer, accepting the added noise and counting on the user to make sure that the noise level is at a minimum.

Since I like to give users the option of choosing for themselves, I'm going to make the choice of timer an option.

-- 
Marc-Andre Lemburg
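A sketch of what such a timer option might look like (hypothetical names, not the systimes.py API; the platform behavior is as discussed in this thread):

    import os, time

    def pick_timer(prefer_process_time=False):
        # On Windows, time.clock() is a high-resolution wall-clock timer;
        # on Unix, time.clock() is CPU time in coarse (typically 10ms)
        # steps, while time.time() is the high-resolution wall clock.
        if os.name == 'nt':
            return time.clock
        if prefer_process_time:
            return time.clock
        return time.time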
M.-A. Lemburg wrote:
I believe that using wall-clock timers for benchmarking is not a good approach due to the high noise level. Process time timers typically have a lower resolution, but give a better picture of the actual run-time of your code and also don't exhibit as much noise as the wall-clock timer approach.
please stop repeating this nonsense. there are no "process time timers" in contemporary operating systems, only tick counters. there are patches for Linux, and commercial add-ons for most platforms, that let you use hardware performance counters for process stuff, but there's no way to emulate that by playing with the different existing Unix or Win32 APIs; the thing you think you're using simply isn't there.

</F>
M.-A. Lemburg wrote:
That's why the timers being used by pybench will become a parameter that you can then select to adapt pybench to the OS you're running it on.

Wasn't that decision a consequence of the problems found during the sprint?
It's a consequence of a discussion I had with Steve Holden and Tim Peters:
I believe that using wall-clock timers for benchmarking is not a good approach due to the high noise level. Process time timers typically have a lower resolution, but give a better picture of the actual run-time of your code and also don't exhibit as much noise as the wall-clock timer approach. Of course, you have to run the tests somewhat longer to get reasonable accuracy of the timings.
Tim thinks that it's better to use short running tests and an accurate timer, accepting the added noise and counting on the user making sure that the noise level is at a minimum.
I just had an idea: if we could get each test to run inside a single time slice assigned by the OS scheduler, then we could benefit from the better resolution of the hardware timers while still keeping the noise to a minimum.

I suppose this could be achieved by:

* making sure that each test needs less than 10ms to run

* calling time.sleep(0) after each test run

Here's some documentation on the Linux scheduler:

http://www.samspublishing.com/articles/article.asp?p=101760&seqNum=2&rl=1

Table 3.1 has the minimum time slice: 10ms.

What do you think? Would this work?

-- 
Marc-Andre Lemburg
M.-A. Lemburg wrote:
I just had an idea: if we could get each test to run inside a single time slice assigned by the OS scheduler, then we could benefit from the better resolution of the hardware timers while still keeping the noise to a minimum.
I suppose this could be achieved by:
* making sure that each test needs less than 10ms to run
iirc, very recent linux kernels have a 1 millisecond tick. so do alphas, and probably some other platforms.
* calling time.sleep(0) after each test run
so some higher-priority process can get a chance to run, and spend 9.5 milliseconds shuffling data to a slow I/O device before blocking? ;-)

I'm not sure this problem can be solved, really - at least not as long as you're constrained to portable APIs.

(talking of which, if someone has some time and a Linux box to spare, and wants to do some serious hacking on precision benchmarks, using http://user.it.uu.se/~mikpe/linux/perfctr/2.6/ to play with the TSC might be somewhat interesting.)

</F>
Fredrik Lundh wrote:
M.-A. Lemburg wrote:
I just had an idea: if we could get each test to run inside a single time slice assigned by the OS scheduler, then we could benefit from the better resolution of the hardware timers while still keeping the noise to a minimum.
I suppose this could be achieved by:
* making sure that each test needs less than 10ms to run
iirc, very recent linux kernels have a 1 millisecond tick. so do alphas, and probably some other platforms.
Indeed - that's also what the microbench.py example that I posted demonstrates. And, of course, you have to call time.sleep() *before* running the test (which microbench.py does).

-- 
Marc-Andre Lemburg
M.-A. Lemburg wrote:
That's why the timers being used by pybench will become a parameter that you can then select to adapt pybench to the OS you're running it on.

Wasn't that decision a consequence of the problems found during the sprint?

It's a consequence of a discussion I had with Steve Holden and Tim Peters:
I believe that using wall-clock timers for benchmarking is not a good approach due to the high noise level. Process time timers typically have a lower resolution, but give a better picture of the actual run-time of your code and also don't exhibit as much noise as the wall-clock timer approach. Of course, you have to run the tests somewhat longer to get reasonable accuracy of the timings.
Tim thinks that it's better to use short running tests and an accurate timer, accepting the added noise and counting on the user making sure that the noise level is at a minimum.
I just had an idea: if we could get each test to run inside a single time slice assigned by the OS scheduler, then we could benefit from the better resolution of the hardware timers while still keeping the noise to a minimum.
I suppose this could be achieved by:
* making sure that each test needs less than 10ms to run
* calling time.sleep(0) after each test run
Here's some documentation on the Linux scheduler:
http://www.samspublishing.com/articles/article.asp?p=101760&seqNum=2&rl=1
Table 3.1 has the minimum time slice: 10ms.
What do you think ? Would this work ?
I ran some tests related to this, and it appears that, provided the test itself uses less than 1ms, chances are high that you don't get any forced context switches in your way while running the test.

It also appears that you have to use time.sleep(10e-6) to get the desired behavior. time.sleep(0) seems to receive some extra care, so it doesn't have the intended effect - at least not on Linux. I've checked this on AMD64 and Intel Pentium M.

The script is attached - it will run until you get more than 10 forced context switches in 100 runs of the test, incrementing the run-time of the test in each round.

It's also interesting that the difference between max and min run-time of the tests can be as low as 0.2% on the Pentium, whereas the AMD64 always stays around 4-5%. On an old AMD Athlon, the difference rarely goes below 50% - this might also have to do with the kernel version running on that machine, which is 2.4, whereas the AMD64 and Pentium M are running 2.6.

Note that it needs the resource module, so it won't work on Windows.

It's interesting that even pressing a key on your keyboard will cause forced context switches.

-- 
Marc-Andre Lemburg
import resource, time

def workload(rounds):
    x = 0
    for i in range(rounds):
        x = x + 1

def microbench():

    print 'Microbench'
    sleeptime = 10e-6
    sleep = time.sleep
    timer = time.time
    rtype = resource.RUSAGE_SELF
    rounds = 100
    while 1:
        times = []
        rstart = resource.getrusage(rtype)
        for i in range(100):
            # Make sure the test is run at the start of a scheduling
            # time slice
            sleep(sleeptime)
            # Test
            start = timer()
            workload(rounds)
            stop = timer()
            times.append(stop - start)
        rstop = resource.getrusage(rtype)
        volswitches = rstop[-2] - rstart[-2]
        forcedswitches = rstop[-1] - rstart[-1]
        min_time = min(times)
        max_time = max(times)
        diff = max_time - min_time
        if forcedswitches == 0:
            print 'Rounds: %i' % rounds
            print ' min time: %f seconds' % min_time
            print ' max time: %f seconds' % max_time
            print ' diff: %f %% = %f seconds' % (diff / min_time * 100.0, diff)
            print ' context switches: %r %r' % (volswitches, forcedswitches)
            print
        elif forcedswitches > 10:
            break
        rounds += 100

microbench()
On 6/2/06, M.-A. Lemburg wrote:
It's interesting that even pressing a key on your keyboard will cause forced context switches.
When niceness was first added to multiprocessing OSes, people found their CPU-intensive jobs would go faster by pressing enter a lot.

Andrew
dalke@dalkescientific.com
"M.-A. Lemburg"
Granted, I hit a couple of web pages while doing this and my spam filter processed my mailbox in the background...
Hardly a setting in which to run comparison tests, seems to me.
Using the minimum looks like the way to go for calibration.
Or possibly the median.

But even better, the way to go to run comparison timings is to use a system with as little other stuff going on as possible. For Windows, this means rebooting in safe mode, waiting until the system is quiescent, and then running the timing test with *nothing* else active that can be avoided.

Even then, I would look at the distribution of times for a given test to check for anomalously high values that should be tossed. (This can be automated somewhat; see the sketch below.)

Terry Jan Reedy
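One way such tossing could be automated - a sketch with an arbitrary cutoff, not anything proposed concretely in this thread:

    def toss_outliers(times, factor=1.5):
        # keep only samples within some factor of the best observed time;
        # everything above that is treated as background-noise overhead
        best = min(times)
        return [t for t in times if t <= factor * best]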
Terry Reedy wrote:
But even better, the way to go to run comparison timings is to use a system with as little other stuff going on as possible. For Windows, this means rebooting in safe mode, waiting until the system is quiescent, and then run the timing test with *nothing* else active that can be avoided.
sigh. </F>
On 6/2/06, Terry Reedy wrote:
Hardly a setting in which to run comparison tests, seems to me.
The point though was to show that the time distribution is non-Gaussian, so intuition based on that doesn't help.
Using the minimum looks like the way to go for calibration.
Or possibly the median.
Why? I can't think of why that's more useful than the minimum time. Given a large number of samples, the difference between the minimum and the median/average/whatever mostly provides information about the background noise, which is pretty irrelevant to most benchmarks.
But even better, the way to go to run comparison timings is to use a system with as little other stuff going on as possible. For Windows, this means rebooting in safe mode, waiting until the system is quiescent, and then run the timing test with *nothing* else active that can be avoided.
A reason I program in Python is because I want to get work done and not deal with stoic purity. I'm not going to waste all that time (or money to buy a new machine) just to run a benchmark.

Just how much more accurate would that be than the numbers we get now? Have you tried it? What additional sensitivity did you get, and was the extra effort worthwhile?
Even then, I would look at the distribution of times for a given test to check for anomalously high values that should be tossed. (This can be automated somewhat.)
I say it can be automated completely: toss all but the lowest. It's the one with the least noise overhead.

I think fitting the smaller data points to a gamma distribution might yield better (more reproducible and useful) numbers, but I know my stats ability is woefully decayed, so I'm not going to try. My observation is that the shape factor is usually small, so in a few dozen to a hundred samples there's a decent chance of getting a time with minimal noise overhead.

Andrew
dalke@dalkescientific.com
[MAL]
Using the minimum looks like the way to go for calibration.
[Terry Reedy]
Or possibly the median.
[Andrew Dalke]
Why? I can't think of why that's more useful than the minimum time.
A lot of things get mixed up here ;-) The _mean_ is actually useful if you're using a poor-resolution timer with a fast test. For example, suppose a test takes 1/10th the time of the span between counter ticks. Then, "on average", in 9 runs out of 10 the reported elapsed time is 0 ticks, and in 1 run out of 10 the reported time is 1 tick. 0 and 1 are both wrong, but the mean (1/10) is correct. So there _can_ be sense to that.

Then people vaguely recall that the median is more robust than the mean, and all sense goes out the window ;-)

My answer is to use the timer with the best resolution the machine has. Using the mean is a way to worm around timer quantization artifacts, but it's easier and clearer to use a timer with resolution so fine that quantization doesn't make a lick of real difference. Forcing a test to run for a long time is another way to make timer quantization irrelevant, but then you're also vastly increasing chances for other processes to disturb what you're testing.

I liked benchmarking on Crays in the good old days. No time-sharing, no virtual memory, and the OS believed to its core that its primary purpose was to set the base address once at the start of a job so the Fortran code could scream. Test times were reproducible to the nanosecond with no effort. Running on a modern box for a few microseconds at a time is a way to approximate that, provided you measure the minimum time with a high-resolution timer :-)
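Tim's 1/10th-of-a-tick example is easy to simulate (a sketch; only the 0.1 figure comes from his example, everything else is made up):

    import random

    TRUE_COST = 0.1   # test duration, in units of one timer tick
    samples = []
    for i in range(100000):
        phase = random.random()   # where within a tick the test starts
        # the timer reports 1 tick if a tick boundary falls inside the
        # test interval, otherwise 0 ticks
        if phase + TRUE_COST >= 1.0:
            samples.append(1)
        else:
            samples.append(0)

    print sum(samples) / float(len(samples))   # mean: ~0.1, the true cost
    print min(samples)                         # minimum: 0, useless here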
On Fri, Jun 02, 2006 at 07:44:07PM -0400, Tim Peters wrote:
Fortran code could scream. Test times were reproducible to the nanosecond with no effort. Running on a modern box for a few microseconds at a time is a way to approximate that, provided you measure the minimum time with a high-resolution timer :-)
On Linux with a multi-CPU machine, you could probably boot up the system to use N-1 CPUs, and then start the Python process on CPU N. That should avoid the process being interrupted by other processes, though I guess there would still be some noise from memory bus and kernel lock contention.

(At work we're trying to move toward this approach for doing realtime audio: devote one CPU to the audio computation and use other CPUs for I/O, web servers, and whatnot.)

--amk
A.M. Kuchling wrote:
(At work we're trying to move toward this approach for doing realtime audio: devote one CPU to the audio computation and use other CPUs for I/O, web servers, and whatnot.)
Speaking of creative uses for multiple CPUs, I was thinking about dual-core Intel Macs the other day, and I wondered whether it would be possible to configure it so that one core was running MacOSX and the other was running Windows at the same time. It would give the term "dual booting" a whole new meaning...

-- 
Greg
Tim Peters wrote:
I liked benchmarking on Crays in the good old days. ... Test times were reproducible to the nanosecond with no effort. Running on a modern box for a few microseconds at a time is a way to approximate that, provided you measure the minimum time with a high-resolution timer :-)
Obviously what we need here is a stand-alone Python interpreter that runs on the bare machine, so there's no pesky operating system around to mess up our times.

-- 
Greg
Greg Ewing wrote:
Tim Peters wrote:
I liked benchmarking on Crays in the good old days. ... Test times were reproducible to the nanosecond with no effort. Running on a modern box for a few microseconds at a time is a way to approximate that, provided you measure the minimum time with a high-resolution timer :-)
Obviously what we need here is a stand-alone Python interpreter that runs on the bare machine, so there's no pesky operating system around to mess up our times.
An early version of unununium would do that (I don't know if much progress has been made since I last checked their site).

- Josiah
Greg Ewing wrote:
Tim Peters wrote:
I liked benchmarking on Crays in the good old days. ... Test times were reproducible to the nanosecond with no effort. Running on a modern box for a few microseconds at a time is a way to approximate that, provided you measure the minimum time with a high-resolution timer :-)
Obviously what we need here is a stand-alone Python interpreter that runs on the bare machine, so there's no pesky operating system around to mess up our times.
I'm sure we can write a PyPy backend that targets Open Firmware :)

Cheers,
mwh

-- 
<exarkun> speak of the devil
<moshez> exarkun: froor
<exarkun> not you
  -- from Twisted.Quotes
Tim:
A lot of things get mixed up here ;-) The _mean_ is actually useful if you're using a poor-resolution timer with a fast test.
In which case discrete probability distributions are better than my assumption of a continuous distribution.

I looked at the distribution of times for 1,000 repeats of

t1 = time.time()
t2 = time.time()
times.append(t2 - t1)

The times and counts I found were:

9.53674316406e-07   388
1.19209289551e-06    95
1.90734863281e-06   312
2.14576721191e-06   201
2.86102294922e-06     2
1.90734863281e-05     1
3.00407409668e-05     1

This implies my Mac's time.time() has a resolution of 2.3841857910000015e-07 s (0.2µs, or about 4.2MHz), or possibly a small integer fraction thereof. The timer overhead takes between 4 and 9 ticks.

Ignoring the outliers, and assuming I have the CPU all to my benchmark for the time slice, I expect about +/- 3 ticks of noise per test. To measure a 1% speedup reliably I'll need to run, what, 300-600 ticks? That's a millisecond, and with a time quantum of 10 ms there's a 1 in 10 chance that I'll incur that overhead. In other words, I don't think my high-resolution timer is high enough.

Got a spare Cray I can use, and will you pay for the power bill?

Andrew
dalke@dalkescientific.com
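The same probe, wrapped up as a reusable snippet (a sketch generalizing the measurement above; the names are made up):

    import time

    def timer_granularity(timer=time.time, samples=100000):
        # the smallest nonzero difference between two consecutive timer
        # readings approximates the timer's resolution plus call overhead
        smallest = None
        for i in range(samples):
            t1 = timer()
            t2 = timer()
            d = t2 - t1
            if d > 0 and (smallest is None or d < smallest):
                smallest = d
        return smallest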
Fredrik Lundh wrote:
since process time is *sampled*, not measured, process time isn't exactly invulnerable either.
I can't share that view. The scheduler knows *exactly* what thread is running on the processor at any time, and that thread won't change until the scheduler makes it change. So if you discount time spent in interrupt handlers (which might be falsely accounted to the thread that happens to run at the point of the interrupt), then process time *is* measured, not sampled, on any modern operating system: it is updated whenever the scheduler schedules a different thread.

Of course, the question still is what the resolution of the clock is that makes these measurements. For Windows NT+, I would expect it to be "quantum units", but I'm uncertain whether it could also measure fractions of a quantum unit if the process does a blocking call.
I don't think that sampling errors can explain all the anomalies we've been seeing, but I wouldn't be surprised if a high-resolution wall-time clock on a lightly loaded multiprocess system was, in practice, *more* reliable than sampled process time on an equally loaded system.
On Linux, process time is accounted in jiffies. Unfortunately, for compatibility, times(2) converts that to clock_t, losing precision.

Regards,
Martin
Martin v. Löwis wrote:
since process time is *sampled*, not measured, process time isn't exactly invulnerable either.
I can't share that view. The scheduler knows *exactly* what thread is running on the processor at any time, and that thread won't change until the scheduler makes it change. So if you discount time spent in interrupt handlers (which might be falsely accounted for the thread that happens to run at the point of the interrupt), then process time *is* measured, not sampled, on any modern operating system: it is updated whenever the scheduler schedules a different thread.
updated with what? afaik, the scheduler doesn't have to wait for a timer interrupt to reschedule things (think blocking, or interrupts that request rescheduling, or new processes, or...) -- but it's always the thread that runs when the timer interrupt arrives that gets the entire jiffy time. for example, this script runs for ten seconds, usually without using any process time at all:

import time
for i in range(1000):
    for i in range(1000):
        i+i+i+i
    time.sleep(0.005)

while the same program, without the sleep, will run for a second or two, most of which is assigned to the process.

if the scheduler used the TSC to keep track of times, it would be *measuring* process time. but unless something changed very recently, it doesn't. it's all done by sampling, typically 100 or 1000 times per second.
On Linux, process time is accounted in jiffies. Unfortunately, for compatibility, times(2) converts that to clock_t, losing precision.
times(2) reports time in 1/CLOCKS_PER_SEC second units, while jiffies are counted in 1/HZ second units. on my machine, CLOCKS_PER_SEC is a thousand times larger than HZ. what does this code print on your machine?

#include <stdio.h>
#include <time.h>
#include <unistd.h>

/* the body of the snippet was truncated in the archive; a minimal
   reconstruction of the comparison it presumably printed */
int main(void)
{
    printf("CLOCKS_PER_SEC = %ld\n", (long) CLOCKS_PER_SEC);
    printf("ticks per second (_SC_CLK_TCK) = %ld\n", sysconf(_SC_CLK_TCK));
    return 0;
}
Fredrik Lundh wrote:
it is updated whenever the scheduler schedules a different thread.
updated with what? afaik, the scheduler doesn't have to wait for a timer interrupt to reschedule things (think blocking, or interrupts that request rescheduling, or new processes, or...) -- but it's always the thread that runs when the timer interrupt arrives that gets the entire jiffy time.
Sure: when a thread doesn't consume its entire quantum, accounting becomes difficult. Still, if the scheduler reads the current time when scheduling, it measures the time consumed.
if the scheduler used the TSC to keep track of times, it would be *measuring* process time. but unless something changed very recently, it doesn't.
You mean, "unless something changed very recently" *on Linux*, right? Or when did you last read the sources of Windows XP? It would still be measuring if the scheduler reads the latest value of some system clock, although that would be much less accurate than reading the TSC.
times(2) reports time in 1/CLOCKS_PER_SEC second units, while jiffies are counted in 1/HZ second units. on my machine, CLOCKS_PER_SEC is a thousand times larger than HZ. what does this code print on your machine?
You are right; clock_t allows for higher precision than jiffies.

Regards,
Martin
Martin v. Löwis wrote:
Sure: when a thread doesn't consume its entire quantum, accounting becomes difficult. Still, if the scheduler reads the current time when scheduling, it measures the time consumed.
yeah, but the point is that it *doesn't* read the current time: all the system does is note that "alright, we've reached the end of another jiffy, and this thread was running at that point. now, was it running in user space or in kernel space when we interrupted it?". here's the relevant code, from kernel/timer.c and kernel/sched.c:

#define jiffies_to_cputime(__hz) (__hz)

void update_process_times(int user_tick)
{
    struct task_struct *p = current;
    int cpu = smp_processor_id();

    if (user_tick)
        account_user_time(p, jiffies_to_cputime(1));
    else
        account_system_time(p, HARDIRQ_OFFSET, jiffies_to_cputime(1));
    run_local_timers();
    if (rcu_pending(cpu))
        rcu_check_callbacks(cpu, user_tick);
    scheduler_tick();
    run_posix_cpu_timers(p);
}

void account_user_time(struct task_struct *p, cputime_t cputime)
{
    struct cpu_usage_stat *cpustat = &kstat_this_cpu.cpustat;
    cputime64_t tmp;

    p->utime = cputime_add(p->utime, cputime);

    tmp = cputime_to_cputime64(cputime);
    if (TASK_NICE(p) > 0)
        cpustat->nice = cputime64_add(cpustat->nice, tmp);
    else
        cpustat->user = cputime64_add(cpustat->user, tmp);
}

(update_process_times is called by the hardware timer interrupt handler, once per jiffy, HZ times per second. task_struct contains information about a single thread; cpu_usage_stat is global stats for a CPU.)

for the benchmarks, the problem is of course not that the benchmarking thread gives up too early; it's when other processes give up early, and the benchmark process is next in line. in that case, the benchmark won't use a whole jiffy, but it's still charged for a full jiffy interval by the interrupt handler (in my sleep test, *other processes* got charged for the time the program spent running that inner loop).

a modern computer can do *lots of stuff* in a single jiffy interval (whether it's 15 ms, 10 ms, 4 ms, or 1 ms), and even more in a single scheduler quantum (= a number of jiffies).
You mean, "unless something changed very recently" *on Linux*, right?
on any system involved in this discussion. they all worked the same way, last time I checked ;-)
Or when did you last read the sources of Windows XP?
afaik, all Windows versions based on the current NT kernel (up to and including XP) uses tick-based sampling. I don't know about Vista; given the platform requirements for Vista, it's perfectly possible that they've switched to TSC-based accounting.
It would still be measuring if the scheduler reads the latest value of some system clock, although that would be much less accurate than reading the TSC.
hopefully, this is the last time I will have to repeat this, but on both Windows and Linux, the "system clock" used for process timing is a jiffy counter. </F>
[Fredrik Lundh]
.... but it's always the thread that runs when the timer interrupt arrives that gets the entire jiffy time. for example, this script runs for ten seconds, usually without using any process time at all:
import time
for i in range(1000):
    for i in range(1000):
        i+i+i+i
    time.sleep(0.005)
while the same program, without the sleep, will run for a second or two, most of which is assigned to the process.
Nice example! On my desktop box (WinXP, 3.4GHz), I had to make it nastier to see it consume any "time" without the sleep:

import time
for i in range(1000):
    for i in range(10000):        # 10x bigger
        i+i+i+i*(i+i+i+i)         # more work
    time.sleep(0.005)
raw_input("done")

The raw_input is there so I can see Task Manager's idea of elapsed "CPU Time" (sum of process "user time" and "kernel time") when it's done. Without the sleep, it gets charged 6 CPU seconds. With the sleep, 0 CPU seconds.

But life would be more boring if people believed you the first time ;-)
Tim Peters wrote:
Without the sleep, it gets charged 6 CPU seconds. With the sleep, 0 CPU seconds.
But life would be more boring if people believed you the first time ;-)
This only proves that it uses clock ticks for the accounting, and not something with higher resolution. To find out whether it samples or measures CPU usage, you really have to read the source code of the operating system (or find some documentation from somebody who has seen the source code).

Regards,
Martin
Fredrik Lundh wrote:
Martin v. Löwis wrote:
since process time is *sampled*, not measured, process time isn't exactly invulnerable either.

I can't share that view. The scheduler knows *exactly* what thread is running on the processor at any time, and that thread won't change until the scheduler makes it change. So if you discount time spent in interrupt handlers (which might be falsely accounted to the thread that happens to run at the point of the interrupt), then process time *is* measured, not sampled, on any modern operating system: it is updated whenever the scheduler schedules a different thread.
updated with what? afaik, the scheduler doesn't have to wait for a timer interrupt to reschedule things (think blocking, or interrupts that request rescheduling, or new processes, or...) -- but it's always the thread that runs when the timer interrupt arrives that gets the entire jiffy time. for example, this script runs for ten seconds, usually without using any process time at all:
import time
for i in range(1000):
    for i in range(1000):
        i+i+i+i
    time.sleep(0.005)
while the same program, without the sleep, will run for a second or two, most of which is assigned to the process.
if the scheduler used the TSC to keep track of times, it would be *measuring* process time. but unless something changed very recently, it doesn't. it's all done by sampling, typically 100 or 1000 times per second.
This example is a bit misleading, since chances are high that the benchmark will get a good priority bump from the scheduler.
On Linux, process time is accounted in jiffies. Unfortunately, for compatibility, times(2) converts that to clock_t, losing precision.
times(2) reports time in 1/CLOCKS_PER_SEC second units, while jiffies are counted in 1/HZ second units. on my machine, CLOCKS_PER_SEC is a thousand times larger than HZ. what does this code print on your machine?
You should use getrusage() for user and system time, or even better clock_gettime() (the POSIX real-time APIs).
From the man-page of times:
RETURN VALUE
    The function times returns the number of clock ticks that have
    elapsed since an arbitrary point in the past. ... The number of
    clock ticks per second can be obtained using sysconf(_SC_CLK_TCK);

On my Linux system this returns 100.

-- 
Marc-Andre Lemburg
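The same numbers are reachable from Python without writing C - a small sketch along the lines suggested above:

    import os, resource

    usage = resource.getrusage(resource.RUSAGE_SELF)
    print 'user time:   %f seconds' % usage[0]   # ru_utime
    print 'system time: %f seconds' % usage[1]   # ru_stime
    print 'clock ticks per second:', os.sysconf('SC_CLK_TCK')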
FWIW, these are my findings on the various timing strategies:

* Windows:

  time.time() - not usable; I get timings with an error interval of roughly 30%

  GetProcessTimes() - not usable; I get timings with an error interval of up to 100%, with differences in steps of 15.626ms

  time.clock() - error interval of less than 10%; overall < 0.5%

* Linux:

  time.clock() - not usable; I get timings with an error interval of about 30%, with differences in steps of 100ms

  time.time() - error interval of less than 10%; overall < 0.5%

  resource.getrusage() - error interval of less than 10%; overall < 0.5%, with differences in steps of 10ms

  clock_gettime() - these don't appear to work on my box, even though clock_getres() returns a promising 1ns

All measurements were done on AMD64 boxes, using Linux 2.6 and WinXP Pro with Python 2.4. pybench 2.0 was used (which is not yet checked in), and the warp factor was set to a value that gave benchmark round times of between 2.5 and 3.5 seconds, i.e. short test run-times.

Overall, time.clock() on Windows and time.time() on Linux appear to give the best repeatability of tests, so I'll make those the defaults in pybench 2.0.

In short: Tim wins, I lose. Was a nice experiment, though ;-)

One interesting difference I found while testing on Windows vs. Linux is that the StringMappings test has quite a different run-time on the two systems: around 2500ms on Windows vs. 590ms on Linux (on Python 2.4). UnicodeMappings doesn't show such a significant difference. Perhaps the sprint changed this?!

-- 
Marc-Andre Lemburg
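One way to see the step sizes reported above is to measure each timer's smallest observable increment. A rough sketch (the busy-wait can take a while for coarse timers):

    import time

    def granularity(timer, samples=10):
        # smallest nonzero step the timer reports
        steps = []
        for i in range(samples):
            t0 = timer()
            t1 = timer()
            while t1 == t0:
                t1 = timer()
            steps.append(t1 - t0)
        return min(steps)

    print 'time.time  step: %.6fs' % granularity(time.time)
    print 'time.clock step: %.6fs' % granularity(time.clock)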
M.-A. Lemburg wrote:
* Linux:
time.clock() - not usable; I get timings with error interval of about 30% with differences in steps of 100ms
resource.getrusage() - error interval of less than 10%; overall < 0.5% with differences in steps of 10ms
hmm. I could have sworn clock() returned the sum of the utime and stime fields you get from getrusage() (which is the sum of the utime and stime tick counters for all tasks associated with the process, converted from jiffy count to the appropriate time unit), but glibc is one big maze of twisty little passages, so I'm probably looking at the wrong place. oh, well. </F>
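This is easy to check empirically; if glibc's clock() really is derived from the same counters as getrusage(), the two figures below should agree to within a tick or so. A quick sketch:

    import time, resource

    # burn some CPU time first
    for i in xrange(1000000):
        i * i

    ru = resource.getrusage(resource.RUSAGE_SELF)
    print 'time.clock():        %.3f' % time.clock()
    print 'ru_utime + ru_stime: %.3f' % (ru.ru_utime + ru.ru_stime)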
M.-A. Lemburg wrote:
FWIW, these are my findings on the various timing strategies:
Correction (due to a bug in my pybench dev version):
* Windows:
time.time() - not usable; I get timings with an error interval of roughly 30%
GetProcessTimes() - not usable; I get timings with an error interval of up to 100% with differences in steps of 15.626ms
time.clock() - error interval of less than 10%; overall < 0.5%
* Linux:
time.clock() - not usable; I get timings with error interval of about 30% with differences in steps of 100ms
This should read: steps of 10ms. time.clock() uses POSIX clock ticks which are hard-wired to 100Hz.
time.time() - error interval of less than 10%; overall < 0.5%
resource.getrusage() - error interval of less than 10%; overall < 0.5% with differences in steps of 10ms
This should read: steps of 1ms. The true clock tick frequency on the test machine is 1kHz.
clock_gettime() - these don't appear to work on my box; even though clock_getres() returns a promising 1ns.
All measurements were done on AMD64 boxes, using Linux 2.6 and WinXP Pro with Python 2.4. pybench 2.0 was used (which is not yet checked in) and the warp factor was set to a value that gave benchmark round times of between 2.5 and 3.5 seconds, i.e. short test run-times.
Overall, time.clock() on Windows and time.time() on Linux appear to give the best repeatability of tests, so I'll make those the defaults in pybench 2.0.
In short: Tim wins, I lose.
Was a nice experiment, though ;-)
One interesting difference I found while testing on Windows vs. Linux is that the StringMappings test has quite a different run-time on the two systems: around 2500ms on Windows vs. 590ms on Linux (on Python 2.4). UnicodeMappings doesn't show such a significant difference.
Perhaps the sprint changed this ?!
-- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Jun 06 2006)
M.-A. Lemburg wrote: [...]
Overall, time.clock() on Windows and time.time() on Linux appear to give the best repeatability of tests, so I'll make those the defaults in pybench 2.0.
In short: Tim wins, I lose.
Was a nice experiment, though ;-)
Perhaps so, but it would have been nice if you could have come to this conclusion before asking me not to make this change, which would otherwise have been checked in two weeks ago. Still, as long as we can all agree on this and move forward I suppose the intervening debate at least leaves us better-informed.

regards
Steve

--
Steve Holden       +44 150 684 7255  +1 800 494 3119
Holden Web LLC/Ltd http://www.holdenweb.com
Love me, love my blog http://holdenweb.blogspot.com
Recent Ramblings http://del.icio.us/steve.holden
Steve Holden wrote:
M.-A. Lemburg wrote: [...]
Overall, time.clock() on Windows and time.time() on Linux appear to give the best repeatability of tests, so I'll make those the defaults in pybench 2.0.
In short: Tim wins, I lose.
Was a nice experiment, though ;-)
Perhaps so, but it would have been nice if you could have come to this conclusion before asking me not to make this change, which would otherwise have been checked in two weeks ago.
I still believe that measuring process time is better than wall time and have tried hard to find suitable timers for implementing this. However, the tests using the various different timers have shown that this approach doesn't work out due to the problems with how process time is measured on the platforms in question (Linux and Windows). We should revisit this choice once suitable timers are available on those platforms. Note that even with the wall-time timers and using the minimum function as estimator you still get results which exhibit random noise. At least now we know that there's apparently no way to get it removed.
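The minimum-as-estimator idea, in miniature; timeit's repeat() makes it a one-liner. A sketch:

    import timeit

    t = timeit.Timer('f()', setup='def f():\n    pass')
    # background load can only inflate a timing, never deflate it,
    # so the minimum over several runs is the best estimate
    results = t.repeat(repeat=5, number=1000000)
    print 'minimum: %.3fs' % min(results)
    print 'average: %.3fs' % (sum(results) / len(results))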
Still, as long as we can all agree on this and move forward I suppose the intervening debate at least leaves us better-informed.
Isn't that the whole point of such a discussion ? I'll check in pybench 2.0 once I've tested it enough. -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Jun 07 2006)
M.-A. Lemburg wrote:
One interesting difference I found while testing on Windows vs. Linux is that the StringMappings test have quite a different run-time on both systems: around 2500ms on Windows vs. 590ms on Linux (on Python 2.4). UnicodeMappings doesn't show such a signficant difference.
Perhaps the sprint changed this ?!
nope. but stringbench revealed the same thing, of course. the difference is most likely due to an inefficient implementation of locale-aware character lookups in Visual C (MSVC supports passing wide chars, but I don't think gcc bothers to do that; afaik, it's not part of the C standard). solving this is straightforward (let the locale module set a global flag if Python runs under a non-C locale, and use a built-in table as long as that flag isn't set), but I haven't gotten around to dealing with that yet. </F>
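A rough Python-level sketch of the scheme Fredrik outlines; the flag name and table are made up here, and the real fix would of course live in the C implementation of the string methods:

    import string

    # hypothetical flag; the locale module would set this whenever a
    # non-C LC_CTYPE locale is activated
    NON_C_LOCALE = False

    # built-in case-mapping table for the default C locale
    _C_UPPER = string.maketrans(string.ascii_lowercase,
                                string.ascii_uppercase)

    def upper(s):
        if not NON_C_LOCALE:
            # fast path: plain 256-entry table lookup, no locale machinery
            return s.translate(_C_UPPER)
        # slow path: fall back to the locale-aware C library routines
        return s.upper()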
Some more interesting results from comparing Python 2.4 (other) against the current SVN snapshot (this):

Testnames                        minimum run-time        average run-time
                                this    other   diff    this    other   diff
-------------------------------------------------------------------------------
BuiltinMethodLookup:           142ms   124ms  +14.5%   151ms   134ms  +13.1%
ConcatUnicode:                  95ms   120ms  -20.7%   104ms   131ms  -20.5%
CreateInstances:               102ms    92ms  +10.0%   107ms    96ms  +11.5%
CreateUnicodeWithConcat:        98ms   122ms  -19.2%   103ms   129ms  -20.1%
DictWithFloatKeys:             128ms   149ms  -14.1%   133ms   177ms  -24.8%
NestedForLoops:                141ms   126ms  +11.8%   144ms   128ms  +12.3%
PythonFunctionCalls:           131ms   108ms  +21.5%   133ms   109ms  +21.3%
SecondImport:                  135ms   114ms  +18.3%   140ms   117ms  +20.0%
SecondPackageImport:           136ms   122ms  +11.2%   144ms   124ms  +16.1%
SecondSubmoduleImport:         166ms   146ms  +13.5%   171ms   148ms  +15.9%
SimpleComplexArithmetic:       106ms   131ms  -19.1%   112ms   133ms  -16.2%
StringPredicates:              109ms    96ms  +13.6%   114ms    99ms  +15.7%
TryRaiseExcept:                119ms   137ms  -13.3%   122ms   140ms  -12.6%
UnicodeMappings:               140ms   157ms  -10.8%   141ms   160ms  -11.4%
UnicodePredicates:             111ms    98ms  +12.9%   115ms   100ms  +15.3%
UnicodeSlicing:                101ms   114ms  -11.2%   108ms   116ms   -6.7%

It appears as if the import mechanism took a hit between the versions. The NFS sprint results are also visible.

A little disturbing is the slow-down for Python function calls and the built-in method lookup. Did anything change in these parts of the interpreter ?

This is the machine I used for running the pybench:

Timer: time.time

Machine Details:
   Platform ID: Linux-2.6.8-24.19-default-x86_64-with-SuSE-9.2-x86-64
   Processor: x86_64

-- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Jun 07 2006)
M.-A. Lemburg wrote:
Some more interesting results from comparing Python 2.4 (other) against the current SVN snapshot (this):
Here's the list again, this time without wrapping (sigh):

Testnames                        minimum run-time        average run-time
                                this    other   diff    this    other   diff
-------------------------------------------------------------------------------
BuiltinMethodLookup:           141ms   124ms  +13.9%   148ms   134ms  +10.8%
ConcatUnicode:                  97ms   120ms  -19.5%   104ms   131ms  -20.6%
CreateInstances:               102ms    92ms  +10.3%   104ms    96ms   +8.0%
CreateUnicodeWithConcat:        98ms   122ms  -19.1%   103ms   129ms  -20.6%
DictWithFloatKeys:             128ms   149ms  -14.4%   130ms   177ms  -26.4%
NestedForLoops:                140ms   126ms  +11.1%   143ms   128ms  +11.8%
PythonFunctionCalls:           130ms   108ms  +21.3%   132ms   109ms  +20.9%
SecondImport:                  136ms   114ms  +18.9%   138ms   117ms  +18.2%
SecondPackageImport:           141ms   122ms  +15.4%   143ms   124ms  +15.3%
SecondSubmoduleImport:         166ms   146ms  +13.3%   179ms   148ms  +21.3%
SimpleComplexArithmetic:       107ms   131ms  -18.5%   121ms   133ms   -9.2%
StringPredicates:              109ms    96ms  +13.5%   117ms    99ms  +18.7%
TryRaiseExcept:                115ms   137ms  -16.2%   129ms   140ms   -7.6%
UnicodeMappings:               140ms   157ms  -10.7%   142ms   160ms  -11.3%
UnicodePredicates:             111ms    98ms  +13.3%   115ms   100ms  +15.6%
UnicodeSlicing:                103ms   114ms  -10.1%   108ms   116ms   -6.7%
It appears as if the import mechanism took a hit between the versions.
The NFS sprint results are also visible.
A little disturbing is the slow-down for Python function calls and the built-in method lookup. Did anything change in these parts of the interpreter ?
This is the machine I used for running the pybench:

Timer: time.time

Machine Details:
   Platform ID: Linux-2.6.8-24.19-default-x86_64-with-SuSE-9.2-x86-64
   Processor: x86_64
-- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Jun 07 2006)
M.-A. Lemburg wrote:
Some more interesting results from comparing Python 2.4 (other) against the current SVN snapshot (this):
been there, done that, found the results lacking. we spent a large part of the first NFS day investigating all reported slowdowns, and found that only one slowdown could be independently confirmed (the exception issue). try running some of your subtests under timeit, and see if you can repeat the results. </F>
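A sketch of what "running a subtest under timeit" can look like: lift the inner-loop statements out of the pybench test and hand them to timeit directly (the function bodies below are stand-ins):

    import timeit

    setup = """\
    def f():
        pass
    def f1(x):
        pass
    i = 1
    """
    stmt = "f(); f1(i); f1(i); f1(i); f1(i)"
    t = timeit.Timer(stmt, setup=setup)
    for result in t.repeat(repeat=3, number=100000):
        print '%.2f usec per loop' % (result / 100000 * 1e6)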
Fredrik Lundh wrote:
M.-A. Lemburg wrote:
Some more interesting results from comparing Python 2.4 (other) against the current SVN snapshot (this):
been there, done that, found the results lacking.
we spent a large part of the first NFS day to investigate all reported slowdowns, and found that only one slowdown could be independently confirmed (the exception issue).
try running some of your subtests under timeit, and see if you can repeat the results.
The results were produced by pybench 2.0 and use time.time on Linux, plus a different calibration strategy. As a result these timings are a lot more repeatable than with pybench 1.3 and I've confirmed the timings using several runs to make sure.

Still, here's the timeit.py measurement of the PythonFunctionCall test (note that I've scaled down the test in terms of number of rounds for timeit.py):

Python 2.4:

10 loops, best of 3: 21.9 msec per loop
10 loops, best of 3: 21.8 msec per loop
10 loops, best of 3: 21.8 msec per loop
10 loops, best of 3: 21.9 msec per loop
10 loops, best of 3: 21.9 msec per loop

Python 2.5 as of last night:

100 loops, best of 3: 18 msec per loop
100 loops, best of 3: 18.4 msec per loop
100 loops, best of 3: 18.4 msec per loop
100 loops, best of 3: 18.2 msec per loop

The pybench 2.0 result:

PythonFunctionCalls:           130ms   108ms  +21.3%   132ms   109ms  +20.9%

Looks about right, I'd say.

-- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Jun 08 2006)
M.-A. Lemburg wrote:
Still, here's the timeit.py measurement of the PythonFunctionCall test (note that I've scaled down the test in terms of number of rounds for timeit.py):
Python 2.4:

10 loops, best of 3: 21.9 msec per loop
10 loops, best of 3: 21.8 msec per loop
10 loops, best of 3: 21.8 msec per loop
10 loops, best of 3: 21.9 msec per loop
10 loops, best of 3: 21.9 msec per loop

Python 2.5 as of last night:

100 loops, best of 3: 18 msec per loop
100 loops, best of 3: 18.4 msec per loop
100 loops, best of 3: 18.4 msec per loop
100 loops, best of 3: 18.2 msec per loop
The pybench 2.0 result:
PythonFunctionCalls: 130ms 108ms +21.3% 132ms 109ms +20.9%
Looks about right, I'd say.
If the pybench result is still 2.5 first, then the two results are contradictory - your timeit results are showing Python 2.5 as being faster (assuming the headings are on the right blocks of tests).

Cheers,
Nick.

--
Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
http://www.boredomandlaziness.org
Nick Coghlan wrote:
M.-A. Lemburg wrote:
Still, here's the timeit.py measurement of the PythonFunctionCall test (note that I've scaled down the test in terms of number of rounds for timeit.py):
Python 2.5 as of last night:
10 loops, best of 3: 21.9 msec per loop
10 loops, best of 3: 21.8 msec per loop
10 loops, best of 3: 21.8 msec per loop
10 loops, best of 3: 21.9 msec per loop
10 loops, best of 3: 21.9 msec per loop
Python 2.4:
100 loops, best of 3: 18 msec per loop
100 loops, best of 3: 18.4 msec per loop
100 loops, best of 3: 18.4 msec per loop
100 loops, best of 3: 18.2 msec per loop
The pybench 2.0 result:
PythonFunctionCalls: 130ms 108ms +21.3% 132ms 109ms +20.9%
Looks about right, I'd say.
If the pybench result is still 2.5 first, then the two results are contradictory - your timeit results are showing Python 2.5 as being faster (assuming the headings are on the right blocks of tests).
<sigh> I put the headings for the timeit.py output on the wrong blocks. Thanks for pointing this out.

Anyway, try for yourself. Just add these lines to pybench/Call.py at the end and then run Call.py using Python 2.4 vs. 2.5:

### Test to make Fredrik happy...

if __name__ == '__main__':
    import timeit
    timeit.TestClass = PythonFunctionCalls
    timeit.main(['-s', 'test = TestClass(); test.rounds = 1000',
                 'test.test()'])

-- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Jun 08 2006)
Fredrik Lundh wrote:
M.-A. Lemburg wrote:
<sigh> I put the headings for the timeit.py output on the wrong blocks. Thanks for pointing this out.
so how do you explain the Try/Except results, where timeit and pybench seems to agree?
The pybench results match those of timeit.py on my test machine in both cases. I just mixed up the headers when I wrote the email.

Here's the console print-out:

Tools/pybench> ~/projects/Python/Installation/bin/python Calls.py
10 loops, best of 3: 21.8 msec per loop
Tools/pybench> ~/projects/Python/Installation/bin/python Exceptions.py
100 loops, best of 3: 15.4 msec per loop
Tools/pybench> ~/projects/Python/Installation/bin/python
Python 2.5a2 (trunk, Jun 8 2006, 01:51:06)
[GCC 3.3.4 (pre 3.3.5 20040809)] on linux2

Tools/pybench> python Calls.py
100 loops, best of 3: 18.2 msec per loop
Tools/pybench> python Exceptions.py
100 loops, best of 3: 17 msec per loop
Tools/pybench> python
Python 2.4.2 (#1, Oct 1 2005, 15:24:35)
[GCC 3.3.4 (pre 3.3.5 20040809)] on linux2

Calls.py is using timeit.py against the PythonFunctionCalls test and Exceptions.py the TryRaiseExcept test. Function calls are slower in 2.5, while try-except in 2.5 is faster than 2.4.

I've attached the Calls.py file below.

-- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Jun 08 2006)
from pybench import Test

class PythonFunctionCalls(Test):

    version = 2.0
    operations = 5*(1+4+4+2)
    rounds = 60000

    def test(self):

        global f,f1,g,h

        # define functions
        def f():
            pass

        def f1(x):
            pass

        def g(a,b,c):
            return a,b,c

        def h(a,b,c,d=1,e=2,f=3):
            return d,e,f

        # do calls
        for i in xrange(self.rounds):

            f()
            f1(i)
            f1(i)
            f1(i)
            f1(i)
            g(i,i,i)
            g(i,i,i)
            g(i,i,i)
            g(i,i,i)
            h(i,i,3,i,i)
            h(i,i,i,2,i,3)

            f()
            f1(i)
            f1(i)
            f1(i)
            f1(i)
            g(i,i,i)
            g(i,i,i)
            g(i,i,i)
            g(i,i,i)
            h(i,i,3,i,i)
            h(i,i,i,2,i,3)

            f()
            f1(i)
            f1(i)
            f1(i)
            f1(i)
            g(i,i,i)
            g(i,i,i)
            g(i,i,i)
            g(i,i,i)
            h(i,i,3,i,i)
            h(i,i,i,2,i,3)

            f()
            f1(i)
            f1(i)
            f1(i)
            f1(i)
            g(i,i,i)
            g(i,i,i)
            g(i,i,i)
            g(i,i,i)
            h(i,i,3,i,i)
            h(i,i,i,2,i,3)

            f()
            f1(i)
            f1(i)
            f1(i)
            f1(i)
            g(i,i,i)
            g(i,i,i)
            g(i,i,i)
            g(i,i,i)
            h(i,i,3,i,i)
            h(i,i,i,2,i,3)

    def calibrate(self):

        global f,f1,g,h

        # define functions
        def f():
            pass

        def f1(x):
            pass

        def g(a,b,c):
            return a,b,c

        def h(a,b,c,d=1,e=2,f=3):
            return d,e,f

        # do calls
        for i in xrange(self.rounds):
            pass

###

class BuiltinFunctionCalls(Test):

    version = 2.0
    operations = 5*(2+5+5+5)
    rounds = 60000

    def test(self):

        # localize functions
        f0 = globals
        f1 = hash
        f2 = cmp
        f3 = range

        # do calls
        for i in xrange(self.rounds):

            f0()
            f0()
            f1(i)
            f1(i)
            f1(i)
            f1(i)
            f1(i)
            f2(1,2)
            f2(1,2)
            f2(1,2)
            f2(1,2)
            f2(1,2)
            f3(1,3,2)
            f3(1,3,2)
            f3(1,3,2)
            f3(1,3,2)
            f3(1,3,2)

            f0()
            f0()
            f1(i)
            f1(i)
            f1(i)
            f1(i)
            f1(i)
            f2(1,2)
            f2(1,2)
            f2(1,2)
            f2(1,2)
            f2(1,2)
            f3(1,3,2)
            f3(1,3,2)
            f3(1,3,2)
            f3(1,3,2)
            f3(1,3,2)

            f0()
            f0()
            f1(i)
            f1(i)
            f1(i)
            f1(i)
            f1(i)
            f2(1,2)
            f2(1,2)
            f2(1,2)
            f2(1,2)
            f2(1,2)
            f3(1,3,2)
            f3(1,3,2)
            f3(1,3,2)
            f3(1,3,2)
            f3(1,3,2)

            f0()
            f0()
            f1(i)
            f1(i)
            f1(i)
            f1(i)
            f1(i)
            f2(1,2)
            f2(1,2)
            f2(1,2)
            f2(1,2)
            f2(1,2)
            f3(1,3,2)
            f3(1,3,2)
            f3(1,3,2)
            f3(1,3,2)
            f3(1,3,2)

            f0()
            f0()
            f1(i)
            f1(i)
            f1(i)
            f1(i)
            f1(i)
            f2(1,2)
            f2(1,2)
            f2(1,2)
            f2(1,2)
            f2(1,2)
            f3(1,3,2)
            f3(1,3,2)
            f3(1,3,2)
            f3(1,3,2)
            f3(1,3,2)

    def calibrate(self):

        # localize functions
        f0 = dir
        f1 = hash
        f2 = range
        f3 = range

        # do calls
        for i in xrange(self.rounds):
            pass

###

class PythonMethodCalls(Test):

    version = 2.0
    operations = 5*(6 + 5 + 4)
    rounds = 30000

    def test(self):

        class c:

            x = 2
            s = 'string'

            def f(self):
                return self.x

            def j(self,a,b):
                self.y = a
                self.t = b
                return self.y

            def k(self,a,b,c=3):
                self.y = a
                self.s = b
                self.t = c

        o = c()

        for i in xrange(self.rounds):

            o.f()
            o.f()
            o.f()
            o.f()
            o.f()
            o.f()
            o.j(i,i)
            o.j(i,i)
            o.j(i,2)
            o.j(i,2)
            o.j(2,2)
            o.k(i,i)
            o.k(i,2)
            o.k(i,2,3)
            o.k(i,i,c=4)

            o.f()
            o.f()
            o.f()
            o.f()
            o.f()
            o.f()
            o.j(i,i)
            o.j(i,i)
            o.j(i,2)
            o.j(i,2)
            o.j(2,2)
            o.k(i,i)
            o.k(i,2)
            o.k(i,2,3)
            o.k(i,i,c=4)

            o.f()
            o.f()
            o.f()
            o.f()
            o.f()
            o.f()
            o.j(i,i)
            o.j(i,i)
            o.j(i,2)
            o.j(i,2)
            o.j(2,2)
            o.k(i,i)
            o.k(i,2)
            o.k(i,2,3)
            o.k(i,i,c=4)

            o.f()
            o.f()
            o.f()
            o.f()
            o.f()
            o.f()
            o.j(i,i)
            o.j(i,i)
            o.j(i,2)
            o.j(i,2)
            o.j(2,2)
            o.k(i,i)
            o.k(i,2)
            o.k(i,2,3)
            o.k(i,i,c=4)

            o.f()
            o.f()
            o.f()
            o.f()
            o.f()
            o.f()
            o.j(i,i)
            o.j(i,i)
            o.j(i,2)
            o.j(i,2)
            o.j(2,2)
            o.k(i,i)
            o.k(i,2)
            o.k(i,2,3)
            o.k(i,i,c=4)

    def calibrate(self):

        class c:

            x = 2
            s = 'string'

            def f(self):
                return self.x

            def j(self,a,b):
                self.y = a
                self.t = b

            def k(self,a,b,c=3):
                self.y = a
                self.s = b
                self.t = c

        o = c

        for i in xrange(self.rounds):
            pass

###

class Recursion(Test):

    version = 2.0
    operations = 5
    rounds = 100000

    def test(self):

        global f

        def f(x):
            if x > 1:
                return f(x-1)
            return 1

        for i in xrange(self.rounds):
            f(10)
            f(10)
            f(10)
            f(10)
            f(10)

    def calibrate(self):

        global f

        def f(x):
            if x > 0:
                return f(x-1)
            return 1

        for i in xrange(self.rounds):
            pass

### Test to make Fredrik happy...

if __name__ == '__main__':
    import timeit
    timeit.TestClass = PythonFunctionCalls
    timeit.main(['-s', 'test = TestClass(); test.rounds = 1000',
                 'test.test()'])
M.-A. Lemburg wrote:
The pybench results match those of timeit.py on my test machine in both cases. I just mixed up the headers when I wrote the email.
on a line by line basis ?
Testnames                        minimum run-time        average run-time
                                this    other   diff    this    other   diff
-------------------------------------------------------------------------------
PythonFunctionCalls:           130ms   108ms  +21.3%   132ms   109ms  +20.9%
TryRaiseExcept:                115ms   137ms  -16.2%   129ms   140ms   -7.6%
</F>
Fredrik Lundh wrote:
M.-A. Lemburg wrote:
The pybench results match those of timeit.py on my test machine in both cases. I just mixed up the headers when I wrote the email.
on a line by line basis ?
No idea what you mean ? I posted the corrected version after Nick told me about the apparent mistake.
Testnames                        minimum run-time        average run-time
                                this    other   diff    this    other   diff
-------------------------------------------------------------------------------
PythonFunctionCalls:           130ms   108ms  +21.3%   132ms   109ms  +20.9%
TryRaiseExcept:                115ms   137ms  -16.2%   129ms   140ms   -7.6%
(this=Python 2.5; other=Python 2.4)

timeit.py results (see my last email), converted to the pybench output:

PythonFunctionCalls:           21.8ms  18.2ms  +19.8%
TryRaiseExcept:                15.4ms  17.0ms   -9.4%

The timeit.py results for TryRaiseExcept on Python 2.4 vary between 17.0-18.1ms. With Python 2.5 this doesn't happen.

-- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Jun 08 2006)
M.-A. Lemburg wrote:
The pybench results match those of timeit.py on my test machine in both cases.
but they don't match the timeit results on similar machines, nor do they reflect what was done at the sprint.
Tools/pybench> ~/projects/Python/Installation/bin/python Calls.py
10 loops, best of 3: 21.8 msec per loop
10 loops ? what happens if you run the actual test code (the stuff inside the for loop) inside timeit, instead of running your test loop inside timeit? </F>
Fredrik Lundh wrote:
M.-A. Lemburg wrote:
The pybench results match those of timeit.py on my test machine in both cases.
but they don't match the timeit results on similar machines, nor do they reflect what was done at the sprint.
Huh ? They do show the speedups you achieved at the sprint.
Tools/pybench> ~/projects/Python/Installation/bin/python Calls.py 10 loops, best of 3: 21.8 msec per loop
10 loops ?
what happens if you run the actual test code (the stuff inside the for loop) inside timeit, instead of running your test loop inside timeit?
More or less the same results:

Python 2.4:

Tools/pybench> python Calls.py
100000 loops, best of 3: 18.9 usec per loop
Tools/pybench> python Calls.py
100000 loops, best of 3: 18.8 usec per loop
Tools/pybench> python Calls.py
100000 loops, best of 3: 18.7 usec per loop

Python 2.5 (trunk as of 2006-06-08):

Tools/pybench> ~/projects/Python/Installation/bin/python Calls.py
10000 loops, best of 3: 22.9 usec per loop
Tools/pybench> ~/projects/Python/Installation/bin/python Calls.py
10000 loops, best of 3: 23.7 usec per loop
Tools/pybench> ~/projects/Python/Installation/bin/python Calls.py
10000 loops, best of 3: 23.4 usec per loop

I've attached the modified Call.py below so that you can run it as well.

All this on AMD64, Linux2.6, gcc3.3.

-- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Jun 08 2006)
from pybench import Test

# (the PythonFunctionCalls, BuiltinFunctionCalls, PythonMethodCalls
#  and Recursion test classes are identical to the earlier Calls.py
#  attachment and are not repeated here)

### Test to make Fredrik happy...

if __name__ == '__main__':
    import timeit
    if 0:
        timeit.TestClass = PythonFunctionCalls
        timeit.main(['-s', 'test = TestClass(); test.rounds = 1000',
                     'test.test()'])
    else:
        setup = """\
global f,f1,g,h

# define functions
def f():
    pass

def f1(x):
    pass

def g(a,b,c):
    return a,b,c

def h(a,b,c,d=1,e=2,f=3):
    return d,e,f

i = 1
"""
        test = """\
f()
f1(i)
f1(i)
f1(i)
f1(i)
g(i,i,i)
g(i,i,i)
g(i,i,i)
g(i,i,i)
h(i,i,3,i,i)
h(i,i,i,2,i,3)
f()
f1(i)
f1(i)
f1(i)
f1(i)
g(i,i,i)
g(i,i,i)
g(i,i,i)
g(i,i,i)
h(i,i,3,i,i)
h(i,i,i,2,i,3)
f()
f1(i)
f1(i)
f1(i)
f1(i)
g(i,i,i)
g(i,i,i)
g(i,i,i)
g(i,i,i)
h(i,i,3,i,i)
h(i,i,i,2,i,3)
f()
f1(i)
f1(i)
f1(i)
f1(i)
g(i,i,i)
g(i,i,i)
g(i,i,i)
g(i,i,i)
h(i,i,3,i,i)
h(i,i,i,2,i,3)
f()
f1(i)
f1(i)
f1(i)
f1(i)
g(i,i,i)
g(i,i,i)
g(i,i,i)
g(i,i,i)
h(i,i,3,i,i)
h(i,i,i,2,i,3)
"""
        timeit.main(['-s', setup, test])
M.-A. Lemburg wrote:
Huh ? They do show the speedups you achieved at the sprint.
the results you just posted appear to show a 20% slowdown for function calls, and a 10% speedup for exceptions. both things were optimized at the sprint, and the improvements were confirmed on several machines. on my machine, using timeit on the test bodies, I get noticeable speedups for both tests; from my earlier post: function calls:
25 usec per loop for Python 2.4.3
22.5 usec per loop for Python 2.5 trunk
try/except:
15.1 usec per loop for Python 2.4.3
23.5 usec per loop for Python 2.5 alpha 2
11.6 usec per loop for Python 2.5 trunk
maybe the function call issue is an AMD64 thing? or a compiler thing? do you see the same result on other hardware? </F>
On 6/8/06, M.-A. Lemburg wrote:
All this on AMD64, Linux2.6, gcc3.3.
FWIW, my AMD64, linux 2.6, gcc 4.0 machine reports 29.0-29.5 usec for 2.5, 30.0-31.0 for 2.4 and 30.5-31.5 for 2.3, using the code you attached. In other words, 2.5 is definitely not slower here. At least, not if I use the same compile options for 2.5 as for 2.4... ;-)
--
Thomas Wouters
Thomas Wouters wrote:
On 6/8/06, M.-A. Lemburg wrote:

All this on AMD64, Linux2.6, gcc3.3.
FWIW, my AMD64, linux 2.6, gcc 4.0 machine reports 29.0-29.5 usec for 2.5, 30.0-31.0 for 2.4 and 30.5-31.5 for 2.3, using the code you attached. In other words, 2.5 is definitely not slower here. At least, not if I use the same compile options for 2.5 as for 2.4... ;-)
I checked, both Python versions were compiled using these options (and the same compiler):

# Compiler options
OPT=            -DNDEBUG -g -O3 -Wall -Wstrict-prototypes
BASECFLAGS=     -fno-strict-aliasing

Perhaps it's a new feature in gcc 4.0 that makes the slow-down I see turn into a speedup :-)

-- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Jun 08 2006)
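The build flags of a given interpreter can be read back at runtime, which makes this sort of cross-checking easier. A sketch using distutils (on Windows, fewer of these variables are available):

    from distutils import sysconfig

    for var in ('CC', 'OPT', 'BASECFLAGS'):
        print '%-12s %s' % (var + ':', sysconfig.get_config_var(var))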
On 6/8/06, M.-A. Lemburg wrote:
Perhaps it's a new feature in gcc 4.0 that makes the slow-down I see turn into a speedup :-)
It seems so. I tested with gcc 2.95, 3.3 and 4.0 on FreeBSD 4.10 (only machine I had available with those gcc versions) and both 2.95 and 4.0 show a 10-20% speedup of your testcase in 2.5 compared to 2.4. 3.3 showed a 10% slowdown or so.
Test-more'ly y'rs,
--
Thomas Wouters
Thomas Wouters wrote:
On 6/8/06, M.-A. Lemburg wrote:

Perhaps it's a new feature in gcc 4.0 that makes the slow-down I see turn into a speedup :-)
It seems so. I tested with gcc 2.95, 3.3 and 4.0 on FreeBSD 4.10 (only machine I had available with those gcc versions) and both 2.95 and 4.0 show a 10-20% speedup of your testcase in 2.5 compared to 2.4. 3.3 showed a 10% slowdown or so.
Does 4.0 show a general slowdown on your test machines? I saw a drop of average Pystones from 44000 to 40000 and from 42000 to 39000 on my boxes switching from GCC 3.4.6 to 4.1.1.

Georg
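For anyone reproducing this: pystone ships in the standard library's test package. A sketch that takes the best of a few runs:

    from test import pystone

    # run the standard benchmark a few times and report the best rate
    best = 0.0
    for i in range(3):
        benchtime, stones = pystone.pystones()
        best = max(best, stones)
    print 'best: %.0f pystones/second' % best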
On 6/8/06, Georg Brandl wrote:
Does 4.0 show a general slowdown on your test machines? I saw a drop of average Pystones from 44000 to 40000 and from 42000 to 39000 on my boxes switching from GCC 3.4.6 to 4.1.1.
Yep, looks like it does. Don't have time to run more extensive tests, though (I'd have to make sure the machine is unloaded for a much longer period of time, and I don't have the time :)
--
Thomas Wouters
Thomas Wouters wrote:
On 6/8/06, M.-A. Lemburg wrote:

Perhaps it's a new feature in gcc 4.0 that makes the slow-down I see turn into a speedup :-)
It seems so. I tested with gcc 2.95, 3.3 and 4.0 on FreeBSD 4.10 (only machine I had available with those gcc versions) and both 2.95 and 4.0 show a 10-20% speedup of your testcase in 2.5 compared to 2.4. 3.3 showed a 10% slowdown or so.
Test-more'ly y'rs,
Looks like it's time to upgrade to SuSE 10.1, then :-) Thanks for checking. -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Jun 08 2006)
M.-A. Lemburg wrote:
Still, here's the timeit.py measurement of the PythonFunctionCall test (note that I've scaled down the test in terms of number of rounds for timeit.py):
Python 2.4: 10 loops, best of 3: 21.9 msec per loop 10 loops, best of 3: 21.8 msec per loop 10 loops, best of 3: 21.8 msec per loop 10 loops, best of 3: 21.9 msec per loop 10 loops, best of 3: 21.9 msec per loop
Python 2.5 as of last night: 100 loops, best of 3: 18 msec per loop 100 loops, best of 3: 18.4 msec per loop 100 loops, best of 3: 18.4 msec per loop 100 loops, best of 3: 18.2 msec per loop
The pybench 2.0 result:
PythonFunctionCalls: 130ms 108ms +21.3% 132ms 109ms +20.9%
Looks about right, I'd say.
except for the sign, I'd say. pybench reported a slowdown from 108 to 130 ms, which prompted you to write

> A little disturbing is the slow-down for Python function calls
> and the built-in method lookup. Did anything change in these parts
> of the interpreter ?

but timeit is reporting a ~20% speedup (21.8 to 18 ms).

on my machine, using the loop body from Calls.PythonFunctionCalls.test as a separate global function called by timeit, I get:

    25 usec per loop for Python 2.4.3
    22.5 usec per loop for Python 2.5 trunk

which seems to match your timeit results quite well. and we *did* speed up frame handling on the reykjavik sprint.

another sprint optimization was exception handling, and pybench did notice this (137 to 115 ms). here's what timeit says on my machine:

    15.1 usec per loop for Python 2.4.3
    23.5 usec per loop for Python 2.5 alpha 2
    11.6 usec per loop for Python 2.5 trunk

something's not quite right...

</F>
M.-A. Lemburg wrote:
The results were produced by pybench 2.0 and use time.time on Linux, plus a different calibration strategy. As a result these timings are a lot more repeatable than with pybench 1.3 and I've confirmed the timings using several runs to make sure.
can you check in 2.0 ? (if it's not quite ready for public consumption, put it in the sandbox). </F>
Fredrik Lundh wrote:
M.-A. Lemburg wrote:
The results were produced by pybench 2.0 and use time.time on Linux, plus a different calibration strategy. As a result these timings are a lot more repeatable than with pybench 1.3 and I've confirmed the timings using several runs to make sure.
can you check in 2.0 ? (if it's not quite ready for public consumption, put it in the sandbox).
I'll check it in once it's ready for prime-time, either later today or early next week. You can download a current snapshot from: http://www.egenix.com/files/python/pybench-2.0-2006-06-09.zip -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Jun 09 2006)
M.-A. Lemburg wrote:
You can download a current snapshot from:
http://www.egenix.com/files/python/pybench-2.0-2006-06-09.zip
believe it or not, but this hangs on my machine, under 2.5 trunk. and it hangs hard; neither control-c, break, nor the task manager manages to kill it. if it's any clue, it prints
-------------------------------------------------------------------------------
PYBENCH 2.0
-------------------------------------------------------------------------------
* using Python 2.5a2
* disabled garbage collection
* system check interval set to maximum: 2147483647
* using timer: time.clock
and that's it; the process is just sitting there, using exactly 0% CPU. </F>
Fredrik Lundh wrote:
M.-A. Lemburg wrote:
You can download a current snapshot from:
http://www.egenix.com/files/python/pybench-2.0-2006-06-09.zip
believe it or not, but this hangs on my machine, under 2.5 trunk. and it hangs hard; neither control-c, break, nor the task manager manages to kill it.
Weird.
if it's any clue, it prints
-------------------------------------------------------------------------------
PYBENCH 2.0
-------------------------------------------------------------------------------
* using Python 2.5a2
* disabled garbage collection
* system check interval set to maximum: 2147483647
* using timer: time.clock
and that's it; the process is just sitting there, using exactly 0% CPU.
This is the output to expect:

-------------------------------------------------------------------------------
PYBENCH 2.0
-------------------------------------------------------------------------------
* using Python 2.4.2
* disabled garbage collection
* system check interval set to maximum: 2147483647
* using timer: time.time

Calibrating tests. Please wait...

Running 10 round(s) of the suite at warp factor 10:

* Round 1 done in 6.627 seconds.
* Round 2 done in 7.307 seconds.
* Round 3 done in 7.180 seconds.
...

Note that the calibration step takes a while.

Looking at the code, the only place where it could hang (because it's relying on a few external tools) is when fetching the platform details:

def get_machine_details():

    import platform
    buildno, builddate = platform.python_build()
    python = platform.python_version()
    if python > '2.0':
        try:
            unichr(100000)
        except ValueError:
            # UCS2 build (standard)
            unicode = 'UCS2'
        else:
            # UCS4 build (most recent Linux distros)
            unicode = 'UCS4'
    else:
        unicode = None
    bits, linkage = platform.architecture()
    return {
        'platform': platform.platform(),
        'processor': platform.processor(),
        'executable': sys.executable,
        'python': platform.python_version(),
        'compiler': platform.python_compiler(),
        'buildno': buildno,
        'builddate': builddate,
        'unicode': unicode,
        'bits': bits,
        }

It does run fine on my WinXP machine, both with and without the win32 package installed.

-- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Jun 09 2006)
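One way to narrow a hang like this down is to exercise the platform calls one at a time, flushing a marker before each call. A sketch:

    import sys
    import platform

    for name in ('python_build', 'python_version', 'architecture',
                 'platform', 'processor', 'python_compiler'):
        sys.stdout.write('%s... ' % name)
        sys.stdout.flush()          # make sure the marker is visible first
        print getattr(platform, name)()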
Fredrik, could you check whether the get_machine_details() function is causing the hang on your machine ?

Does anyone else observe this as well ?

I'm about to check in version 2.0 of pybench, but would like to get this resolved first, if possible.

Thanks,

-- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Jun 13 2006)
M.-A. Lemburg wrote:
Fredrik Lundh wrote:
M.-A. Lemburg wrote:
You can download a current snapshot from:
http://www.egenix.com/files/python/pybench-2.0-2006-06-09.zip

believe it or not, but this hangs on my machine, under 2.5 trunk. and it hangs hard; neither control-c, break, nor the task manager manages to kill it.
Weird.
if it's any clue, it prints
-------------------------------------------------------------------------------
PYBENCH 2.0
-------------------------------------------------------------------------------
* using Python 2.5a2
* disabled garbage collection
* system check interval set to maximum: 2147483647
* using timer: time.clock

and that's it; the process is just sitting there, using exactly 0% CPU.
This is the output to expect:
-------------------------------------------------------------------------------
PYBENCH 2.0
-------------------------------------------------------------------------------
* using Python 2.4.2
* disabled garbage collection
* system check interval set to maximum: 2147483647
* using timer: time.time
Calibrating tests. Please wait...
Running 10 round(s) of the suite at warp factor 10:
* Round 1 done in 6.627 seconds.
* Round 2 done in 7.307 seconds.
* Round 3 done in 7.180 seconds.
...
Note that the calibration step takes a while.
Looking at the code, the only place where it could hang (because it's relying on a few external tools) is when fetching the platform details:
def get_machine_details():
    import platform
    buildno, builddate = platform.python_build()
    python = platform.python_version()
    if python > '2.0':
        try:
            unichr(100000)
        except ValueError:
            # UCS2 build (standard)
            unicode = 'UCS2'
        else:
            # UCS4 build (most recent Linux distros)
            unicode = 'UCS4'
    else:
        unicode = None
    bits, linkage = platform.architecture()
    return {
        'platform': platform.platform(),
        'processor': platform.processor(),
        'executable': sys.executable,
        'python': platform.python_version(),
        'compiler': platform.python_compiler(),
        'buildno': buildno,
        'builddate': builddate,
        'unicode': unicode,
        'bits': bits,
        }
It does run fine on my WinXP machine, both with and without the win32 package installed.
FYI: I've just checked in pybench 2.0 under Tools/pybench/.

Please give it a go and let me know whether the new calibration strategy and default timers result in better repeatability of the benchmark results.

I've tested the release extensively on Windows and Linux and found that the test times are repeatable within +/- 5% for each test and under 1% for the whole suite.

Note that this is an incompatible change to pybench. You will not be able to compare results from runs using pybench 1.x against results from pybench 2.0 (and pybench will refuse to display such diffs).

Enjoy,

-- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Jun 13 2006)
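Typical before/after usage looks something like the following console session; the option names here are from memory, so check pybench.py --help for the exact spelling:

Tools/pybench> python2.4 pybench.py -f py24.pybench
Tools/pybench> python2.5 pybench.py -c py24.pybench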
[Fredrik Lundh]
... since process time is *sampled*, not measured, process time isn't exactly in- vulnerable either.
[Martin v. Löwis]
I can't share that view. The scheduler knows *exactly* what thread is running on the processor at any time, and that thread won't change until the scheduler makes it change. So if you discount time spent in interrupt handlers (which might be falsely accounted for the thread that happens to run at the point of the interrupt), then process time *is* measured, not sampled, on any modern operating system: it is updated whenever the scheduler schedules a different thread.
That doesn't seem to agree with, e.g.,

    http://lwn.net/2001/0412/kernel.php3

under "No more jiffies?":

    ... Among other things, it imposes a 10ms resolution on most
    timing-related activities, which can make it hard for user-space
    programs that need a tighter control over time. It also guarantees
    that process accounting will be inaccurate. Over the course of one
    10ms jiffy, several processes might have run, but the one actually
    on the CPU when the timer interrupt happens gets charged for the
    entire interval.

Maybe this varies by Linux flavor or version? While the article above was published in 2001, Googling didn't turn up any hint that Linux jiffies have actually gone away, or become better loved, since then.
Tim Peters wrote:
Maybe this varies by Linux flavor or version? While the article above was published in 2001, Googling didn't turn up any hint that Linux jiffies have actually gone away, or become better loved, since then.
well, on x86, they have changed from 10 ms in 2.4 to 1 ms in early 2.6 releases and 4 ms in later 2.6 releases, but that's about it. (the code in my previous post was from a 2.6.17 development version, which, afaict, is about as up to date as you can be). note that the jiffy interrupt handler does use the TSC (or similar mechanism) to update the wall clock time, so it wouldn't be that hard to refactor the code to use it also for process accounting. but I suppose the devil is in the backwards-compatibility details. just setting the HZ value to something very large will probably not work very well... </F>
Tim Peters wrote:
then process time *is* measured, not sampled, on any modern operating system: it is updated whenever the scheduler schedules a different thread.
That doesn't seem to agree with, e.g.,
http://lwn.net/2001/0412/kernel.php3
under "No more jiffies?": [...]
Maybe this varies by Linux flavor or version?
No, Fredrik is right: Linux samples process time, instead of measuring it. That only proves it is not a modern operating system :-)

I would still hope that Windows measures instead of sampling.

Regards,
Martin
M.-A. Lemburg wrote:
Fredrik Lundh wrote:
M.-A. Lemburg wrote:
Seriously, I've been using and running pybench for years and even though tweaks to the interpreter do sometimes result in speedups or slow-downs where you wouldn't expect them (due to the interpreter using the Python objects), they are reproducible and often enough have uncovered that optimizations in one area may well result in slow-downs in other areas.
Often enough the results are related to low-level features of the architecture you're using to run the code such as cache size, cache lines, number of registers in the CPU or on the FPU stack, etc. etc.
and that observation has never made you stop and think about whether there might be some problem with the benchmarking approach you're using?
The approach pybench is using is as follows:
* Run a calibration step which does the same as the actual test without the operation being tested (i.e. call the function running the test, set up the for-loop, constant variables, etc.)
The calibration step is run multiple times and is used to calculate an average test overhead time.
I believe my recent changes now take the minimum time rather than computing an average, since the minimum seems to be the best reflection of achievable speed. I assumed that we wanted to measure achievable speed rather than average speed as our benchmark of performance.
* Run the actual test which runs the operation multiple times.
The test is then adjusted to make sure that the test overhead / test run ratio remains within reasonable bounds.
If needed, the operation code is repeated verbatim in the for-loop, to decrease the ratio.
* Repeat the above for each test in the suite
* Repeat the suite N number of rounds
* Calculate the average run time of all test runs in all rounds.
Again, we are now using the minimum value. The reasons are similar: if extraneous processes interfere with timings then we don't want that to be reflected in the given timings. That's why we now report "notional minimum round time", since it's highly unlikely that any specific test round will give the minimum time for all tests. Even with these changes we still see some disturbing variations in timing both on Windows and on Unix-like platforms.
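In miniature, the calibration-plus-minimum scheme described above looks something like this; a toy sketch, not pybench's actual code:

    import time

    ROUNDS = 100000

    def calibrate():
        # same harness as test(), minus the operation under test
        t0 = time.time()
        for i in xrange(ROUNDS):
            pass
        return time.time() - t0

    def test():
        t0 = time.time()
        for i in xrange(ROUNDS):
            # operation under test, repeated verbatim to improve the
            # payload-to-overhead ratio
            [].append
            [].append
            [].append
        return time.time() - t0

    overhead = min([calibrate() for n in range(20)])
    runs = [test() - overhead for n in range(10)]
    print 'minimum: %.3fs  average: %.3fs' % (
        min(runs), sum(runs) / len(runs))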
after all, if a change to e.g. the try/except code slows things down or speed things up, is it really reasonable to expect that the time it takes to convert Unicode strings to uppercase should suddenly change due to cache effects or a changing number of registers in the CPU? real hardware doesn't work that way...
Of course, but then changes to try-except logic can interfere with the performance of setting up method calls. This is what pybench then uncovers.
The only problem I see in the above approach is the way calibration is done. The run-time of the calibration code may be too small with respect to the resolution of the timers used.
Again, please provide the parameters you've used to run the test case and the output. Things like warp factor, overhead, etc. could hint to the problem you're seeing.
is PyBench perhaps using the following approach:
T = set of tests
for N in range(number of test runs):
    for t in T:
        t0 = get_process_time()
        t()
        t1 = get_process_time()
        assign t1 - t0 to test t
print assigned time
where t1 - t0 is very short?
See above (or the code in pybench.py). t1-t0 is usually around 20-50 seconds:
"""
The tests must set .rounds to a value high enough to let the test run between 20-50 seconds. This is needed because clock()-timing only gives rather inaccurate values (on Linux, for example, it is accurate to a few hundredths of a second). If you don't want to wait that long, use a warp factor larger than 1.
"""
First, I'm not sure that this is the case for the default test parameters on modern machines. On my current laptop, for example, I see a round time of roughly four seconds and a notional minimum round time of 3.663 seconds.

Secondly, while this recommendation may be very sensible, with 50 individual tests a decrease in the warp factor to 1 (the default is currently 20) isn't sufficient to increase individual test times to your recommended value, and decreasing the warp factor tends also to decrease reliability and repeatability.

Thirdly, since each round of the suite at warp factor 1 takes between 80 and 90 seconds, pybench run this way isn't something one can usefully use to quickly evaluate the impact of a single change - particularly since even continuing development work on the benchmark machine potentially affects the benchmark results in unknown ways.
that's not a very good idea, given how get_process_time tends to be implemented on current-era systems (google for "jiffies")... but it definitely explains the bogus subtest results I'm seeing, and the "magic hardware" behaviour you're seeing.
That's exactly the reason why tests run for a relatively long time - to minimize these effects. Of course, using wall time makes this approach vulnerable to other effects such as the current load of the system, other processes having a higher priority interfering with the timed process, etc.
For this reason, I'm currently looking for ways to measure the process time on Windows.
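For what it's worth, GetProcessTimes() is reachable from Python 2.5's new ctypes module. A Windows-only sketch; as noted earlier in the thread, its coarse (~15.6ms) update steps make it a questionable benchmark timer:

    import ctypes
    from ctypes import wintypes

    class FILETIME(ctypes.Structure):
        # 100-nanosecond intervals since January 1, 1601
        _fields_ = [('dwLowDateTime',  wintypes.DWORD),
                    ('dwHighDateTime', wintypes.DWORD)]

    def _seconds(ft):
        return ((ft.dwHighDateTime << 32) + ft.dwLowDateTime) * 1e-7

    def process_time():
        # user + kernel CPU time of the current process, in seconds
        kernel32 = ctypes.windll.kernel32
        creation, exit, kernel, user = (FILETIME(), FILETIME(),
                                        FILETIME(), FILETIME())
        kernel32.GetProcessTimes(kernel32.GetCurrentProcess(),
                                 ctypes.byref(creation), ctypes.byref(exit),
                                 ctypes.byref(kernel), ctypes.byref(user))
        return _seconds(kernel) + _seconds(user)

    print process_time()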
I wish you luck with this search, as we clearly do need to improve repeatability of pybench results across all platforms, and particularly on Windows.

regards
Steve

--
Steve Holden       +44 150 684 7255  +1 800 494 3119
Holden Web LLC/Ltd http://www.holdenweb.com
Love me, love my blog http://holdenweb.blogspot.com
Recent Ramblings http://del.icio.us/steve.holden
participants (16)
- "Martin v. Löwis"
- A.M. Kuchling
- Andrew Dalke
- Fredrik Lundh
- Georg Brandl
- Greg Ewing
- Josiah Carlson
- M.-A. Lemburg
- Michael Hudson
- Nick Coghlan
- Niko Matsakis
- skip@pobox.com
- Steve Holden
- Terry Reedy
- Thomas Wouters
- Tim Peters