Re: [Speed] Getting the project off the ground
What I would like is a way to have hg bisect automated for me. So I could indicate a fast build and a later build that is slow, and then say 'find me the first build in between where things slowed down'. I want to be able to ask for this, go to bed, and find out when I wake up the next day exactly where to go looking for bugs/slowdowns. I realise that this does not have to be done through the codespeed interface, and that in some cases it may be hard to tell the difference between a real slowdown, random noise, and the ever popular 'somebody was trying to compile pypy on tannit when the benchmarks were being run', though I suppose with the new machine we can work out a mechanism for preventing the last. But I still think this would be a neat feature for codespeed to have.
Laura
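The workflow Laura describes maps onto `hg bisect --command`, which runs a script at each candidate revision and uses its exit status to mark the revision good or bad. A minimal sketch of such a driver follows; `run_benchmark` is a stand-in for building this revision and timing the real benchmark, and the threshold is an illustrative value picked between the known-fast and known-slow timings (none of these names exist in any current tooling):

```python
"""Sketch of a bisection driver for `hg bisect --command`.

Illustrative usage (revision names are placeholders):
    hg bisect --reset
    hg bisect --good FAST_REV
    hg bisect --bad  SLOW_REV
    hg bisect --command 'python bisect_driver.py'
"""
import time

THRESHOLD = 2.5  # seconds; pick a value between the fast and slow timings


def run_benchmark():
    """Stand-in: here you would build this revision and time richards.py."""
    start = time.time()
    # ... build and run the real benchmark here ...
    return time.time() - start


def classify(elapsed, threshold=THRESHOLD):
    """Exit status for hg bisect: 0 means good (fast), 1 means bad (slow)."""
    return 0 if elapsed < threshold else 1

# hg bisect consumes the exit status:
#     raise SystemExit(classify(run_benchmark()))
```

Running this overnight is then just a matter of leaving `hg bisect --command` to iterate; noise handling (e.g. taking the best of several runs before classifying) would go inside the driver.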
cgroups and cpuset support could help with this: you can isolate one cpu purely for benchmarking and another for user use (translation and testing, for example). You can also avoid some issues such as cache contention (which fijal mentions every time I mention doing multiple translations at once) by splitting the cpus up so that each group gets 4 cpus belonging to one socket.
Stick all processes in one cgroup and limit the cpuset (usergroup), then create another cgroup and give it the remaining cores on the other socket (benchgroup). By default all processes will go to the parent cgroup, in this case usergroup, as you put all the PIDs in there. Then you can just put the benchmark script in a wrapper to grab its own pid and echo that pid into the benchgroup magic file to have the benchmarks run on the other cpu.
Using the above you can also reserve memory for benchmarking/translation.
If anyone wants more details, feel free to ask.
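The wrapper step described above, grabbing the script's own pid and echoing it into the benchgroup's magic file, can be sketched in a few lines of Python. The mount point is an assumption; the real path depends on where the cpuset hierarchy was mounted:

```python
import os


def move_self_to_cgroup(cgroup_dir):
    """Append this process's pid to the cgroup's 'tasks' file, so the
    scheduler confines it (and any children started afterwards) to that
    cgroup's cpuset."""
    with open(os.path.join(cgroup_dir, "tasks"), "a") as f:
        f.write("%d\n" % os.getpid())

# Illustrative: assumes the cpuset hierarchy is mounted at /cgroup and a
# 'benchgroup' was created there and given the second socket's cores.
# move_self_to_cgroup("/cgroup/benchgroup")
# ... then exec the benchmark so it inherits the placement ...
```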
On 06/07/11 20:47, Da_Blitz wrote:
cgroups and cpuset support could help with this: you can isolate one cpu purely for benchmarking and another for user use (translation and testing, for example).
that would be really cool. However, I'm not sure it's enough to prevent "random noise".
E.g., I know that modern CPUs have a mechanism that turns off the unused cores in order to speed up the active ones (by slightly increasing the frequency). If this is the case, then I fear that the only way to have a precise benchmark is to run only one process at a time.
ciao, Anto
On Wed, Jul 6, 2011 at 12:01 PM, Antonio Cuni <anto.cuni@gmail.com> wrote:
that would be really cool. However, I'm not sure it's enough to prevent "random noise".
A little random noise is unavoidable. Sometimes the kernel will preempt the thread, so the kernel can do kernel things. There's no practical way around that.
A little random noise is okay, though. Take several measurements and compute confidence intervals.
-- Daniel Stutzbach
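Daniel's "take several measurements and compute confidence intervals" can be sketched as below. It uses a plain normal approximation with z = 1.96 for a ~95% interval (a t-distribution would be slightly more careful for only a handful of samples), and the timing values are made up for illustration:

```python
import math


def mean_and_ci(samples, z=1.96):
    """Return (mean, half-width of an approximate 95% confidence interval)."""
    n = len(samples)
    mean = sum(samples) / float(n)
    var = sum((x - mean) ** 2 for x in samples) / (n - 1)  # sample variance
    return mean, z * math.sqrt(var / n)  # z times the standard error of the mean

# Ten noisy timings (made-up numbers):
timings = [1.93, 1.95, 1.92, 1.96, 1.94, 1.93, 1.97, 1.94, 1.95, 1.93]
mean, half = mean_and_ci(timings)
print("%.3f +/- %.3f s" % (mean, half))
```

Two runs whose intervals do not overlap are then a much stronger signal of a real slowdown than a difference between two single numbers.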
On Wed, Jul 6, 2011 at 3:01 PM, Antonio Cuni <anto.cuni@gmail.com> wrote:
On 06/07/11 20:47, Da_Blitz wrote:
cgroups and cpuset support could help with this: you can isolate one cpu purely for benchmarking and another for user use (translation and testing, for example).
that would be really cool. However, I'm not sure it's enough to prevent "random noise".
E.g., I know that modern CPUs have a mechanism that turns off the unused cores in order to speed up the active ones (by slightly increasing the frequency). If this is the case, then I fear that the only way to have a precise benchmark is to run only one process at a time.
sudo -s "echo performance > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor"
Geremy Condra
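A note on the quoting in that one-liner: a bare `sudo echo performance > .../scaling_governor` would fail, because the redirect is performed by the caller's unprivileged shell; passing the whole pipeline as one string (via `sudo -s "…"` or `sudo sh -c '…'`) keeps the redirect inside the elevated shell. A sketch, with a temporary file standing in for the sysfs path so it runs unprivileged:

```shell
# The real target would be /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor;
# /tmp/scaling_governor stands in for it here so the sketch needs no root.
sh -c 'echo performance > /tmp/scaling_governor'
cat /tmp/scaling_governor
```

On a multi-core machine the same write has to be repeated for each `cpuN` directory, since every core has its own governor file.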
On Wed, Jul 06, 2011 at 09:01:20PM +0200, Antonio Cuni wrote:
that would be really cool. However, I'm not sure it's enough to prevent "random noise".
E.g., I know that modern CPUs have a mechanism that turns off the unused cores in order to speed up the active ones (by slightly increasing the frequency). If this is the case, then I fear that the only way to have a precise benchmark is to run only one process at a time.
Looks like a X5680 cpu, so yes, it has the turbo boost feature (http://en.wikipedia.org/wiki/Nehalem_(microarchitecture)#Server_.2F_Desktop_...). I am running an i7 for building pypy and have the same turbo boost feature, and in practice I have not found it to be an issue as long as you are only running one translation at a time; as the workload is single threaded it ramps up nicely. A lock or two should prevent the turbo boost enabling/disabling erratically, but it is also under kernel control.
I haven't investigated how much control the kernel has over it, but I assume that if you switch the cpu speed governor from performance over to userspace and manually set the frequency, that should not be much of an issue.
Turbo mode is socket specific, so the isolation I talked about in my last post would prevent compiles from affecting the cpu frequency on the benchmark cpus.
As a side note, this may be a good time for me to push for making the benchmark scores user time + kernel time instead of wall clock time.
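Measuring user + kernel time instead of wall-clock time is easy to do from Python via `os.times()`: time the process spends descheduled (or sleeping) then stops counting against the benchmark. A sketch (the attribute access assumes Python 3, where `os.times()` returns a named tuple):

```python
import os
import time


def cpu_and_wall(fn):
    """Run fn() and return (user+system CPU seconds, wall-clock seconds)."""
    t0, w0 = os.times(), time.time()
    fn()
    t1, w1 = os.times(), time.time()
    cpu = (t1.user - t0.user) + (t1.system - t0.system)
    return cpu, w1 - w0


def busy():
    """A small CPU-bound workload for illustration."""
    total = 0
    for i in range(10 ** 6):
        total += i
    return total

# A sleeping workload burns wall-clock time but almost no CPU time,
# which is exactly the difference the side note is about:
cpu, wall = cpu_and_wall(lambda: time.sleep(0.2))
```

The counter-argument, of course, is that for a JIT-compiled runtime the user experiences wall-clock time, so which measure is "right" depends on what the benchmark is meant to show.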
On 06/07/11 21:11, Da_Blitz wrote:
Looks like a X5680 cpu, so yes, it has the turbo boost feature (http://en.wikipedia.org/wiki/Nehalem_(microarchitecture)#Server_.2F_Desktop_...). I am running an i7 for building pypy and have the same turbo boost feature, and in practice I have not found it to be an issue as long as you are only running one translation at a time; as the workload is single threaded it ramps up nicely. A lock or two should prevent the turbo boost enabling/disabling erratically, but it is also under kernel control.
I haven't investigated how much control the kernel has over it, but I assume that if you switch the cpu speed governor from performance over to userspace and manually set the frequency, that should not be much of an issue.
I did some benchmarks, trying to understand how much the turbo boost and/or scaling governors affect the performance and, most importantly, if/how much they affect the standard deviation. Since we are talking about benchmarks, the smaller the standard deviation, the better.
I ran the benchmark on Linux on an Intel i7 920 CPU, which has 4 physical cores (8 logical ones with hyperthreading, but we do not want to use those).
The benchmark consisted of running richards.py (one of the benchmarks we use in PyPy) using 1, 2, 3 or 4 cores at the same time. I used "taskset" to set the cpu affinity to a specific core.
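At the time, `taskset -c 0 python richards.py` from the shell was the way to pin a run; more recent Pythons (3.3+, Linux-only) expose the same affinity call directly, so a harness can pin itself. A sketch:

```python
import os


def pin_to_core(core):
    """Restrict the current process to one CPU core (Linux-only, Python 3.3+)."""
    os.sched_setaffinity(0, {core})  # pid 0 means the calling process
    return os.sched_getaffinity(0)

# Equivalent to launching the benchmark under `taskset -c 0 ...`:
allowed = pin_to_core(0)
```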
For each number of cores, I ran the benchmark 10 times in a row, and then measured the average time spent and the standard deviation (so, with 4 cores I had a total of 40 runs).
If the turbo boost theory is true, we expect the benchmarks to be slower when we run 4 in parallel.
Here are the results:
1 core:  AVG: 1.939 seconds  STDEV: 0.016 seconds
2 cores: AVG: 2.020 seconds  STDEV: 0.013 seconds
3 cores: AVG: 2.022 seconds  STDEV: 0.016 seconds
4 cores: AVG: 2.033 seconds  STDEV: 0.023 seconds
We can see that with 4 cores performance drops a bit, but not much (about 5% between 1 core and 4 cores).
This is using the "ondemand" governor, which is the default on my system.
I also tried to run it with "performance", which in theory should give better performance and a smaller stdev, but in practice it does not (I don't know why, honestly):
1 core, performance governor: AVG: 1.961 seconds STDEV: 0.027 seconds
I also tried to manually set the CPU to the lowest possible frequency, but got even worse results:
1 core, slowest frequency: AVG: 3.532 seconds STDEV: 0.042 seconds
turbo mode is socket specific so the isolation i talked about in my last post would prevent compiles' from affecting the cpu freqency on the benchmark cpus
Although the turbo boost seems not to affect performance much, I agree with Da_Blitz that it makes sense to reserve 1 socket (i.e. 6 cores) for benchmarks. Then we can use the other 6 cores for general usage (e.g., running tests or translations). But before doing this we should check that using one socket does not actually affect the performance of the other. As usual, I don't trust the theory too much :-).
ciao, Anto
On Thu, Jul 7, 2011 at 7:56 AM, Antonio Cuni <anto.cuni@gmail.com> wrote:
This is using the "ondemand" governor, which is the default on my system.
I also tried to run it with "performance", which in theory should give better performance and a smaller stdev, but in practice it does not (I don't know why, honestly):
For what it's worth, I ran some similar experiments last year and found that "performance" yielded much lower standard deviations for certain performance tests, compared to "ondemand". I no longer recall which ones, unfortunately, but in a nutshell "ondemand" is just as good only when it's smart enough to detect that it should be using the maximum CPU frequency.
Sometimes it doesn't, and the standard deviation goes up a few orders of magnitude.
-- Daniel Stutzbach
participants (5)
- Antonio Cuni
- Da_Blitz
- Daniel Stutzbach
- geremy condra
- Laura Creighton