Profiling (was GSoC : Performance parity between numpy arrays and Python scalars)
* Re: the profiling, I wrote a full oprofile->callgrind format script years ago: http://vorpus.org/~njs/op2calltree.py Haven't used it in years either but neither oprofile nor kcachegrind are terribly fast-moving projects so it's probably still working, or could be made so without much work. Or easier is to use the gperftools CPU profiler: https://gperftools.googlecode.com/svn/trunk/doc/cpuprofile.html
I don't have experience with gperftools, but on recent Linux kernels you can also use perf, which couldn't be easier to use (no runtime support needed), though you do need a 'recent' kernel: http://indico.cern.ch/getFile.py/access?contribId=20&sessionId=4&resId=0&materialId=slides&confId=141309

I am hoping to talk a bit about these for our diving-into-numpy-C-code tutorial in June. What's the advantage of gperf in your opinion?

David
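For the gperftools CPU profiler mentioned above, a minimal run looks roughly like this (the library path and file names are illustrative assumptions, not taken from the thread):

    LD_PRELOAD=/usr/lib/libprofiler.so CPUPROFILE=ls.prof /bin/ls
    pprof --text /bin/ls ls.prof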
On Thu, May 2, 2013 at 9:25 AM, David Cournapeau <cournape@gmail.com> wrote:
* Re: the profiling, I wrote a full oprofile->callgrind format script years ago: http://vorpus.org/~njs/op2calltree.py Haven't used it in years either but neither oprofile nor kcachegrind are terribly fast-moving projects so it's probably still working, or could be made so without much work. Or easier is to use the gperftools CPU profiler: https://gperftools.googlecode.com/svn/trunk/doc/cpuprofile.html
I don't have experience with gperftools, but on recent Linux kernels you can also use perf, which couldn't be easier to use (no runtime support needed), though you do need a 'recent' kernel.

I am hoping to talk a bit about these for our diving-into-numpy-C-code tutorial in June. What's the advantage of gperf in your opinion?
For what I've used profiling for, THE key feature is proper callgraph support ("show me the *total* time spent in each function, including callees"). Otherwise, silly example, let's say you have a bug where you wrote:

    func1() {
        for (i = 0; i < 10000000; i++)
            foo = add(foo, bar[0])
    }

Obviously this is a waste of time, since you're actually performing the same operation over and over. Many profilers, given this, will tell you that all the time is spent in 'add', which is useless, because you don't want to speed up 'add', you want to speed up 'func1' (probably by not calling 'add' so many times!). If you have relatively flat code like most kernel code this isn't an issue, but I generally don't...

perf is a fabulous framework but doesn't have any way to get full callgraph information out, so IME it's been useless. They have reporting modes that claim to (like some "fractal" thing?), but as far as I've been able to tell from docs/googling/mailing lists, nobody understands how to interpret this output except the people who wrote it. Really a shame that it falls down in the last mile like that; hopefully they will fix this soon.

callgrind has the *fabulous* kcachegrind front-end, but it only measures memory access performance on a simulated machine, which is very useful sometimes (if you're trying to optimize cache locality), but there's no guarantee that the bottlenecks on its simulated machine are the same as the bottlenecks on your real machine.

oprofile is getting long in the tooth (superseded by perf), and its built-in reporting tools are merely OK, but it does have full callgraph information, and with the script above you can get the output into kcachegrind.

gperftools doesn't have all the fancy features of the in-kernel options, but it's trivial to use, and its reporting options are genuinely useful (though not quite as awesome as kcachegrind). So while in theory it's the least whizz-bang awesome of all of these options, in practice I find it the most useful.

(Also, beware of terminology collision: "gperf" is something else again...)

-n
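Since oprofile output can be routed into kcachegrind via the script above, gperftools profiles can reach kcachegrind too. A rough sketch, assuming the installed pprof supports its callgrind output mode (program and file names here are placeholders):

    gcc -g -O2 myprog.c -lprofiler -o myprog
    CPUPROFILE=myprog.prof ./myprog
    pprof --callgrind ./myprog myprog.prof > myprog.callgrind
    kcachegrind myprog.callgrind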
On 5/2/13 3:58 PM, Nathaniel Smith wrote:
callgrind has the *fabulous* kcachegrind front-end, but it only measures memory access performance on a simulated machine, which is very useful sometimes (if you're trying to optimize cache locality), but there's no guarantee that the bottlenecks on its simulated machine are the same as the bottlenecks on your real machine.
Agreed, there is no guarantee, but my experience is that kcachegrind normally gives you a pretty decent view of cache misses, and hence it can make pretty good predictions about how they affect your computations. I have used this feature extensively for optimizing parts of the Blosc compressor, and I could not be happier with it (to the point that, if it were not for Valgrind, I could not have figured out many interesting memory access optimizations).

-- Francesc Alted
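For concreteness, the kind of callgrind run being described is roughly the following (the benchmark binary name is a placeholder; --cache-sim=yes turns on the cache simulation):

    valgrind --tool=callgrind --cache-sim=yes ./blosc_bench
    kcachegrind callgrind.out.<pid>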
On Thu, May 2, 2013 at 10:51 AM, Francesc Alted <francesc@continuum.io> wrote:
On 5/2/13 3:58 PM, Nathaniel Smith wrote:
callgrind has the *fabulous* kcachegrind front-end, but it only measures memory access performance on a simulated machine, which is very useful sometimes (if you're trying to optimize cache locality), but there's no guarantee that the bottlenecks on its simulated machine are the same as the bottlenecks on your real machine.
Agreed, there is no guarantee, but my experience is that kcachegrind normally gives you a pretty decent view of cache misses, and hence it can make pretty good predictions about how they affect your computations. I have used this feature extensively for optimizing parts of the Blosc compressor, and I could not be happier with it (to the point that, if it were not for Valgrind, I could not have figured out many interesting memory access optimizations).
Right -- if you have code where you know that memory is the bottleneck (so especially integer-heavy code), then callgrind is perfect. In fact it was originally written to make it easier to optimize the bzip2 compressor :-). My point isn't that it's not useful, just that it's a little more of a specialist tool, so I hesitate to recommend it as the first profiler for people to reach for.

An extreme example: last time I played with this, I found that for a numpy scalar float64 * float64 multiplication, 50% of the total time was spent fiddling with floating point control registers. But that time would be invisible to callgrind's measurements...

-n
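As a rough illustration of the scalar overhead this thread is about, a comparison like the one below can be timed; the script is only a sketch and its numbers are machine-dependent, not results reported in the thread:

    # Compare a plain Python float multiply with a numpy scalar float64 multiply;
    # the gap between the two is the per-operation overhead being profiled here.
    import timeit

    py_time = timeit.timeit('x * y', setup='x = 1.0; y = 2.0', number=1000000)
    np_time = timeit.timeit(
        'x * y',
        setup='import numpy as np; x = np.float64(1.0); y = np.float64(2.0)',
        number=1000000)
    print('python float: ', py_time)
    print('numpy float64:', np_time)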
On Thu, May 2, 2013 at 2:58 PM, Nathaniel Smith <njs@pobox.com> wrote:
On Thu, May 2, 2013 at 9:25 AM, David Cournapeau <cournape@gmail.com> wrote:
* Re: the profiling, I wrote a full oprofile->callgrind format script years ago: http://vorpus.org/~njs/op2calltree.py Haven't used it in years either but neither oprofile nor kcachegrind are terribly fast-moving projects so it's probably still working, or could be made so without much work. Or easier is to use the gperftools CPU profiler: https://gperftools.googlecode.com/svn/trunk/doc/cpuprofile.html
I don't have experience with gperftools, but on recent Linux kernels you can also use perf, which couldn't be easier to use (no runtime support needed), though you do need a 'recent' kernel.

I am hoping to talk a bit about these for our diving-into-numpy-C-code tutorial in June. What's the advantage of gperf in your opinion?
For what I've used profiling for, THE key feature is proper callgraph support ("show me the *total* time spent in each function, including callees"). Otherwise, silly example, let's say you have a bug where you wrote:
    func1() {
        for (i = 0; i < 10000000; i++)
            foo = add(foo, bar[0])
    }
Obviously this is a waste of time, since you're actually performing the same operation over and over. Many profilers, given this, will tell you that all the time is spent in 'add', which is useless, because you don't want to speed up 'add', you want to speed up 'func1' (probably by not calling 'add' so many times!). If you have relatively flat code like most kernel code this isn't an issue, but I generally don't...
perf is a fabulous framework but doesn't have any way to get full callgraph information out, so IME it's been useless. They have reporting modes that claim to (like some "fractal" thing?), but as far as I've been able to tell from docs/googling/mailing lists, nobody understands how to interpret this output except the people who wrote it. Really a shame that it falls down in the last mile like that; hopefully they will fix this soon.
Perf doc is written for Vulcan, but it does what I think you want. Say:

    void work(int n) {
        volatile int i = 0; /* don't optimize away */
        while (i++ < n);
    }
    void easy() { work(1000 * 1000 * 50); }
    void hard() { work(1000 * 1000 * 1000); }
    int main() {
        easy();
        hard();
    }

compile with gcc -g -O0, and then:

    perf record -g -a -- ./a.out
    perf report -g -a --stdio

gives me

    95.22%  a.out  a.out  [.] work
            |
            --- work
               |
               |--89.84%-- hard
               |          main
               |          __libc_start_main
               |
                --5.38%-- easy
                          main
                          __libc_start_main

or maybe even better with the -G option:

    95.22%  a.out  a.out  [.] work
            |
            --- __libc_start_main
                main
               |
               |--94.35%-- hard
               |          work
               |
                --5.65%-- easy
                          work

I agree it would be better with a kcachegrind-like GUI, but it looks like that is coming. I don't really understand the ncurses front-end to the call graph.

(Example taken/adapted from the great Yosefk blog post: http://www.yosefk.com/blog/how-profilers-lie-the-cases-of-gprof-and-kcachegr...)

cheers,
David
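The same recipe carries over to the numpy benchmark discussed in this thread; something like the following is a sketch (it assumes a numpy build that keeps frame pointers and debug info, which comes up later in the thread):

    perf record -g -- python -m timeit -n 1000000 -s 'import numpy as np; x = np.asarray(1.0)' 'x+x'
    perf report -g --stdio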
On Sun, May 5, 2013 at 5:57 PM, David Cournapeau <cournape@gmail.com> wrote:
perf is a fabulous framework but doesn't have any way to get full callgraph information out, so IME it's been useless. They have reporting modes that claim to (like some "fractal" thing?), but as far as I've been able to tell from docs/googling/mailing lists, nobody understands how to interpret this output except the people who wrote it. Really a shame that it falls down in the last mile like that; hopefully they will fix this soon.
Perf doc is written for Vulcan, but it does what I think you want, say:
    void work(int n) {
        volatile int i = 0; /* don't optimize away */
        while (i++ < n);
    }
    void easy() { work(1000 * 1000 * 50); }
    void hard() { work(1000 * 1000 * 1000); }
    int main() {
        easy();
        hard();
    }
compile with gcc -g -O0, and then:
    perf record -g -a -- ./a.out
    perf report -g -a --stdio
gives me
    95.22%  a.out  a.out  [.] work
            |
            --- work
               |
               |--89.84%-- hard
               |          main
               |          __libc_start_main
               |
                --5.38%-- easy
                          main
                          __libc_start_main
or maybe even better with the -G option
    95.22%  a.out  a.out  [.] work
            |
            --- __libc_start_main
                main
               |
               |--94.35%-- hard
               |          work
               |
                --5.65%-- easy
                          work
Yeah I've seen these displays before and I can see the information is there, and (knowing the code you ran) that somehow the first number has to do with the time spent under 'hard' and the second to do with time spent under 'easy', but I have no idea how to generalize this to arbitrary samples of these output formats. That's what I meant. -n
I did profiling for

    $ python -m timeit -n 1000000 -s 'import numpy as np;x = np.asarray(1.0)' 'x+x'

with oprofile, and used gprof2dot.py to create a call graph, but I got a graph [1] which doesn't convey any meaning. I tried to use pprof, but I cannot find a profile file to use, like ls.prof in "pprof /bin/ls ls.prof". Has anyone used pprof or any other profiler for generating call graphs?

[1] https://docs.google.com/file/d/0B3Pqyp8kuQw0V042bmUwNHktZHM/edit
-- Arink Verma Indian Institute of Technology www.arinkverma.in
On Sat, Jun 8, 2013 at 3:15 AM, Arink Verma <arinkverma@gmail.com> wrote:
I tried to use pprof, but I can not find profiles to be used. like ls.prof in "pprof /bin/ls ls.prof"
You're looking for the ProfilerStart/ProfilerStop functions; the former takes a filename to write the profile to (like "ls.prof" or "x-plus-x.prof"): http://www.mail-archive.com/numpy-discussion@scipy.org/msg41451.html

-n
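As a minimal sketch of what that recipe amounts to, libprofiler can be driven from Python via ctypes; the library name and the benchmark loop below are illustrative, see the linked post for the original instructions:

    import ctypes
    import numpy as np

    # gperftools' CPU profiler; the soname may differ (e.g. libprofiler.so.0)
    profiler = ctypes.CDLL('libprofiler.so')

    x = np.asarray(1.0)

    profiler.ProfilerStart(b'x-plus-x.prof')  # start writing samples to this file
    for _ in range(1000000):
        x + x
    profiler.ProfilerStop()                   # flush and close the profile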
You're looking for the ProfilerStart/ProfilerStop functions, the former takes a filename to write the profiler to (like "ls.prof" or "x-plus-x.prof"): http://www.mail-archive.com/numpy-discussion@scipy.org/msg41451.html
I followed that and was able to get a call graph [1], but it doesn't contain all the edges.

[1] https://docs.google.com/file/d/0B3Pqyp8kuQw0eVM4ajVYLWtfZVU/edit?usp=sharing

Arink Verma www.arinkverma.in
On 14 Jun 2013 09:18, "Arink Verma" <arinkverma@gmail.com> wrote:
You're looking for the ProfilerStart/ProfilerStop functions, the former takes a filename to write the profiler to (like "ls.prof" or "x-plus-x.prof"): http://www.mail-archive.com/numpy-discussion@scipy.org/msg41451.html
I followed that and able to get a callgraph[1], but it doesnt contains
all the edges.
[1]
https://docs.google.com/file/d/0B3Pqyp8kuQw0eVM4ajVYLWtfZVU/edit?usp=sharing

It's a probabilistic sampling profiler, so if it doesn't have enough samples then it can miss things. 227 samples is way way too low. You need to run the profiled code for longer (a few seconds at least), and if that's not enough then maybe increase the sampling rate too (though be careful because setting this too high can also add noise).

-n
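Concretely, the run length is controlled by the benchmark itself, and the sampling rate by gperftools' CPUPROFILE_FREQUENCY environment variable (samples per second, 100 by default). A sketch, with a placeholder script that calls ProfilerStart/ProfilerStop itself:

    CPUPROFILE_FREQUENCY=1000 python bench.py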
Hi Nathaniel,

It's a probabilistic sampling profiler, so if it doesn't have enough
samples then it can miss things. 227 samples is way way too low. You need to run the profiled code for longer (a few seconds at least), and if that's not enough then maybe increase the sampling rate too (though be careful because setting this too high can also add noise).
I ran the code '1000000000' times, which recorded 229115 samples. The call graph [1] that gets generated converges on *PyArray_DESCR*, and the rest of the nodes are unconnected. Does that mean anything?

[1] https://docs.google.com/file/d/0B3Pqyp8kuQw0MzRoTTNVbmlaNFE/edit?usp=sharing

-- Arink Verma www.arinkverma.in
On Mon, Jun 17, 2013 at 9:29 AM, Arink Verma <arinkverma@gmail.com> wrote:
Hi Nathaniel
It's a probabilistic sampling profiler, so if it doesn't have enough
samples then it can miss things. 227 samples is way way too low. You need to run the profiled code for longer (a few seconds at least), and if that's not enough then maybe increase the sampling rate too (though be careful because setting this too high can also add noise).
I ran the code '1000000000' times, which recorded 229115 samples. The call graph [1] generated converges on *PyArray_DESCR*, and the rest are unconnected.
Not sure what you are profiling. The PyArray_DESCR call just returns a pointer to the descr contained in an ndarray instance, so probably has little relevance here. Chuck
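For context, PyArray_DESCR is essentially just a struct field accessor. A paraphrased sketch of roughly how it appears in the numpy 1.7-era public headers (not copied verbatim):

    /* Trivial accessor: returns the dtype (descr) pointer stored in the array
       struct. The compiler inlines it, so it never shows up as a real call. */
    static NPY_INLINE PyArray_Descr *
    PyArray_DESCR(const PyArrayObject *arr)
    {
        return ((PyArrayObject_fields *)arr)->descr;
    }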
Not sure what you are profiling. The PyArray_DESCR call just returns a pointer to the descr contained in an ndarray instance, so probably has little relevance here.
I am profiling the following code:

    timeit.timeit('x+y', number=1000000000, setup='import numpy as np; x = np.asarray(1.0); y = np.asarray(2.0)')

Arink Verma www.arinkverma.in
On Mon, Jun 17, 2013 at 4:29 PM, Arink Verma <arinkverma@gmail.com> wrote:
Hi Nathaniel
It's a probabilistic sampling profiler, so if it doesn't have enough samples then it can miss things. 227 samples is way way too low. You need to run the profiled code for longer (a few seconds at least), and if that's not enough then maybe increase the sampling rate too (though be careful because setting this too high can also add noise).
I ran the code '1000000000' times, which recorded 229115 samples. The call graph [1] generated converges on PyArray_DESCR, and the rest are unconnected.
Does it mean anything? [1]https://docs.google.com/file/d/0B3Pqyp8kuQw0MzRoTTNVbmlaNFE/edit?usp=sharing
I think it means that pprof is failing to walk the stack to compute callers. That's consistent with PyArray_DESCR being the only "call" that it can find, because PyArray_DESCR isn't a real function (it always gets inlined), so detecting its caller doesn't require walking the stack.

Is your numpy compiled with -fomit-frame-pointer or something like that? Any other weird build options used while building it? Is your binary stripped? If you're using a distro version, do you have the debug symbols installed? Did you compile this numpy yourself? (If not, the simplest thing to do would be to just build it yourself using the default settings and see if that helps.) What OS are you on?

When I run your profiling code (but being lazier and only collecting 12000 samples), and then do 'google-pprof --pdf /path/to/python /path/to/my/profile.prof', I get the graph attached below.

-n
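A couple of quick command-line checks for the build questions above (the Python one-liner is just one convenient way to locate the extension module; adjust paths as needed):

    SO=$(python -c "import numpy.core.multiarray as m; print(m.__file__)")
    file "$SO"                      # should report "not stripped"
    objdump -h "$SO" | grep debug   # .debug_* sections mean debug info is present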
I am building numpy from source,

    python setup.py build --fcompiler=gnu95

then installing it,

    python setup.py install --user

on Ubuntu 13.04. For analysing the results:

    pprof --svg /usr/bin/python py.prof
-- Arink Verma www.arinkverma.in
On Mon, Jun 17, 2013 at 4:06 PM, Arink Verma <arinkverma@gmail.com> wrote:
I am building numpy from source, python setup.py build --fcompiler=gnu95 then installation, python setup.py install --user, on ubuntu 13.04
for analysis results pprof --svg /usr/bin/python py.prof
You can try using bento, with something like:

    CFLAGS="-O0 -fno-omit-frame-pointer -g" bentomaker build -i -v -j4

There is a bit of overhead to set it up, though.

David
On 18 Jun 2013 12:40, "David Cournapeau" <cournape@gmail.com> wrote:
On Mon, Jun 17, 2013 at 4:06 PM, Arink Verma <arinkverma@gmail.com> wrote:
I am building numpy from source, python setup.py build --fcompiler=gnu95 then installation, python setup.py install --user, on ubuntu 13.04
for analysis results pprof --svg /usr/bin/python py.prof
You can try using bento, with something like:
CFLAGS="-O0 -fno-omit-frame-pointer -g" bentomaker build -i -v -j4
A good thing to try (and unfortunately I don't have time right now to try and work out why my results are different than yours). Remember though that once you get it working you'll want to re-enable optimization (-O2 or -O3 instead of -O0) before you use the profile results for anything serious. -n
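In other words, the final profiling build could be David's bento command with only the optimization level changed (same setup assumptions as above):

    CFLAGS="-O2 -fno-omit-frame-pointer -g" bentomaker build -i -v -j4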
On Tue, Jun 18, 2013 at 7:26 AM, Nathaniel Smith <njs@pobox.com> wrote:
On 18 Jun 2013 12:40, "David Cournapeau" <cournape@gmail.com> wrote:
On Mon, Jun 17, 2013 at 4:06 PM, Arink Verma <arinkverma@gmail.com>
wrote:
I am building numpy from source, python setup.py build --fcompiler=gnu95 then installation, python setup.py install --user, on ubuntu 13.04
for analysis results pprof --svg /usr/bin/python py.prof
You can try using bento, with something like:
CFLAGS="-O0 -fno-omit-frame-pointer -g" bentomaker build -i -v -j4
A good thing to try (and unfortunately I don't have time right now to try and work out why my results are different than yours). Remember though that once you get it working you'll want to re-enable optimization (-O2 or -O3 instead of -O0) before you use the profile results for anything serious.
With the above compilation options, I get a flame graph that shows patterns similar to what you get as well: https://s3.amazonaws.com/scipy-2013/diving_into_numpy/perf-numpy.svg
Participants (5):
- Arink Verma
- Charles R Harris
- David Cournapeau
- Francesc Alted
- Nathaniel Smith