Following up on the discussion of small-array performance, I decided to
profile Numeric and numarray. I've been playing with Pyrex, implementing a
vector class (for doubles) that runs as fast as I can make it, and it's
beating Numeric and numarray by a good bit, so I figured those two were doing
something extra. I can't say much about numarray right now (most of the
runtime is in complicated Python code, with weird callbacks to and from C
code), but for Numeric it was easy to find a problem.

First, profiling Python extensions is not easy. I've played with using the
-pg flag to the GCC C compiler, but I couldn't find a way to profile the
extension itself (even with Python built with -pg). So I see about three ways
to profile an extension:

1) Wrap each function in the extension in a pair of calls to something that
   keeps track of time in the function. This is what the numarray.teacup
   module does. This is unsatisfying, intrusive, and labour-intensive, and it
   does manually what -pg does automatically.

2) Compile the extension into the Python executable (where both the extension
   and Python have been compiled with the -pg flag). Unfortunately, as far as
   I can tell, this is not possible with distutils. If you have another build
   framework, however, it's not that hard to do. I've had some success with
   this approach with other extensions.

3) Use oprofile (http://oprofile.sourceforge.net/), which runs on Linux on an
   x86 processor. This is the approach I've used here. oprofile is a
   combination of a kernel module for Linux, a daemon for collecting sample
   data, and several tools to analyse the samples. It periodically polls the
   processor performance counters and records which code is running. It's a
   system-level profiler: it profiles _everything_ that's running on the
   system. One obstacle is that it does require root access.

Short tutorial on using oprofile
--------------------------------

Using oprofile on Debian with a 2.6.x kernel is easy (replace sudo with your
favourite "be-root" method):

$ sudo apt-get install oprofile
$ sudo modprobe oprofile     # make sure the kernel module is installed

Now, start the oprofile daemon. On my machine, --event=default just looks at
retired CPU instructions. Read the oprofile documentation for more info.

$ sudo opcontrol --event=default --start
$ (run code)
$ opcontrol --dump           # dump the statistics to disk
                             # this is the only thing a non-root user can do
$ sudo opcontrol --stop      # we don't need the daemon anymore

To do another profile run, you need to reset the data:

$ sudo opcontrol --reset

You should be able to do the above while the daemon is running, but the
daemon crashes on me when I do that; I find I end up having to also clear the
old statistics manually:

$ sudo rm -rf /var/lib/oprofile/samples

Once you've collected samples, you can analyse the results. Here, I'll be
looking at adding two 1000-element arrays with the following code:

import Numeric as NA
a = NA.arange(1000.0)
b = NA.arange(1000.0)
for i in xrange(10000000):
    a + b

This takes 1m14s on my machine (an AMD64 3200+ running in 64-bit mode). So,
where I have (run code) above, I'd do

$ python x.py

Once I've collected the samples, I can analyse them. Note that samples are
collected on a per-application basis; if you've got other processes using
python, they'll be included. You could copy the python binary to another
location and use that for the analysis; then your program would be the only
one picked up by the following analysis.
$ opstack -t 1 /usr/bin/python

  self     %        child    %        image name       symbol name
  132281   10.5031  0        0        python           (no symbols)
-------------------------------------------------------------------------------
  704810   55.9618  0        0        _numpy.so        check_array
-------------------------------------------------------------------------------
  309384   24.5650  0        0        umath.so         DOUBLE_add
-------------------------------------------------------------------------------
  112974    8.9701  0        0        libc-2.3.2.so    (no symbols)
-------------------------------------------------------------------------------

The -t 1 limits the display to those routines taking more than 1% of the
runtime. 10% for python and 10% for the C library probably aren't so bad (I'm
thinking that's calls to malloc() and friends). The big problem, however, is
that only 25% of the time is spent actually doing useful work. What's
check_array doing? We can delve deeper:

$ mkdir profile-Numeric
$ opannotate --source -o profile-Numeric \
      --search-dir=<path to Numeric source> /usr/bin/python

Now profile-Numeric/<path to Numeric source>/Src has annotated copies of the
source for the Numeric extension modules. The definition of check_array is in
ufuncobject.c, which gives us

   386  0.0286 :void check_array(PyArrayObject *ap) { /* check_array total: 7046 */
               :    double *data;
               :    int i, n;
               :
   371  0.0275 :    if (ap->descr->type_num == PyArray_DOUBLE || ap->descr->type_num == PyArray_CDOUBLE) {
    89  0.0066 :        data = (double *)ap->data;
   758  0.0563 :        n = PyArray_Size((PyObject *)ap);
    46  0.0034 :        if (ap->descr->type_num == PyArray_CDOUBLE) n *= 2;
               :
700662 51.9988 :        for(i=0; i<n; i++) CHECK(data[i]);
               :    }
               :}

Analysing the results
---------------------

So 52% of the runtime is spent in CHECK(). The definition of CHECK is
(assuming HAVE_FINITE is not defined, which it usually isn't)

#define CHECK(x) if (errno == 0 && !(-HUGE_VAL <= (x) && (x) <= HUGE_VAL)) errno = ERANGE

Notwithstanding the obvious optimizations (hoisting the errno==0 check out of
the loop and break'ing if the condition fails), the question is, why? All
check_array is doing is checking that the elements of the array are finite.
Digging deeper, we find comments like this before calls to it: "Cleanup the
returned matrices so that scalars will be returned as python scalars" (it
doesn't do that). check_array() is only called when the ufunc has been
defined (through PyUFunc_FromFuncAndData) to check the return value -- which,
for all ufuncs in Numeric, it is.

And it doesn't even work consistently. The intention of the above code is to
raise OverflowError (which will be done if errno != 0) if the check for
finiteness fails. But:
>>> import Numeric
>>> a = Numeric.array([1e308])
>>> a + a
array([ inf])
It will catch NaN's though. It's obvious when you realize that HUGE_VAL is inf; inf <= inf is true. With HAVE_FINITE defined, I get, for the same a array,
>>> a + a
<traceback>
OverflowError: math range error
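The comparison at the heart of CHECK() is easy to reproduce at the Python
prompt. This is a sketch only; it assumes an IEEE 754 platform where
overflowing a Python float quietly yields inf rather than raising:

inf = 1e308 * 10                      # overflows to inf on IEEE 754 platforms
nan = inf - inf                       # inf - inf is NaN
HUGE_VAL = inf                        # HUGE_VAL is inf on such platforms
print -HUGE_VAL <= inf <= HUGE_VAL    # True:  inf passes CHECK unnoticed
print -HUGE_VAL <= nan <= HUGE_VAL    # False: errno gets set -> OverflowError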
The Numeric documentation has this to say about the check_return parameter to
PyUFunc_FromFuncAndData (which determines whether check_array is called):

    Usually best to set to 1. If this is non-zero then returned matrices will
    be cleaned up so that rank-0 arrays will be returned as python scalars.
    Also, if non-zero, then any math error that sets the errno global variable
    will cause an appropriate Python exception to be raised.

Note that the rank-0 array -> scalar conversion happens regardless;
check_return doesn't affect this at all.

Removing check_array
--------------------

Commenting out the body of check_array in ufuncobject.c speeds up the script
above by *two* times. On my iBook (a G4 800), it speeds it up by *four* times.

Using timeit.py:

$ python /usr/lib/python2.3/timeit.py \
    -s 'import Numeric as NA; N=1e4; a=NA.arange(N)/N; b=NA.arange(N)/N' \
    'a+b'

I get for various values of N:

  N     Stock     Numeric       numarray   numarray
        Numeric   without       recent     1.1.1
        23.7      check_array   CVS
  1e1   1.13      1.08          10.5       9.9
  1e2   1.73      1.35          10.8       10.6
  1e3   6.91      3.2           13.3       12.9
  1e4   83.3      42.5          52.8       52.3
  1e5   4890      4420          4520       4510
  1e6   52700     47400         47100      47000
  1e7   532000    473000        476000     474000

Numeric is as fast as numarray now! :-) The 10x change in per-element speed
between 1e4 and 1e5 is due to cache effects.

  N     Stock     Numeric       numarray   numarray
        Numeric   without       recent     1.1.1
        23.7      check_array   CVS
  1e1   1.31      1.28          8.49       7.64
  1e2   5.86      5.44          14.4       12.1
  1e3   51.8      48            70.4       54.5
  1e4   542       502           643        508
  1e5   7480      6880          7430       6850
  1e6   77500     70700         82700      69100
  1e7   775000    710000        860000     694000

Numeric is faster than numarray from CVS, but there seems to be a regression.
Without check_array, Numeric is almost as fast as numarray 1.1.1.

Remarks
-------

- I'd rather have my speed than checks for NaN's. Have that in a separate
  function (I'm willing to write one), or do numarray-style processor flag
  checks (tougher).

- General plea: *please*, *please*, when releasing a library for which speed
  is a selling point, profile it first!

- Doing the same profiling on numarray finds 15% of the time actually adding,
  65% somewhere in python, and 15% in libc.

- I'm still fiddling. Using the three-argument form of Numeric.add (so no
  memory allocation needs to be done), 64% of the time is now spent adding; I
  think that could be better. The Pyrex vector class I'm working on does 80%
  adding (with memory allocation for the result).

Hope this helps :-)

-- 
|>|\/|<
/--------------------------------------------------------------------------\
|David M. Cooke                      http://arbutus.physics.mcmaster.ca/dmc/
|cookedm@physics.mcmaster.ca
As I mentioned in a recent post, the original Numeric philosophy was damn the
torpedoes, full steam ahead; performance first, safety second. There was a
deliberate decision not to handle NaN, inf, or anything like it, and if you
overflowed you should die.

Unfortunately the original high-performance community did not remain the only
community, and there were lots of complaints about dying, it being considered
unPythonic to die. High-performance people don't mind dying so much; to me it
just means my algorithm is wrong and I need to hear about it. But for a
calculator for a biologist or student of linear algebra that's not the right
answer.

While I haven't researched the source history, I doubt that checking was in
there before. And putting it in should have been the result of a long public
discussion first. Perhaps there was one and I missed it, since I haven't paid
too much attention the last few years (my present project involves Numeric
only a tiny amount).

When I retire maybe I will write a high-performance one. But it will be in C++
and half the people will hate me. (:->

Speed kills. But some of us have to drive fast.
Paul F. Dubois wrote:
But for a calculator for a biologist or student of linear algebra that's not the right answer.
I'm neither, but my needs are similar, and I really want Numeric to NOT stop, and just keep going with a NaN, inf or -inf. IMHO, that is the only intelligent way for an array package to behave, and you get a performance boost as well! However, my understanding is that the IEEE special values are not universally supported by compilers and/or math libraries, so this poses a problem. I wonder if it's still an issue or if all the compilers of interest have the requisite support?
When I retire maybe I will write a high-performance one. But it will be in C++ and half the people will hate me. (:->
Maybe a python wrapper around Blitz++? I would love that!

David,

If you could apply your skills to profiling array creation in numarray, you'd
be doing a great service!

-Chris

-- 
Christopher Barker, Ph.D.
Oceanographer
NOAA/OR&R/HAZMAT           (206) 526-6959   voice
7600 Sand Point Way NE     (206) 526-6329   fax
Seattle, WA 98115          (206) 526-6317   main reception
Chris.Barker@noaa.gov
On 21.01.2005, at 22:40, Paul F. Dubois wrote:
As I mentioned in a recent post, the original Numeric philosophy was damn the torpedoes, full steam ahead; performance first, safety second. There was a deliberate decision not to handle NaN, inf, or anything like it, and if you overflowed you should die.
There was also at some time the idea of having a "safe" version of the code (added checks as a compile-time option) and an installer that compiled both with different module names such that one could ultimately choose at run time which one to use. I really liked that idea, but it never got implemented (there was a "safe" version of ufunc in some versions but it was no different from the standard one). Konrad.
konrad.hinsen@laposte.net wrote:
On 21.01.2005, at 22:40, Paul F. Dubois wrote:
As I mentioned in a recent post, the original Numeric philosophy was damn the torpedoes, full steam ahead; performance first, safety second. There was a deliberate decision not to handle NaN, inf, or anything like it, and if you overflowed you should die.
There was also at some time the idea of having a "safe" version of the code (added checks as a compile-time option) and an installer that compiled both with different module names such that one could ultimately choose at run time which one to use. I really liked that idea, but it never got implemented (there was a "safe" version of ufunc in some versions but it was no different from the standard one).
I really like this approach. The Blitz++ library offers something similar: if
you build your code with -DBZ_DEBUG, it activates a ton of safety checks which
are normally off. The performance plummets, but it can save you days of
debugging, since most pointer/memory errors are flagged instantly where they
occur, instead of causing the usual inscrutable segfaults.

F2PY also has the debug_capi flag which provides similar services, and I've
found it to be tremendously useful on a few occasions.

It would be great to be able to simply use:

#import Numeric
import Numeric_safe as Numeric

to have a safe, debug-enabled version active. The availability of such a
version would also free the developers from having to cater too much to
safety considerations in the default version. The default could be advertised
as 'fast car, no brakes, learn to jump out before going off a cliff', with the
_debug 'family minivan' being there if safety were needed.

Cheers,

f
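As a sketch of how user code could opt in at run time, assuming a hypothetical
Numeric_safe package existed alongside the fast one (it does not today, hence
the fallback):

import os

if os.environ.get("NUMERIC_SAFE"):
    try:
        import Numeric_safe as Numeric   # hypothetical debug-enabled build
    except ImportError:
        import Numeric                   # fall back to the fast library
else:
    import Numeric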
Fernando Perez wrote:
konrad.hinsen@laposte.net wrote:
On 21.01.2005, at 22:40, Paul F. Dubois wrote:
As I mentioned in a recent post, the original Numeric philosophy was damn the torpedoes, full steam ahead; performance first, safety second. There was a deliberate decision not to handle NaN, inf, or anything like it, and if you overflowed you should die.
There was also at some time the idea of having a "safe" version of the code (added checks as a compile-time option) and an installer that compiled both with different module names such that one could ultimately choose at run time which one to use. I really liked that idea, but it never got implemented (there was a "safe" version of ufunc in some versions but it was no different from the standard one).
I really like this approach. The Blitz++ library offers something similar: if you build your code with -DBZ_DEBUG, it activates a ton of safety checks which are normally off. The performance plummets, but it can save you days of debugging, since most pointer/memory errors are flagged instantly where they occur, instead of causing the usual inscrutable segfaults.
F2PY also has the debug_capi flag which provides similar services, and I've found it to be tremendously useful on a few occasions.
It would be great to be able to simply use:
#import Numeric
import Numeric_safe as Numeric
to have a safe, debug-enabled version active. The availability of such a version would also free the developers from having to cater too much to safety considerations in the default version. The default could be advertised as 'fast car, no brakes, learn to jump out before going off a cliff', with the _debug 'family minivan' being there if safety were needed.
Before embarking on such a project, I'd urge that some careful profiling be
done. My gut feeling is that, for most functions, no significant speedup would
result from omitting the range checks that prevent segfaults. In the cases
where removal of such checks would help in C (item access, very small arrays,
etc.), their execution time will be dwarfed by Python's overhead. Without
care, one runs the risk of ending up with a minivan with no brakes; something
no one needs.

'take' is a likely exception since it involves range checking at every
element. But if only a few functions get changed, two versions of the library
is a bad idea; two versions of the functions in question would be better.
Particularly since, in my experience, speed is simply not critical for most of
my numeric code; for the 5% or so where speed is critical, I could use the
unsafe functions and be more careful. This would be easier if the few
differing functions were part of the main library.

I don't have a good feel for the NaN/Inf checking. If it's possible to hoist
the checking to outside all of the loops, then the above arguments probably
apply here as well. If not, this might be a candidate for an 'unsafe' library.
That seems more reasonable to me as I'm much more tolerant of NaNs than
segfaults.

That's my two cents anyway.

-tim
On Saturday, 22 January 2005 20:16, Tim Hochberg wrote:
I don't have a good feel for the NaN/Inf checking. If it's possible to hoist the checking to outside all of the loops, then the above arguments probably apply here as well. If not, this might be a candidate for an 'unsafe' library. That seems more reasonable to me as I'm much more tolerant of NaNs than segfaults.
Why do we need NaN/Inf checking anyway? The whole point of IEEE754 is to give
operations on NaNs and Infs clearly defined results, eliminating many
unnecessary checks.

I think it was one of the fundamental flaws in the design of Python not to
include IEEE754 from the very beginning. Leaving the details of floating point
handling completely to the implementation calls for incompatibilities and
results in a situation where you can only work by trial and error instead of
relying on some defined standard.

-- 
_________________________________________Norbert Nemec
Bernhardstr. 2 ... D-93053 Regensburg
Tel: 0941 - 2009638 ... Mobil: 0179 - 7475199
eMail: <Norbert@Nemec-online.de>
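For what it's worth, the propagation described above can be seen from plain
Python on an IEEE 754 machine (a sketch; 2.3-era builds did not guarantee this
behaviour on every platform):

inf = 1e308 * 10      # float overflow yields inf instead of an exception
print inf + 1.0       # inf
print 1.0 / inf       # 0.0
print inf - inf       # nan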
On Jan 23, 2005, at 12:20, Norbert Nemec wrote:
I think, it was one of the fundamental flaws in the design of Python not to include IEEE754 from the very beginning. Leaving the details of floating
Python is written in C, so it couldn't make more promises about floats than
the C standard does, at least not without an enormous effort. Not even the
floating-point units of modern CPUs respect IEEE in all respects.

Konrad.
-- 
---------------------------------------------------------------------
Konrad Hinsen
Laboratoire Léon Brillouin, CEA Saclay,
91191 Gif-sur-Yvette Cedex, France
Tel.: +33-1 69 08 79 25
Fax: +33-1 69 08 82 61
E-Mail: hinsen@llb.saclay.cea.fr
---------------------------------------------------------------------
On Mon, 24 Jan 2005 konrad.hinsen@laposte.net wrote:
On Jan 23, 2005, at 12:20, Norbert Nemec wrote:
I think, it was one of the fundamental flaws in the design of Python not to include IEEE754 from the very beginning. Leaving the details of floating
Python is written in C, so it couldn't make more promises about floats than the C standard does, at least not without an enormous effort. Not even the floating-point units of modern CPUs respect IEEE in all respects.
And even if that effort had been put into Python, Numeric probably would've
sidestepped it for performance. Note that Python does give
platform-independent behavior for integer division and mod, while Numeric just
gives whatever your C platform does.

Warren Focke
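A concrete illustration of the division point (Python 2 semantics; C89 left
the rounding direction of integer division with negative operands up to the
implementation):

print -7 / 2, -7 % 2    # always -4 and 1 in Python (floor division)
# A conforming C compiler of that era was free to give -3 and -1 instead.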
On Sat, Jan 22, 2005 at 01:34:42AM -0700, Fernando Perez wrote:
the _debug 'family minivan' being there if safety were needed.
Don't know about that analogy :) Minivans are more likely to roll-over than your typical car. Maybe the Volvo S80 would be better: http://www.safercar.gov/NCAP/Cars/3285.html But, I have to say that I love your "import Numeric_safe as Numeric" idea :) Jason
David M. Cooke wrote:
$ opstack -t 1 /usr/bin/python

  self     %        child    %        image name       symbol name
  132281   10.5031  0        0        python           (no symbols)
-------------------------------------------------------------------------------
  704810   55.9618  0        0        _numpy.so        check_array
-------------------------------------------------------------------------------
  309384   24.5650  0        0        umath.so         DOUBLE_add
-------------------------------------------------------------------------------
  112974    8.9701  0        0        libc-2.3.2.so    (no symbols)
-------------------------------------------------------------------------------
The -t 1 limits the display to those routines taking more than 1% of the runtime. 10% for python, and 10% for the C-library probably aren't so bad (I'm thinking that's calls to malloc() and friends). However, the big problem is that only 25% of the time is actually doing useful work. What's check_array doing? We can delve deeper:
Thanks for this *excellent* tutorial and analysis. I would love to see more of
this.

I've never liked the check_array concept. In fact, if you use SciPy, which
makes several changes to things that Numeric does, check_array never runs,
because self->check_return is 0 for all SciPy ufuncs (e.g. those in
fastumath). So perhaps some of these basic benchmarks should be run by SciPy
users. I forgot about this little speed-up that SciPy users enjoy all the
time. SciPy has also added inf and NaN.

I would be very willing to remove check_array from all Numeric ufuncs and
create a separate interface for checking results after the fact. What is the
attitude of the community?

-Travis
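A rough idea of what such an after-the-fact check could look like at the
Python level; this is a sketch only, and check_finite is a made-up name rather
than an existing Numeric or SciPy function:

import Numeric

def check_finite(a):
    """Raise OverflowError if a contains a NaN or an infinity (sketch)."""
    a = Numeric.ravel(Numeric.asarray(a))
    # NaN is the only value that compares unequal to itself.
    if Numeric.sometrue(Numeric.not_equal(a, a)):
        raise OverflowError("array contains NaN")
    # Anything with magnitude beyond the largest finite double is infinite.
    if Numeric.sometrue(Numeric.greater(Numeric.absolute(a),
                                        1.7976931348623157e308)):
        raise OverflowError("array contains inf")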
3) Use oprofile (http://oprofile.sourceforge.net/), which runs on Linux on an x86 processor. This is the approach I've used here. oprofile is a combination of a kernel module for Linux, a daemon for collecting sample data, and several tools to analyse the samples. It periodically polls the processor performance counters and records which code is running. It's a system-level profiler: it profiles _everything_ that's running on the system. One obstacle is that it does require root access.
This looked fantastic so I tried it over the weekend. On Fedora Core 3, I couldn't get any information about numarray runtime (in the shared libraries), only Python. Ditto with Numeric, although from your post you apparently got great results including information on Numeric .so's. I'm curious: has anyone else tried this for numarray (or Numeric) on Fedora Core 3? Does anyone have a working profile script?
Numeric is faster
(with the check_array() feature deletion)
than numarray from CVS, but there seems to be a regression.
(in numarray performance) Don't take this the wrong way, but how confident are you that the speed differences are real? (With my own benchmarking numbers, there is always too much fuzz to split hairs like this.)
Without check_array, Numeric is almost as fast as numarray 1.1.1.
Remarks -------
- I'd rather have my speed than checks for NaN's. Have that in a separate function (I'm willing to write one), or do numarray-style processor flag checks (tougher).
- General plea: *please*, *please*, when releasing a library for which speed is a selling point, profile it first!
- doing the same profiling on numarray finds 15% of the time actually adding, 65% somewhere in python, and 15% in libc.
Part of this is because the numarray number protocol is still in Python.
- I'm still fiddling. Using the three-argument form of Numeric.add (so
add(a,b) and add(a,b,c) are what I've focused on for profiling numarray until the number protocol is moved to C. I've held off doing that because the numarray number protocol is complicated by subclassing issues I'm not sure are fully resolved. Regards, Todd
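For reference, the output-argument form being discussed looks like this (a
minimal sketch; the arrays and loop count are arbitrary):

import Numeric

a = Numeric.arange(10000.0)
b = Numeric.arange(10000.0)
out = Numeric.zeros(10000, Numeric.Float)

for i in xrange(1000):
    Numeric.add(a, b, out)    # result written into out; no temporary created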
Todd Miller <jmiller@stsci.edu> writes:
This looked fantastic so I tried it over the weekend. On Fedora Core 3, I couldn't get any information about numarray runtime (in the shared libraries), only Python. Ditto with Numeric, although from your post you apparently got great results including information on Numeric .so's. I'm curious: has anyone else tried this for numarray (or Numeric) on Fedora Core 3? Does anyone have a working profile script?
I think you need to have --separate=lib when invoking opcontrol. (See later
for an example.)

Some comments on oprofile:

- I think the oprofile tools (opcontrol, opreport etc.) are separate from the
  oprofile module, which is part of the kernel. I installed oprofile-0.8.1
  from source, and it works with my standard Ubuntu kernel. It is easy to
  install it in a non-standard location ($HOME/usr on my system).

- I think opstack is part of oprofile 0.8 (or maybe 0.8.1) -- it wasn't in the
  0.7.1 package available for Ubuntu. Also, to actually get callgraphs (from
  opstack), you need a patched kernel; see here:
  http://oprofile.sf.net/patches/

- I think you probably *shouldn't* compile with -pg if you use oprofile, but
  you should use -g.

To profile shared libraries, I also tried the following:

- sprof. Some sort of dark art glibc tool. I couldn't get this to work with
  dlopen()'ed libraries (in which class I believe Python C extensions fall).

- qprof (http://www.hpl.hp.com/research/linux/qprof/). Almost worked, but I
  couldn't get it to identify symbols in shared libraries. Their page has a
  list of other profilers.

I also tried the Python 2.4 profile module; it does support C-extension
functions as advertised, but it seemed to miss object instantiation calls
(_numarray._numarray's instantiation, in this case).

Sample oprofile usage on my Ubuntu box:

rory@foo:~/hack/numarray/profile $ cat longadd.py
import numarray as na

a = na.arange(1000.0)
b = na.arange(1000.0)

for i in xrange(1000000):
    a + b
rory@foo:~/hack/numarray/profile $ sudo modprobe oprofile
Password:
rory@foo:~/hack/numarray/profile $ sudo ~/usr/bin/opcontrol --start --separate=lib
Using 2.6+ OProfile kernel interface.
Using log file /var/lib/oprofile/oprofiled.log
Daemon started.
Profiler running.
rory@foo:~/hack/numarray/profile $ sudo ~/usr/bin/opcontrol --reset
Signalling daemon... done
rory@foo:~/hack/numarray/profile $ python2.4 longadd.py
rory@foo:~/hack/numarray/profile $ sudo ~/usr/bin/opcontrol --shutdown
Stopping profiling.
Killing daemon.
rory@foo:~/hack/numarray/profile $ opreport -t 2 -l $(which python2.4)
CPU: Athlon, speed 1836.45 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Cycles outside of halt state) with a unit mask of 0x00 (No unit mask) count 100000
samples  %        image name               symbol name
47122    11.2430  _ufuncFloat64.so         add_ddxd_vvxv
26731     6.3778  python2.4                PyEval_EvalFrame
24122     5.7553  libc-2.3.2.so            memset
21228     5.0648  python2.4                lookdict_string
10583     2.5250  python2.4                PyObject_GenericGetAttr
 9131     2.1786  libc-2.3.2.so            mcount
 9026     2.1535  python2.4                PyDict_GetItem
 8968     2.1397  python2.4                PyType_IsSubtype

(The idea wasn't really to discuss the results, but anyway: the prominence of
memset is a little odd -- are destination arrays zeroed before being assigned
the sum result?)

To get the libc symbols you need a libc with debug symbols -- on Ubuntu this
is the libc-dbg package; I don't know what it'll be on Fedora or other
systems. Set the LD_LIBRARY_PATH variable to force these debug libraries to be
loaded:

export LD_LIBRARY_PATH=/usr/lib/debug

This is probably not all that useful -- I suppose it might be interesting if
one generates callgraphs. I don't (yet) have a modified kernel, so I haven't
tried this.

Have fun,

Rory
On Mon, 2005-01-24 at 20:47 +0200, Rory Yorke wrote:
Todd Miller <jmiller@stsci.edu> writes:
This looked fantastic so I tried it over the weekend. On Fedora Core 3, I couldn't get any information about numarray runtime (in the shared libraries), only Python. Ditto with Numeric, although from your post you apparently got great results including information on Numeric .so's. I'm curious: has anyone else tried this for numarray (or Numeric) on Fedora Core 3? Does anyone have a working profile script?
I think you need to have --separate=lib when invoking opcontrol. (See later for an example.)
Thanks! That and using a more liberal "opreport -t" setting got it.
- I think opstack is part of oprofile 0.8 (or maybe 0.8.1) -- it wasn't in the 0.7.1 package available for Ubuntu. Also, to actually get callgraphs (from opstack), you need a patched kernel; see here:
Ugh. Well, that won't happen today for me either.
- I think you probably *shouldn't* compile with -pg if you use oprofile, but you should use -g.
To profile shared libraries, I also tried the following:
- sprof. Some sort of dark art glibc tool. I couldn't get this to work with dlopen()'ed libraries (in which class I believe Python C extensions fall).
- qprof (http://www.hpl.hp.com/research/linux/qprof/). Almost worked, but I couldn't get it to identify symbols in shared libraries. Their page has a list of other profilers.
I tried gprof too but couldn't get much out of it. As David noted, gprof is a pain to use with distutils too.
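For building an extension with symbols that oprofile can resolve, a setup.py
along these lines should do. This is a sketch with made-up module and file
names; distutils usually inherits the flags Python itself was built with, so
-g may already be there:

from distutils.core import setup, Extension

setup(name="example",
      version="0.1",
      ext_modules=[Extension("example",
                             sources=["example.c"],
                             extra_compile_args=["-g"])])   # symbols, no -pg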
I also tried the Python 2.4 profile module; it does support C-extension functions as advertised, but it seemed to miss object instantiation calls (_numarray._numarray's instantiation, in this case).
I think the thing to focus on is building an object cache for "almost- new" small NumArrays; that could potentially short circuit memory object allocation/deallocation costs, NumArray object hierarchical allocation/deallocation costs, etc.
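A rough Python illustration of the free-list idea (the real cache would live
in C inside numarray; all names here are made up):

_freelist = []

def acquire(nelements):
    # Hand back a parked buffer of the right size if we have one...
    for i in xrange(len(_freelist)):
        if len(_freelist[i]) == nelements:
            return _freelist.pop(i)
    # ...otherwise pay for a fresh allocation.
    return [0.0] * nelements

def release(buf):
    if len(_freelist) < 32:       # keep the cache bounded
        _freelist.append(buf)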
rory@foo:~/hack/numarray/profile $ opreport -t 2 -l $(which python2.4)
CPU: Athlon, speed 1836.45 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Cycles outside of halt state) with a unit mask of 0x00 (No unit mask) count 100000
samples  %        image name               symbol name
47122    11.2430  _ufuncFloat64.so         add_ddxd_vvxv
26731     6.3778  python2.4                PyEval_EvalFrame
24122     5.7553  libc-2.3.2.so            memset
21228     5.0648  python2.4                lookdict_string
10583     2.5250  python2.4                PyObject_GenericGetAttr
 9131     2.1786  libc-2.3.2.so            mcount
 9026     2.1535  python2.4                PyDict_GetItem
 8968     2.1397  python2.4                PyType_IsSubtype
(The idea wasn't really to discuss the results, but anyway: The prominence of memset is a little odd -- are destination arrays zeroed before being assigned the sum result?)
Yes, the API routines which allocate the output array zero it. I've tried to remove this in the past but at least one of the add-on packages (linear_algebra or fft I think) wasn't stable w/o the zeroing.
Have fun,
Better already. Thanks again! Todd
Hi,

After a while of waiting for some free time, I'm playing myself with the
excellent oprofile, trying to help in reducing numarray creation time. For
that goal, I selected the next small benchmark:

import numarray
a = numarray.arange(2000)
a.shape = (1000,2)
for j in xrange(1000):
    for i in range(len(a)):
        row = a[i]

I know that it mixes creation with indexing cost, but as the indexing cost of
numarray is only a bit slower (perhaps 40%) than Numeric, while array creation
time is 5 to 10 times slower, I think this benchmark may provide a good
starting point to see what's going on.

For numarray, I've got the next results:

samples  %        image name               symbol name
902       7.3238  python                   PyEval_EvalFrame
835       6.7798  python                   lookdict_string
408       3.3128  python                   PyObject_GenericGetAttr
384       3.1179  python                   PyDict_GetItem
383       3.1098  libc-2.3.2.so            memcpy
358       2.9068  libpthread-0.10.so       __pthread_alt_unlock
293       2.3790  python                   _PyString_Eq
273       2.2166  libnumarray.so           NA_updateStatus
273       2.2166  python                   PyType_IsSubtype
271       2.2004  python                   countformat
252       2.0461  libc-2.3.2.so            memset
249       2.0218  python                   string_hash
248       2.0136  _ndarray.so              _universalIndexing

while for Numeric I've got this:

samples  %        image name               symbol name
279      15.6478  libpthread-0.10.so       __pthread_alt_unlock
216      12.1144  libc-2.3.2.so            memmove
187      10.4879  python                   lookdict_string
162       9.0858  python                   PyEval_EvalFrame
144       8.0763  libpthread-0.10.so       __pthread_alt_lock
126       7.0667  libpthread-0.10.so       __pthread_alt_trylock
56        3.1408  python                   PyDict_SetItem
53        2.9725  libpthread-0.10.so       __GI___pthread_mutex_unlock
45        2.5238  _numpy.so                PyArray_FromDimsAndDataAndDescr
39        2.1873  libc-2.3.2.so            __malloc
36        2.0191  libc-2.3.2.so            __cfree

One preliminary result is that numarray spends a lot more time in Python space
than Numeric does, as Todd already said here. The problem is that, as I have
not yet patched my kernel, I can't get the call tree, so I can't look for the
ultimate culprit.

So, I've tried to run the profile module included in the standard library in
order to see which are the hot spots in Python:

$ time ~/python.nobackup/Python-2.4/python -m profile -s time create-numarray.py
         1016105 function calls (1016064 primitive calls) in 25.290 CPU seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall  filename:lineno(function)
        1   19.220   19.220   25.290   25.290  create-numarray.py:1(?)
   999999    5.530    0.000    5.530    0.000  numarraycore.py:514(__del__)
     1753    0.160    0.000    0.160    0.000  :0(eval)
        1    0.060    0.060    0.340    0.340  numarraycore.py:3(?)
        1    0.050    0.050    0.390    0.390  generic.py:8(?)
        1    0.040    0.040    0.490    0.490  numarrayall.py:1(?)
     3455    0.040    0.000    0.040    0.000  :0(len)
        1    0.030    0.030    0.190    0.190  ufunc.py:1504(_makeCUFuncDict)
       51    0.030    0.001    0.070    0.001  ufunc.py:184(_nIOArgs)
     3572    0.030    0.000    0.030    0.000  :0(has_key)
     2582    0.020    0.000    0.020    0.000  :0(append)
     1000    0.020    0.000    0.020    0.000  :0(range)
        1    0.010    0.010    0.010    0.010  generic.py:510(_stridesFromShape)
     42/1    0.010    0.000   25.290   25.290  <string>:1(?)

but, to tell the truth, I can't really see where the time is exactly consumed.
Perhaps somebody with more experience can shed more light on this?

Another thing that I find intriguing has to do with Numeric and the oprofile
output.
Let me remember:

samples  %        image name               symbol name
279      15.6478  libpthread-0.10.so       __pthread_alt_unlock
216      12.1144  libc-2.3.2.so            memmove
187      10.4879  python                   lookdict_string
162       9.0858  python                   PyEval_EvalFrame
144       8.0763  libpthread-0.10.so       __pthread_alt_lock
126       7.0667  libpthread-0.10.so       __pthread_alt_trylock
56        3.1408  python                   PyDict_SetItem
53        2.9725  libpthread-0.10.so       __GI___pthread_mutex_unlock
45        2.5238  _numpy.so                PyArray_FromDimsAndDataAndDescr
39        2.1873  libc-2.3.2.so            __malloc
36        2.0191  libc-2.3.2.so            __cfree

We can see that a lot of the time in the benchmark using Numeric is consumed
in libc space (37% or so). However, only 16% is used in memory-related tasks
(memmove, malloc and free) while the rest seems to be spent on thread issues
(??). Again, can anyone explain why the pthread* routines take so much time,
or why they appear here at all? Perhaps getting rid of these calls might
improve the Numeric performance even further.

Cheers,

-- 
qo< Francesc Altet http://www.carabos.com/ V V Cárabos Coop. V. Enjoy Data ""
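One way to pull apart the creation and indexing costs mixed in the benchmark
above is to time them separately with the standard timeit module (a sketch;
numbers are machine-dependent):

import timeit

# Constructing a small array from scratch vs. extracting a row view from a
# 2-d array, which is what a[i] does on every inner iteration above.
construct = timeit.Timer("numarray.array([1.0, 2.0])", "import numarray")
rowview = timeit.Timer("row = a[500]",
                       "import numarray; a = numarray.arange(2000.0); a.shape = (1000, 2)")
print construct.timeit(100000)
print rowview.timeit(100000)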
I got some insight into what I think is the tall pole in the profile:
sub-array creation is implemented using views. The generic indexing code does
a view() Python callback because object arrays override view(). Faster view()
creation for numerical arrays can be achieved like this, by avoiding the
callback:

Index: Src/_ndarraymodule.c
===================================================================
RCS file: /cvsroot/numpy/numarray/Src/_ndarraymodule.c,v
retrieving revision 1.75
diff -c -r1.75 _ndarraymodule.c
*** Src/_ndarraymodule.c        14 Jan 2005 14:13:22 -0000      1.75
--- Src/_ndarraymodule.c        28 Jan 2005 11:15:50 -0000
***************
*** 453,460 ****
        }
        } else {   /* partially subscripted --> subarray */
                long i;
!               result = (PyArrayObject *)
!                       PyObject_CallMethod((PyObject *) self,"view",NULL);
                if (!result) goto _exit;
  
                result->nd = result->nstrides = self->nd - nindices;
--- 453,463 ----
        }
        } else {   /* partially subscripted --> subarray */
                long i;
!               if (NA_NumArrayCheck((PyObject *)self))
!                       result = _view(self);
!               else
!                       result = (PyArrayObject *) PyObject_CallMethod(
!                               (PyObject *) self,"view",NULL);
                if (!result) goto _exit;
  
                result->nd = result->nstrides = self->nd - nindices;

I committed the patch above to CVS for now. This optimization makes view()
"non-overridable" for NumArray subclasses, so there is probably a better way
of doing this.

One other thing that struck me looking at your profile, and it has been
discussed before, is that NumArray.__del__() needs to be pushed (back) down
into C. Getting rid of __del__ would also synergize well with making an object
freelist, one aspect of which is capturing unneeded objects rather than
destroying them.

Thanks for the profile.

Regards,
Todd

On Thu, 2005-01-27 at 21:36 +0100, Francesc Altet wrote:
Hi,
After a while of waiting for some free time, I'm playing myself with the excellent oprofile, and try to help in reducing numarray creation.
For that goal, I selected the next small benchmark:
import numarray
a = numarray.arange(2000)
a.shape=(1000,2)
for j in xrange(1000):
    for i in range(len(a)):
        row=a[i]
I know that it mixes creation with indexing cost, but as the indexing cost of numarray is only a bit slower (perhaps a 40%) than Numeric, while array creation time is 5 to 10 times slower, I think this benchmark may provide a good starting point to see what's going on.
For numarray, I've got the next results:
samples  %        image name               symbol name
902       7.3238  python                   PyEval_EvalFrame
835       6.7798  python                   lookdict_string
408       3.3128  python                   PyObject_GenericGetAttr
384       3.1179  python                   PyDict_GetItem
383       3.1098  libc-2.3.2.so            memcpy
358       2.9068  libpthread-0.10.so       __pthread_alt_unlock
293       2.3790  python                   _PyString_Eq
273       2.2166  libnumarray.so           NA_updateStatus
273       2.2166  python                   PyType_IsSubtype
271       2.2004  python                   countformat
252       2.0461  libc-2.3.2.so            memset
249       2.0218  python                   string_hash
248       2.0136  _ndarray.so              _universalIndexing
while for Numeric I've got this:
samples  %        image name               symbol name
279      15.6478  libpthread-0.10.so       __pthread_alt_unlock
216      12.1144  libc-2.3.2.so            memmove
187      10.4879  python                   lookdict_string
162       9.0858  python                   PyEval_EvalFrame
144       8.0763  libpthread-0.10.so       __pthread_alt_lock
126       7.0667  libpthread-0.10.so       __pthread_alt_trylock
56        3.1408  python                   PyDict_SetItem
53        2.9725  libpthread-0.10.so       __GI___pthread_mutex_unlock
45        2.5238  _numpy.so                PyArray_FromDimsAndDataAndDescr
39        2.1873  libc-2.3.2.so            __malloc
36        2.0191  libc-2.3.2.so            __cfree
one preliminary result is that numarray spends a lot more time in Python space than do Numeric, as Todd already said here. The problem is that, as I have not yet patched my kernel, I can't get the call tree, and I can't look for the ultimate responsible for that.
So, I've tried to run the profile module included in the standard library in order to see which are the hot spots in python:
$ time ~/python.nobackup/Python-2.4/python -m profile -s time create-numarray.py
         1016105 function calls (1016064 primitive calls) in 25.290 CPU seconds
Ordered by: internal time
   ncalls  tottime  percall  cumtime  percall  filename:lineno(function)
        1   19.220   19.220   25.290   25.290  create-numarray.py:1(?)
   999999    5.530    0.000    5.530    0.000  numarraycore.py:514(__del__)
     1753    0.160    0.000    0.160    0.000  :0(eval)
        1    0.060    0.060    0.340    0.340  numarraycore.py:3(?)
        1    0.050    0.050    0.390    0.390  generic.py:8(?)
        1    0.040    0.040    0.490    0.490  numarrayall.py:1(?)
     3455    0.040    0.000    0.040    0.000  :0(len)
        1    0.030    0.030    0.190    0.190  ufunc.py:1504(_makeCUFuncDict)
       51    0.030    0.001    0.070    0.001  ufunc.py:184(_nIOArgs)
     3572    0.030    0.000    0.030    0.000  :0(has_key)
     2582    0.020    0.000    0.020    0.000  :0(append)
     1000    0.020    0.000    0.020    0.000  :0(range)
        1    0.010    0.010    0.010    0.010  generic.py:510(_stridesFromShape)
     42/1    0.010    0.000   25.290   25.290  <string>:1(?)
but, to say the truth, I can't really see where the time is exactly consumed. Perhaps somebody with more experience can put more light on this?
Another thing that I find intriguing has to do with Numeric and oprofile output. Let me remember:
samples  %        image name               symbol name
279      15.6478  libpthread-0.10.so       __pthread_alt_unlock
216      12.1144  libc-2.3.2.so            memmove
187      10.4879  python                   lookdict_string
162       9.0858  python                   PyEval_EvalFrame
144       8.0763  libpthread-0.10.so       __pthread_alt_lock
126       7.0667  libpthread-0.10.so       __pthread_alt_trylock
56        3.1408  python                   PyDict_SetItem
53        2.9725  libpthread-0.10.so       __GI___pthread_mutex_unlock
45        2.5238  _numpy.so                PyArray_FromDimsAndDataAndDescr
39        2.1873  libc-2.3.2.so            __malloc
36        2.0191  libc-2.3.2.so            __cfree
we can see that a lot of the time in the benchmark using Numeric is consumed in libc space (a 37% or so). However, only a 16% is used in memory-related tasks (memmove, malloc and free) while the rest seems to be used in thread issues (??). Again, anyone can explain why the pthread* routines take so many time, or why they appear here at all?. Perhaps getting rid of these calls might improve the Numeric performance even further.
Cheers,
Hi Todd,

Nice to see that you achieved a good speed-up with your optimization patch.
With the next code:

import numarray
a = numarray.arange(2000)
a.shape = (1000,2)
for j in xrange(1000):
    for i in range(len(a)):
        row = a[i]

and original numarray-1.1.1, it took 11.254s (pentium4@2GHz). With your patch,
this time has been reduced to 7.816s. Now, following your suggestion to push
NumArray.__del__ down into C, I've got a good speed-up as well: 5.332s. This
is more than twice as fast as the unpatched numarray 1.1.1. There is still a
long way until we can catch Numeric (1.123s), but it is a first step :)

The patch. Please revise it, as I'm not very used to dealing with pure C
extensions (just a Pyrex user):

Index: Lib/numarraycore.py
===================================================================
RCS file: /cvsroot/numpy/numarray/Lib/numarraycore.py,v
retrieving revision 1.101
diff -r1.101 numarraycore.py
696,699c696,699
<     def __del__(self):
<         if self._shadows != None:
<             self._shadows._copyFrom(self)
<             self._shadows = None
---
>     def __del__(self):
>         if self._shadows != None:
>             self._shadows._copyFrom(self)
>             self._shadows = None
Index: Src/_numarraymodule.c
===================================================================
RCS file: /cvsroot/numpy/numarray/Src/_numarraymodule.c,v
retrieving revision 1.65
diff -r1.65 _numarraymodule.c
399a400,411
> static void _numarray_dealloc(PyObject *self)
> {
>     PyArrayObject *selfa = (PyArrayObject *) self;
>
>     if (selfa->_shadows != NULL) {
>         _copyFrom(selfa->_shadows, self);
>         selfa->_shadows = NULL;
>     }
>     self->ob_type->tp_free(self);
> }
421c433
<     0,                              /* tp_dealloc */
---
>     _numarray_dealloc,              /* tp_dealloc */
The profile with the new optimizations now looks like:

samples  %        image name               symbol name
453       8.6319  python                   PyEval_EvalFrame
372       7.0884  python                   lookdict_string
349       6.6502  python                   string_hash
271       5.1639  libc-2.3.2.so            _wordcopy_bwd_aligned
210       4.0015  libnumarray.so           NA_updateStatus
194       3.6966  python                   _PyString_Eq
185       3.5252  libc-2.3.2.so            __GI___strcasecmp
162       3.0869  python                   subtype_dealloc
158       3.0107  libc-2.3.2.so            _int_malloc
147       2.8011  libnumarray.so           isBufferWriteable
145       2.7630  python                   PyDict_SetItem
135       2.5724  _ndarray.so              _view
131       2.4962  python                   PyObject_GenericGetAttr
122       2.3247  python                   PyDict_GetItem
100       1.9055  python                   PyString_InternInPlace
94        1.7912  libnumarray.so           getReadBufferDataPtr
77        1.4672  _ndarray.so              _simpleIndexingCore

i.e. time spent in libc and libnumarray is going up in the list, as it should.
Now we have to concentrate on other points of optimization. Perhaps it is a
good time to have a try at recompiling the kernel and getting the call tree...

Cheers,

On Friday 28 January 2005 12:48, Todd Miller wrote:
I got some insight into what I think is the tall pole in the profile: sub-array creation is implemented using views. The generic indexing code does a view() Python callback because object arrays override view (). Faster view() creation for numerical arrays can be achieved like this by avoiding the callback:
Index: Src/_ndarraymodule.c
===================================================================
RCS file: /cvsroot/numpy/numarray/Src/_ndarraymodule.c,v
retrieving revision 1.75
diff -c -r1.75 _ndarraymodule.c
*** Src/_ndarraymodule.c        14 Jan 2005 14:13:22 -0000      1.75
--- Src/_ndarraymodule.c        28 Jan 2005 11:15:50 -0000
***************
*** 453,460 ****
        }
        } else {   /* partially subscripted --> subarray */
                long i;
!               result = (PyArrayObject *)
!                       PyObject_CallMethod((PyObject *) self,"view",NULL);
                if (!result) goto _exit;
  
                result->nd = result->nstrides = self->nd - nindices;
--- 453,463 ----
        }
        } else {   /* partially subscripted --> subarray */
                long i;
!               if (NA_NumArrayCheck((PyObject *)self))
!                       result = _view(self);
!               else
!                       result = (PyArrayObject *) PyObject_CallMethod(
!                               (PyObject *) self,"view",NULL);
                if (!result) goto _exit;
  
                result->nd = result->nstrides = self->nd - nindices;
I committed the patch above to CVS for now. This optimization makes view() "non-overridable" for NumArray subclasses so there is probably a better way of doing this.
One other thing that struck me looking at your profile, and it has been discussed before, is that NumArray.__del__() needs to be pushed (back) down into C. Getting rid of __del__ would also synergyze well with making an object freelist, one aspect of which is capturing unneeded objects rather than destroying them.
Thanks for the profile.
Regards, Todd
--
qo< Francesc Altet http://www.carabos.com/ V V Cárabos Coop. V. Enjoy Data ""
Nice work! But... IIRC, there's a problem with moving __del__ down to C,
possibly only for a --with-pydebug Python, I can't remember. It's a serious
problem though... it dumps core. I'll try to see if I can come up with
something conditionally compiled.

Related note to Make Todd's Life Easy: use "cvs diff -c" to make context
diffs, which "patch" applies effortlessly.

Thanks for getting the ball rolling. 2x is nothing to sneeze at.

Todd

On Fri, 2005-01-28 at 17:27, Francesc Altet wrote:
Hi Todd,
Nice to see that you can achieved a good speed-up with your optimization path. With the next code:
import numarray
a = numarray.arange(2000)
a.shape=(1000,2)
for j in xrange(1000):
    for i in range(len(a)):
        row=a[i]
and original numarray-1.1.1 it took 11.254s (pentium4@2GHz). With your patch, this time has been reduced to 7.816s. Now, following your suggestion to push NumArray.__del__ down into C, I've got a good speed-up as well: 5.332s. This is more that twice as fast as the unpatched numarray 1.1.1. There is still a long way until we can catch Numeric (1.123s), but it is a first step :)
The patch. Please, revise it as I'm not very used with dealing with pure C extensions (just a Pyrex user):
Index: Lib/numarraycore.py
===================================================================
RCS file: /cvsroot/numpy/numarray/Lib/numarraycore.py,v
retrieving revision 1.101
diff -r1.101 numarraycore.py
696,699c696,699
<     def __del__(self):
<         if self._shadows != None:
<             self._shadows._copyFrom(self)
<             self._shadows = None
---
>     def __del__(self):
>         if self._shadows != None:
>             self._shadows._copyFrom(self)
>             self._shadows = None
Index: Src/_numarraymodule.c
===================================================================
RCS file: /cvsroot/numpy/numarray/Src/_numarraymodule.c,v
retrieving revision 1.65
diff -r1.65 _numarraymodule.c
399a400,411
> static void _numarray_dealloc(PyObject *self)
> {
>     PyArrayObject *selfa = (PyArrayObject *) self;
>
>     if (selfa->_shadows != NULL) {
>         _copyFrom(selfa->_shadows, self);
>         selfa->_shadows = NULL;
>     }
>     self->ob_type->tp_free(self);
> }
421c433
<     0,                              /* tp_dealloc */
---
>     _numarray_dealloc,              /* tp_dealloc */
The profile with the new optimizations looks now like:
samples  %        image name               symbol name
453       8.6319  python                   PyEval_EvalFrame
372       7.0884  python                   lookdict_string
349       6.6502  python                   string_hash
271       5.1639  libc-2.3.2.so            _wordcopy_bwd_aligned
210       4.0015  libnumarray.so           NA_updateStatus
194       3.6966  python                   _PyString_Eq
185       3.5252  libc-2.3.2.so            __GI___strcasecmp
162       3.0869  python                   subtype_dealloc
158       3.0107  libc-2.3.2.so            _int_malloc
147       2.8011  libnumarray.so           isBufferWriteable
145       2.7630  python                   PyDict_SetItem
135       2.5724  _ndarray.so              _view
131       2.4962  python                   PyObject_GenericGetAttr
122       2.3247  python                   PyDict_GetItem
100       1.9055  python                   PyString_InternInPlace
94        1.7912  libnumarray.so           getReadBufferDataPtr
77        1.4672  _ndarray.so              _simpleIndexingCore
i.e. time spent in libc and libnumarray is going up in the list, as it should. Now, we have to concentrate in other points of optimization. Perhaps is a good time to have a try on recompiling the kernel and getting the call tree...
Cheers,
A Divendres 28 Gener 2005 12:48, Todd Miller va escriure:
I got some insight into what I think is the tall pole in the profile: sub-array creation is implemented using views. The generic indexing code does a view() Python callback because object arrays override view (). Faster view() creation for numerical arrays can be achieved like this by avoiding the callback:
Index: Src/_ndarraymodule.c =================================================================== RCS file: /cvsroot/numpy/numarray/Src/_ndarraymodule.c,v retrieving revision 1.75 diff -c -r1.75 _ndarraymodule.c *** Src/_ndarraymodule.c 14 Jan 2005 14:13:22 -0000 1.75 --- Src/_ndarraymodule.c 28 Jan 2005 11:15:50 -0000 *************** *** 453,460 **** } } else { /* partially subscripted --> subarray */ long i; ! result = (PyArrayObject *) ! PyObject_CallMethod((PyObject *) self,"view",NULL); if (!result) goto _exit;
result->nd = result->nstrides = self->nd - nindices; --- 453,463 ---- } } else { /* partially subscripted --> subarray */ long i; ! if (NA_NumArrayCheck((PyObject *)self)) ! result = _view(self); ! else ! result = (PyArrayObject *) PyObject_CallMethod( ! (PyObject *) self,"view",NULL); if (!result) goto _exit;
result->nd = result->nstrides = self->nd - nindices;
I committed the patch above to CVS for now. This optimization makes view() "non-overridable" for NumArray subclasses so there is probably a better way of doing this.
One other thing that struck me looking at your profile, and it has been discussed before, is that NumArray.__del__() needs to be pushed (back) down into C. Getting rid of __del__ would also synergyze well with making an object freelist, one aspect of which is capturing unneeded objects rather than destroying them.
Thanks for the profile.
Regards, Todd
On Thu, 2005-01-27 at 21:36 +0100, Francesc Altet wrote:
Hi,
After a while of waiting for some free time, I'm playing myself with the excellent oprofile, and try to help in reducing numarray creation.
For that goal, I selected the next small benchmark:
import numarray a = numarray.arange(2000) a.shape=(1000,2) for j in xrange(1000): for i in range(len(a)): row=a[i]
I know that it mixes creation with indexing cost, but as the indexing cost of numarray is only a bit slower (perhaps a 40%) than Numeric, while array creation time is 5 to 10 times slower, I think this benchmark may provide a good starting point to see what's going on.
For numarray, I've got the next results:
samples % image name symbol name 902 7.3238 python PyEval_EvalFrame 835 6.7798 python lookdict_string 408 3.3128 python PyObject_GenericGetAttr 384 3.1179 python PyDict_GetItem 383 3.1098 libc-2.3.2.so memcpy 358 2.9068 libpthread-0.10.so __pthread_alt_unlock 293 2.3790 python _PyString_Eq 273 2.2166 libnumarray.so NA_updateStatus 273 2.2166 python PyType_IsSubtype 271 2.2004 python countformat 252 2.0461 libc-2.3.2.so memset 249 2.0218 python string_hash 248 2.0136 _ndarray.so _universalIndexing
while for Numeric I've got this:
samples % image name symbol name 279 15.6478 libpthread-0.10.so __pthread_alt_unlock 216 12.1144 libc-2.3.2.so memmove 187 10.4879 python lookdict_string 162 9.0858 python PyEval_EvalFrame 144 8.0763 libpthread-0.10.so __pthread_alt_lock 126 7.0667 libpthread-0.10.so __pthread_alt_trylock 56 3.1408 python PyDict_SetItem 53 2.9725 libpthread-0.10.so __GI___pthread_mutex_unlock 45 2.5238 _numpy.so PyArray_FromDimsAndDataAndDescr 39 2.1873 libc-2.3.2.so __malloc 36 2.0191 libc-2.3.2.so __cfree
one preliminary result is that numarray spends a lot more time in Python space than do Numeric, as Todd already said here. The problem is that, as I have not yet patched my kernel, I can't get the call tree, and I can't look for the ultimate responsible for that.
So, I've tried to run the profile module included in the standard library in order to see which are the hot spots in python:
$ time ~/python.nobackup/Python-2.4/python -m profile -s time create-numarray.py 1016105 function calls (1016064 primitive calls) in 25.290 CPU seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function) 1 19.220 19.220 25.290 25.290 create-numarray.py:1(?) 999999 5.530 0.000 5.530 0.000 numarraycore.py:514(__del__) 1753 0.160 0.000 0.160 0.000 :0(eval) 1 0.060 0.060 0.340 0.340 numarraycore.py:3(?) 1 0.050 0.050 0.390 0.390 generic.py:8(?) 1 0.040 0.040 0.490 0.490 numarrayall.py:1(?) 3455 0.040 0.000 0.040 0.000 :0(len) 1 0.030 0.030 0.190 0.190 ufunc.py:1504(_makeCUFuncDict) 51 0.030 0.001 0.070 0.001 ufunc.py:184(_nIOArgs) 3572 0.030 0.000 0.030 0.000 :0(has_key) 2582 0.020 0.000 0.020 0.000 :0(append) 1000 0.020 0.000 0.020 0.000 :0(range) 1 0.010 0.010 0.010 0.010 generic.py:510 (_stridesFromShape) 42/1 0.010 0.000 25.290 25.290 <string>:1(?)
but, to tell the truth, I can't really see where the time is actually being spent. Perhaps somebody with more experience can shed more light on this?
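One way to dig further with the standard library alone is to save the raw statistics and query them with pstats; a minimal sketch, assuming the run is repeated with -o prof.out (the profile module's option for writing raw stats to a file instead of printing them):

import pstats

p = pstats.Stats("prof.out")
p.sort_stats("time").print_stats(10)   # the ten most expensive functions
p.print_callers("__del__")             # which code paths trigger the __del__ calls

print_callers() in particular should show whether the 999999 __del__ calls all come from the temporary rows created by a[i].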
Another thing that I find intriguing has to do with the Numeric oprofile output. Let me recall it:
samples  %        image name          symbol name
279      15.6478  libpthread-0.10.so  __pthread_alt_unlock
216      12.1144  libc-2.3.2.so       memmove
187      10.4879  python              lookdict_string
162       9.0858  python              PyEval_EvalFrame
144       8.0763  libpthread-0.10.so  __pthread_alt_lock
126       7.0667  libpthread-0.10.so  __pthread_alt_trylock
56        3.1408  python              PyDict_SetItem
53        2.9725  libpthread-0.10.so  __GI___pthread_mutex_unlock
45        2.5238  _numpy.so           PyArray_FromDimsAndDataAndDescr
39        2.1873  libc-2.3.2.so       __malloc
36        2.0191  libc-2.3.2.so       __cfree
We can see that a lot of the time in the Numeric benchmark is spent in libc space (37% or so). However, only about 16% goes to memory-related tasks (memmove, malloc and free), while the rest seems to go to threading issues (??). Again, can anyone explain why the pthread* routines take so much time, or why they appear here at all? Perhaps getting rid of these calls would improve Numeric's performance even further.
Cheers,
On Thursday 27 January 2005 21:36, Francesc Altet wrote:
So, I've tried running the profile module included in the standard library to see where the hot spots are in Python:
$ time ~/python.nobackup/Python-2.4/python -m profile -s time create-numarray.py
         1016105 function calls (1016064 primitive calls) in 25.290 CPU seconds
Ordered by: internal time
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1   19.220   19.220   25.290   25.290 create-numarray.py:1(?)
   999999    5.530    0.000    5.530    0.000 numarraycore.py:514(__del__)
Might it actually be that (at least part of) the speed problem lies in __del__? I don't have any tools for benchmarking at hand, so I can only ask others to experiment, but I recall that it already struck me as odd a little while ago that hitting Ctrl-C in the middle of numarray calculations nearly always gave me a backtrace ending inside a __del__ function. Should be trivial to test: deactivate __del__ completely for a test run. Ciao, Norbert
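One quick way to try that without editing the installed source is to strip the method off the class before running the benchmark; a minimal sketch, assuming NumArray.__del__ is a plain Python method (as the numarraycore.py:514(__del__) profile entry suggests) and that skipping its cleanup is acceptable for a throwaway timing run:

import numarray

# Remove the Python-level destructor for this process only; whatever
# cleanup it normally performs is skipped, so use this only for timing.
try:
    del numarray.NumArray.__del__
except AttributeError:
    pass   # already gone, or defined somewhere unexpected

# ... then run the creation/indexing loop from the benchmark above and time it.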
Hi there, indeed my suspicion has proven correct:
On Thursday 27 January 2005 21:36, Francesc Altet wrote: [...]
For that goal, I selected the following small benchmark:
import numarray
a = numarray.arange(2000)
a.shape = (1000,2)
for j in xrange(1000):
    for i in range(len(a)):
        row = a[i]
[...]
So, I've tried running the profile module included in the standard library to see where the hot spots are in Python:
$ time ~/python.nobackup/Python-2.4/python -m profile -s time create-numarray.py
         1016105 function calls (1016064 primitive calls) in 25.290 CPU seconds
Ordered by: internal time
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1   19.220   19.220   25.290   25.290 create-numarray.py:1(?)
   999999    5.530    0.000    5.530    0.000 numarraycore.py:514(__del__)
     1753    0.160    0.000    0.160    0.000 :0(eval)
[...]
This benchmark made me suspicious, since I had already found it odd before that killing a numarray calculation with Ctrl-C nearly always gives a backtrace starting in __del__. So I did the simple thing: I commented out the NumArray.__del__ routine (numarraycore.py, line 514, current CVS version). The result is astonishing:

Vanilla numarray:
nobbi@Marvin:~/tmp $ time python create-array.py
real    0m9.457s
user    0m8.851s
sys     0m0.038s

NumArray.__del__ commented out:
nobbi@Marvin:~/tmp $ time python create-array.py
real    0m6.512s
user    0m6.065s
sys     0m0.021s

A 30% speedup! Detailed benchmarking shows similar results. I don't think I have to go on about this at this point: it seems clear that __del__ has to be avoided in such a central position. Ciao, Norbert
This benchmark made me suspicious since I had already found it odd before that killing a numarray calculation with Ctrl-C nearly always gives a backtrace starting in __del__
Much of the Python machinery may have been torn down by the time your __del__ method is called while the interpreter is exiting (I'm assuming you're talking about a script, rather than interactive mode). Code should be prepared for anything to fail - it's quite common for parts of __builtins__ to have been disassembled, etc.

The language reference has this to say:

http://python.org/doc/2.3.4/ref/customization.html#l2h-174

    Warning: Due to the precarious circumstances under which __del__() methods are invoked, exceptions that occur during their execution are ignored, and a warning is printed to sys.stderr instead. Also, when __del__() is invoked in response to a module being deleted (e.g., when execution of the program is done), other globals referenced by the __del__() method may already have been deleted. For this reason, __del__() methods should do the absolute minimum needed to maintain external invariants. Starting with version 1.5, Python guarantees that globals whose name begins with a single underscore are deleted from their module before other globals are deleted; if no other references to such globals exist, this may help in assuring that imported modules are still available at the time when the __del__() method is called.

Another important caveat of classes with __del__ methods is mentioned in the library reference for the "gc" module:

http://python.org/doc/2.3.4/lib/module-gc.html

    Objects that have __del__() methods and are part of a reference cycle cause the entire reference cycle to be uncollectable, including objects not necessarily in the cycle but reachable only from it. Python doesn't collect such cycles automatically because, in general, it isn't possible for Python to guess a safe order in which to run the __del__() methods.

-- Andrew McNamara, Senior Developer, Object Craft http://www.object-craft.com.au/
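A small self-contained demonstration of the gc caveat quoted above (plain Python, nothing numarray-specific):

import gc

class Leaky:
    def __del__(self):
        pass

a = Leaky()
a.ref = a        # create a reference cycle involving an object with __del__
del a            # drop the only external reference

gc.collect()     # the cycle is found but not freed...
print gc.garbage # ...the Leaky instance is parked here instead, uncollected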
Thanks Andrew, that was a useful summary. I wish I had more time to work on optimizing numarray personally, but I don't. Instead I'll try to share what I know of the state of __del__/tp_dealloc so that people who want to work on it can come up with something better:

1. We need __del__/tp_dealloc. (This may be controversial but I hope not). Using the destructor makes the high level C-API cleaner. Getting rid of it means changing the C-API. __del__/tp_dealloc is used to transparently copy the contents of a working array back onto an ill-behaved (byteswapped, etc...) source array at extension function exit time.

2. There's a problem with the tp_dealloc I originally implemented which causes it to segfault for a ./configure'ed --with-pydebug Python. Looking at it today, it looks like it may be an exit-time garbage collection problem. There is no explicit garbage collection support in _numarray or _ndarray, so that may be the problem.

3. We're definitely not exploiting the "single underscore rule" yet. We use single underscores mostly to hide globals from module export. I don't think this is really critical, but that's the state of things.

4. Circular references should only be a problem for numerical arrays with "user introduced" cycles. numarray ObjectArrays have no __del__.

I attached a patch against CVS that reinstates the old tp_dealloc; this shows where I left off in case someone has insight on how to fix it. I haven't tested it recently for a non-debug Python; I think it works. The patch segfaults after the C-API examples/selftest for debug Pythons:

% python setup.py install --selftest
Using EXTRA_COMPILE_ARGS = []
running install
running build
running build_py
copying Lib/numinclude.py -> build/lib.linux-i686-2.4/numarray
running build_ext
running install_lib
copying build/lib.linux-i686-2.4/numarray/numinclude.py -> /home/jmiller/work/lib/python2.4/site-packages/numarray
byte-compiling /home/jmiller/work/lib/python2.4/site-packages/numarray/numinclude.py to numinclude.pyc
running install_headers
copying Include/numarray/numconfig.h -> /home/jmiller/work/include/python2.4/numarray
running install_data
Testing numarray 1.2a on Python (2, 4, 0, 'final', 0)
numarray.numtest: ((0, 1231), (0, 1231))
numarray.ieeespecial: (0, 20)
numarray.records: (0, 48)
numarray.strings: (0, 186)
numarray.memmap: (0, 82)
numarray.objects: (0, 105)
numarray.memorytest: (0, 16)
numarray.examples.convolve: ((0, 20), (0, 20), (0, 20), (0, 20))
Segmentation fault (core dumped)

That's all I can add for now.

Regards, Todd
participants (14)
- Andrew McNamara
- Chris Barker
- cookedm@physics.mcmaster.ca
- Fernando Perez
- Francesc Altet
- Jason Rennie
- konrad.hinsen@laposte.net
- Norbert Nemec
- Paul F. Dubois
- Rory Yorke
- Tim Hochberg
- Todd Miller
- Travis Oliphant
- Warren Focke