On Tuesday 29 March 2005 15:23, Francesc Altet wrote:
This issue was brought up on this list some months ago (see [1]). I, for one, have given up calling NA_updateDataPtr() during table reads in PyTables, and this sped up the reading process by 70%, which is no joke. In theory, the same speed-up could be achieved in every piece of code that reads like:

    for i in range(n):
        a = numarrayobject[i]

that is, whenever a single element of an array is accessed.
Well, the statement above is not exactly true. The overhead introduced by NA_updateDataPtr (and other functions related to the buffer object) matters most when you call the __getitem__ method from *extensions*, and is less important (but still significant!) in pure Python.

This evening I wanted to evaluate how much faster things would be if it were not necessary to call NA_updateDataPtr and companions (i.e. if we got rid of the buffer object). I found some interesting results and ended up writing a rather long report that took this sunny spring evening away from me :(

Despite its rather serious format, please don't take it as a rigorous demonstration of anything. I wrote it basically because I need maximum performance on __getitem__ operations and was curious about what Numeric/numarray/Numeric3 can offer in that regard. I'm publishing it here because it might be of help to somebody.

Cheers,

--
qo<   Francesc Altet     http://www.carabos.com/
V  V   Cárabos Coop. V.   Enjoy Data
 ""
A note on __getitem__ performance on Numeric/numarray in Python extensions
(with a small follow-up on Numeric3)
==========================================================================

Francesc Altet
2005-03-29

Abstract
========

Numeric [1] and numarray [2] are Python packages that provide very convenient containers for dealing efficiently with large amounts of data in memory. Because their implementations are quite different, each package is naturally better suited to some areas than the other. In fact, we are lucky to have such a duality, because competition is essential in every (sane) software ecosystem.

The best way of determining which package is better adapted to a certain task is benchmarking. In this report, I have used Psyco [3] and oprofile [4] to decide which is the best candidate for accessing the data in the containers from C extensions.

In the appendix, some attention is also given to Numeric3, a newborn contender to Numeric and numarray.

Motivation
==========

I need peak performance when accessing data belonging to Numeric/numarray objects in my extensions, so I decided to do some profiling on the following code, which is representative of my own needs:

    niter = 5
    N = 1000*1000

    def matrix_loop(object):
        for j in xrange(niter):
            for i in xrange(N):
                p = object[i]

This basically exercises the __getitem__ special method in Numeric/numarray objects.

The benchmark
=============

In order to get some comparisons done, I've written a small script (getitem-numarrayVSNumeric.py) that checks the speed for both kinds of objects: Numeric and numarray. Also, to reduce the Python overhead, I've used psyco [3] so that the results get as close as possible to what these tests would give inside a Python extension (written in C). Moreover, I've used oprofile [4] to get an idea of where the CPU time is spent in this loop.
First of all, I've made a calibration test to measure the time of the empty loop, that is:

    def null_loop():
        for j in xrange(niter):
            for i in xrange(N):
                pass

This time is almost negligible when running with psyco (and the same happens inside a C extension), but it is *significant* if psyco is not active. Once this time has been measured, it is subtracted from the loops that actually exercise __getitem__.

First (naive) timings
=====================

Now, let's see some of the timings that I've obtained. My platform is a Pentium 4 @ 2 GHz laptop running Debian GNU/Linux with kernel 2.6.9 and gcc 3.3.5.

First of all, I'll list the results without psyco:

    $ python2.3 bench/getitem-numarrayVSNumeric.py
    Psyco not active
    Numeric version: 23.8
    numarray version: 1.2.3
    Calibration loop: 0.11173081398
    Time for numarray(getitem)/iter: 3.82528972626e-07
    Time for Numeric(getitem)/iter: 2.51150989532e-07
    getitem in Numeric is 1.52310358537 times faster

We can see that the time per iteration is about 380 ns for numarray and about 250 ns for Numeric, which makes Numeric 1.5x faster than numarray.

Using psyco to reduce Python overhead
=====================================

However, even though we have subtracted the time for the calibration loop, there may remain other places where time is wasted in Python space. Psyco is a good way to optimize loops and make them go almost as fast as in C.

Now, the figures using psyco:

    $ python2.3 bench/getitem-numarrayVSNumeric.py
    Psyco active
    Numeric version: 23.8
    numarray version: 1.2.3
    Calibration loop: 0.0015878200531
    Time for numarray(getitem)/iter: 2.4246096611e-07
    Time for Numeric(getitem)/iter: 1.19336557388e-07
    getitem in Numeric is 2.0317409134 times faster

We can see that the time for the calibration loop has improved by a factor of about 70 (from 0.112 s to 0.0016 s). Not too bad for a silly loop. Also, the time per iteration has dropped to 242 ns for numarray and to 119 ns for Numeric. This accounts for a 2x speed-up.
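For readers without the original script, the measurement procedure described above can be sketched roughly as follows. This is not getitem-numarrayVSNumeric.py itself, only a minimal Python 3 approximation under stated assumptions: the function names are hypothetical, time.perf_counter stands in for whatever timer the script used, and a plain list stands in for a Numeric/numarray object.

```python
import time


def null_loop(n, niter):
    """Calibration: the same nested loops with an empty body."""
    for _ in range(niter):
        for _ in range(n):
            pass


def getitem_loop(obj, n, niter):
    """Call obj.__getitem__ once per inner iteration."""
    for _ in range(niter):
        for i in range(n):
            p = obj[i]


def time_per_getitem(obj, n, niter=5):
    """Seconds per obj[i] after subtracting the empty-loop time."""
    t0 = time.perf_counter()
    null_loop(n, niter)
    calibration = time.perf_counter() - t0

    t0 = time.perf_counter()
    getitem_loop(obj, n, niter)
    elapsed = time.perf_counter() - t0

    return (elapsed - calibration) / (n * niter)


def speedup(slow, fast):
    """How many times faster the 'fast' timing is than the 'slow' one."""
    return slow / fast


if __name__ == "__main__":
    n = 100 * 1000  # smaller than the N = 1000*1000 used in the report
    data = list(range(n))  # assumption: a list stands in for the array
    print("Time for list(getitem)/iter:", time_per_getitem(data, n))
    # Ratios recomputed from the per-iteration times reported above
    print("no psyco:", speedup(3.82528972626e-07, 2.51150989532e-07))
    print("psyco:   ", speedup(2.4246096611e-07, 1.19336557388e-07))
```

On a noisy machine the calibration-corrected time can fluctuate noticeably for small n, so a larger n (or several repetitions) gives steadier numbers.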
The first conclusion is that numarray is considerably slower than Numeric when accessing its data. Besides, when using psyco, part of the Python overhead evaporates, which widens the gap between the Numeric and numarray loops.

Introducing oprofile: getting a broad view of what's going on
=============================================================

In order to measure the exact difference in the __getitem__ method without the Python overhead (in an extension, for example), I've used oprofile against the psyco version of the benchmark. Here is the result for the run with psyco, profiled with oprofile:

    # opreport /usr/bin/python2.3
     samples|      %|
    ------------------
         586 34.1293 libnumarray.so
         454 26.4415 python2.3
         331 19.2778 _numpy.so
         206 11.9977 _ndarray.so
         102  5.9406 memory.so
          22  1.2813 libc-2.3.2.so
           9  0.5242 ld-2.3.2.so
           4  0.2330 multiarray.so
           2  0.1165 _sort.so
           1  0.0582 _psyco.so

The libnumarray.so, _ndarray.so, memory.so and _sort.so shared libraries all belong to the numarray package. The _numpy.so and multiarray.so libraries belong to Numeric. The time spent in Python space is quite small (just 26%, to a great extent thanks to the psyco acceleration). The libc-2.3.2.so and ld-2.3.2.so libraries belong to the C runtime, and it is not possible to decide whether this time was used by numarray, Numeric or Python itself, but as the time consumed is very small, we can safely ignore it.

So, if we sum the samples taken while the CPU was in the three main numarray shared libraries (libnumarray.so, _ndarray.so and memory.so) and compare them with the samples taken in _numpy.so for Numeric, we get 894 against 331, which means that Numeric is 2.7x faster than numarray for __getitem__. Of course, this is more than the 1.5x and 2x factors that we got earlier, because those included the time spent in Python space. However, the 2.7x factor is probably more accurate when one wants to exercise __getitem__ in C extensions.
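The arithmetic behind the 2.7x figure can be reproduced from the opreport listing. The short Python sketch below only re-adds the sample counts shown above; the variable names and the grouping tuples are assumptions made for illustration. It also shows that counting the two small libraries left out of the sum (_sort.so for numarray, multiarray.so for Numeric) barely changes the ratio.

```python
# Sample counts copied from the opreport listing above
samples = {
    "libnumarray.so": 586,
    "python2.3": 454,
    "_numpy.so": 331,
    "_ndarray.so": 206,
    "memory.so": 102,
    "libc-2.3.2.so": 22,
    "ld-2.3.2.so": 9,
    "multiarray.so": 4,
    "_sort.so": 2,
    "_psyco.so": 1,
}

# Main libraries of each package, as used for the 894 vs 331 comparison
numarray_main = ("libnumarray.so", "_ndarray.so", "memory.so")
numeric_main = ("_numpy.so",)

numarray_samples = sum(samples[lib] for lib in numarray_main)
numeric_samples = sum(samples[lib] for lib in numeric_main)
ratio = numarray_samples / numeric_samples

# Same comparison including the two minor libraries of each package
numarray_all = numarray_samples + samples["_sort.so"]
numeric_all = numeric_samples + samples["multiarray.so"]
ratio_all = numarray_all / numeric_all

# Share of samples spent in the Python interpreter itself
total = sum(samples.values())
python_share = 100.0 * samples["python2.3"] / total

print("numarray vs Numeric samples: %d vs %d (%.2fx)"
      % (numarray_samples, numeric_samples, ratio))
print("including minor libraries:   %d vs %d (%.2fx)"
      % (numarray_all, numeric_all, ratio_all))
print("time in python2.3: %.1f%%" % python_share)
```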
Most CPU intensive functions using oprofile
===========================================

If we want to look at the most time-consuming functions in numarray:

    # opstack -t 1 /usr/bin/python2.3 | sort -nr | head -10
    454 26.6432 python2.3      (no symbols)
    331 19.4249 _numpy.so      (no symbols)
    145  8.5094 libnumarray.so NA_getPythonScalar
    115  6.7488 libnumarray.so NA_getByteOffset
    101  5.9272 libnumarray.so isBufferWriteable
     98  5.7512 _ndarray.so    _ndarray_subscript
     91  5.3404 _ndarray.so    _simpleIndexingCore
     73  4.2840 libnumarray.so NA_updateDataPtr
     64  3.7559 memory.so      memory_getbuf
     60  3.5211 libnumarray.so getReadBufferDataPtr

The _numpy.so library was stripped of debugging info, so we can't see where the time was spent in Numeric. However, we can estimate the cost in numarray of getting a fresh pointer to the data buffer on every data access: isBufferWriteable + NA_updateDataPtr + memory_getbuf + getReadBufferDataPtr gives a total of 298 samples (101 + 73 + 64 + 60), which is almost as much as all the time spent in the Numeric shared library (331). So we can conclude that having a buffer object in our array object can be a serious drawback if we want maximum performance when accessing the data.

Another point worth looking at is NA_getByteOffset, which takes 115 samples by itself. This is perhaps a little too much.

Conclusions
===========

To sum up, we can expect the __getitem__ method in Numeric to be 1.5x faster than in numarray in pure Python code, 2x faster when using psyco, and 2.7x faster when used in C extensions.

One factor that (partially) explains why numarray is slower in this area is that it relies on the buffer interface to keep its data. This feature, while very convenient for certain tasks (like sharing data with other Python packages or extensions), has a limitation that can make an extension crash if the memory buffer is reallocated. Other solutions (like the "bytes" object [5]) have been proposed to overcome this limitation (and others) of the buffer interface.
Numeric3 might adopt this to avoid the kind of contention problems created by the buffer interface.

Finally, we have seen how oprofile can be of invaluable help for determining where the hot spots are, not only in our extensions, but also in other shared libraries in our system. If the shared libraries also carry debugging info, then it is possible to track down even the most expensive routines in our application.

Appendix
========

Even though it is in the very early stages of its existence, I was curious about how Numeric3 would perform in comparison with Numeric. By slightly changing getitem-numarrayVSNumeric.py, I've come up with getitem-NumericVSNumeric3.py, which does the comparison I wanted. When running without psyco, I got:

    $ python2.3 bench/getitem-NumericVSNumeric3.py
    Psyco not active
    Numeric version: 23.8
    Numeric3 version: Very early alpha release...!
    Calibration loop: 0.107951593399
    Time for Numeric3(getitem)/iter: 1.18472018242e-06
    Time for Numeric(getitem)/iter: 2.45458602905e-07
    getitem in Numeric is 4.82655799551 times faster

Oops, Numeric3 is almost 5 times slower than Numeric. So it really seems to be still very alpha (you know, premature optimization is the root of all evil). Never mind, this is just an exercise. So, let's continue with the psyco version:

    $ python2.3 bench/getitem-NumericVSNumeric3.py
    Psyco active
    Numeric version: 23.8
    Numeric3 version: Very early alpha release...!
    Calibration loop: 0.00171356201172
    Time for Numeric3(getitem)/iter: 1.04013824463e-06
    Time for Numeric(getitem)/iter: 1.19578647614e-07
    getitem in Numeric is 8.69836099828 times faster

The gap has increased to 8.7x, as expected. Let's have a look at the most time-consuming shared libraries by using oprofile:

    # opreport /usr/bin/python2.3
     samples|      %|
    ------------------
        1841 33.7365 multiarray.so
        1701 31.1710 libc-2.3.2.so
        1586 29.0636 python2.3
         318  5.8274 _numpy.so
           6  0.1100 ld-2.3.2.so
           3  0.0550 multiarray.so
           2  0.0367 _psyco.so

God! Two libraries alone are taking more than half of the CPU: multiarray.so and libc-2.3.2.so. As we already know that __getitem__ in Numeric3 takes much more time than its counterpart in Numeric, we can conclude that Numeric3 comes with its own multiarray.so, and that it is responsible for one third (33.7%) of the time. Moreover, multiarray.so is most likely the one calling the libc routines so much, because in our previous benchmark the libc calls never took more than 5% of the time, and here they take more than 30%.

To conclude, let's see which are the most time-consuming routines in Numeric3 for this exercise:

    # opstack -t 1 /usr/bin/python2.3 | sort -nr | head -20
    1586 30.1750 python2.3     (no symbols)
     669 12.7283 libc-2.3.2.so __GI___strcasecmp
     618 11.7580 multiarray.so PyArray_MapIterNew
     374  7.1157 multiarray.so array_subscript
     318  6.0502 _numpy.so     (no symbols)
     260  4.9467 libc-2.3.2.so __realloc
     190  3.6149 libc-2.3.2.so _int_malloc
     172  3.2725 multiarray.so PyArray_New
     152  2.8919 libc-2.3.2.so __strncasecmp
     123  2.3402 libc-2.3.2.so malloc_consolidate
     121  2.3021 libc-2.3.2.so __memalign_internal
     118  2.2451 multiarray.so array_dealloc
     102  1.9406 libc-2.3.2.so _int_realloc
      93  1.7694 multiarray.so fancy_indexing_check
      86  1.6362 multiarray.so arraymapiter_dealloc
      79  1.5030 multiarray.so PyArray_Scalar
      76  1.4460 multiarray.so LONG_copyswapn
      62  1.1796 multiarray.so PyArray_UpdateFlags
      57  1.0845 multiarray.so PyArray_DescrFromType

While we can see that a lot of time is spent inside the multiarray.so of Numeric3, it also catches our attention that a lot of time is spent in the libc function __GI___strcasecmp. This is very strange, because our arrays are made of integers, and calling strcasecmp on each iteration seems quite unnecessary. In order to know who is calling strcasecmp (i.e. to get the call tree), oprofile needs a specially patched version of the Linux kernel. But this is material for another story.
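As a rough aid for reading the listing above, the libc rows can be grouped by what they do. The Python sketch below uses only the sample counts from the opstack output; the split into "string comparison" and "memory allocation" groups, and all variable names, are assumptions made for illustration. It says nothing about who the callers are.

```python
# libc-2.3.2.so rows copied from the opstack listing above
libc_rows = {
    "__GI___strcasecmp": 669,
    "__realloc": 260,
    "_int_malloc": 190,
    "__strncasecmp": 152,
    "malloc_consolidate": 123,
    "__memalign_internal": 121,
    "_int_realloc": 102,
}

# Assumed grouping: case-insensitive string comparison vs allocator internals
string_compare = ("__GI___strcasecmp", "__strncasecmp")
allocator = ("__realloc", "_int_malloc", "malloc_consolidate",
             "__memalign_internal", "_int_realloc")

string_samples = sum(libc_rows[name] for name in string_compare)
alloc_samples = sum(libc_rows[name] for name in allocator)
libc_listed = sum(libc_rows.values())

print("string comparison: %d samples" % string_samples)
print("memory allocation: %d samples" % alloc_samples)
print("libc rows listed:  %d samples" % libc_listed)
```

Under this grouping, the listed libc samples split almost evenly between string comparison and memory allocation, so the allocator traffic looks about as costly as the strcasecmp calls in this run.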
References
==========

[1] http://numpy.sourceforge.net/
[2] http://stsdas.stsci.edu/numarray/
[3] http://psyco.sourceforge.net/
[4] http://oprofile.sourceforge.net/
[5] http://www.python.org/peps/pep-0296.html