Hi Todd,

Nice to see that you have achieved a good speed-up with your optimization patch. With the following code:

    import numarray
    a = numarray.arange(2000)
    a.shape = (1000, 2)
    for j in xrange(1000):
        for i in range(len(a)):
            row = a[i]

and the original numarray 1.1.1, it took 11.254s (Pentium 4 @ 2 GHz). With your patch, this time is reduced to 7.816s. Now, following your suggestion to push NumArray.__del__ down into C, I've got a good speed-up as well: 5.332s. That is more than twice as fast as the unpatched numarray 1.1.1. There is still a long way to go before we can catch Numeric (1.123s), but it is a first step :)

Here is the patch. Please review it, as I'm not very experienced with pure C extensions (I'm just a Pyrex user):

Index: Lib/numarraycore.py
===================================================================
RCS file: /cvsroot/numpy/numarray/Lib/numarraycore.py,v
retrieving revision 1.101
diff -r1.101 numarraycore.py
696,699c696,699
<     def __del__(self):
<         if self._shadows != None:
<             self._shadows._copyFrom(self)
<             self._shadows = None
---
>     def __del__(self):
>         if self._shadows != None:
>             self._shadows._copyFrom(self)
>             self._shadows = None
Index: Src/_numarraymodule.c
===================================================================
RCS file: /cvsroot/numpy/numarray/Src/_numarraymodule.c,v
retrieving revision 1.65
diff -r1.65 _numarraymodule.c
399a400,411
> static void
> _numarray_dealloc(PyObject *self)
> {
>     PyArrayObject *selfa = (PyArrayObject *) self;
>
>     if (selfa->_shadows != NULL) {
>         _copyFrom(selfa->_shadows, self);
>         selfa->_shadows = NULL;
>     }
>
>     self->ob_type->tp_free(self);
> }
421c433
<     0,                          /* tp_dealloc */
---
>     _numarray_dealloc,          /* tp_dealloc */
The profile with the new optimizations now looks like this:

samples  %        image name       symbol name
453       8.6319  python           PyEval_EvalFrame
372       7.0884  python           lookdict_string
349       6.6502  python           string_hash
271       5.1639  libc-2.3.2.so    _wordcopy_bwd_aligned
210       4.0015  libnumarray.so   NA_updateStatus
194       3.6966  python           _PyString_Eq
185       3.5252  libc-2.3.2.so    __GI___strcasecmp
162       3.0869  python           subtype_dealloc
158       3.0107  libc-2.3.2.so    _int_malloc
147       2.8011  libnumarray.so   isBufferWriteable
145       2.7630  python           PyDict_SetItem
135       2.5724  _ndarray.so      _view
131       2.4962  python           PyObject_GenericGetAttr
122       2.3247  python           PyDict_GetItem
100       1.9055  python           PyString_InternInPlace
94        1.7912  libnumarray.so   getReadBufferDataPtr
77        1.4672  _ndarray.so      _simpleIndexingCore

i.e. the time spent in libc and libnumarray is moving up the list, as it should. Now we have to concentrate on other optimization targets. Perhaps it is a good time to try recompiling the kernel and getting the call tree...

Cheers,

On Friday 28 January 2005 12:48, Todd Miller wrote:
I got some insight into what I think is the tall pole in the profile: sub-array creation is implemented using views. The generic indexing code does a view() Python callback because object arrays override view(). Faster view() creation for numerical arrays can be achieved by avoiding the callback, like this:
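Todd's point is that every a[i] in the benchmark creates a sub-array as a view onto the parent's buffer. A minimal sketch of that behavior, using numpy as a modern stand-in for numarray (whose API this thread predates):

```python
import numpy as np  # stand-in for the long-unmaintained numarray

# Build the benchmark's array; .copy() makes `a` own its buffer.
a = np.arange(2000).reshape(1000, 2).copy()

# A partial subscript creates the sub-array as a *view*, not a copy:
row = a[0]
assert row.base is a        # row borrows a's data buffer

# Writing through the view mutates the parent array.
row[0] = -1
assert a[0, 0] == -1
```

In numarray this view creation went through a Python-level view() callback for every index operation, which is exactly the overhead the patch below short-circuits for plain numerical arrays.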
Index: Src/_ndarraymodule.c
===================================================================
RCS file: /cvsroot/numpy/numarray/Src/_ndarraymodule.c,v
retrieving revision 1.75
diff -c -r1.75 _ndarraymodule.c
*** Src/_ndarraymodule.c    14 Jan 2005 14:13:22 -0000    1.75
--- Src/_ndarraymodule.c    28 Jan 2005 11:15:50 -0000
***************
*** 453,460 ****
          }
      } else {  /* partially subscripted --> subarray */
          long i;
!         result = (PyArrayObject *)
!             PyObject_CallMethod((PyObject *) self, "view", NULL);
          if (!result) goto _exit;

          result->nd = result->nstrides = self->nd - nindices;
--- 453,463 ----
          }
      } else {  /* partially subscripted --> subarray */
          long i;
!         if (NA_NumArrayCheck((PyObject *) self))
!             result = _view(self);
!         else
!             result = (PyArrayObject *) PyObject_CallMethod(
!                 (PyObject *) self, "view", NULL);
          if (!result) goto _exit;

          result->nd = result->nstrides = self->nd - nindices;
I committed the patch above to CVS for now. This optimization makes view() "non-overridable" for NumArray subclasses so there is probably a better way of doing this.
One other thing that struck me looking at your profile, and it has been discussed before, is that NumArray.__del__() needs to be pushed (back) down into C. Getting rid of __del__ would also synergize well with making an object freelist, one aspect of which is capturing unneeded objects rather than destroying them.
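The freelist idea can be sketched in a few lines of pure Python; all names below are illustrative, not numarray's actual implementation:

```python
# A minimal sketch of an object freelist: instead of destroying
# short-lived objects, dead instances are captured and recycled,
# avoiding repeated allocation/initialization cost.

class _Pool:
    def __init__(self, factory, maxsize=128):
        self._factory = factory
        self._free = []            # captured, reusable instances
        self._maxsize = maxsize

    def acquire(self):
        # reuse a captured object when possible, else build a new one
        return self._free.pop() if self._free else self._factory()

    def release(self, obj):
        # capture the object instead of letting it be destroyed
        if len(self._free) < self._maxsize:
            self._free.append(obj)

pool = _Pool(factory=list)
a = pool.acquire()
pool.release(a)
b = pool.acquire()
assert b is a      # the "destroyed" object was recycled, not rebuilt
```

CPython itself uses this trick internally (e.g. for small ints and frames); the win is largest exactly when, as in this benchmark, huge numbers of identical small objects are created and dropped in a tight loop.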
Thanks for the profile.
Regards, Todd
On Thu, 2005-01-27 at 21:36 +0100, Francesc Altet wrote:
Hi,
After waiting a while for some free time, I've been playing with the excellent oprofile myself, trying to help reduce numarray creation time.
For that goal, I selected the following small benchmark:
import numarray
a = numarray.arange(2000)
a.shape = (1000, 2)
for j in xrange(1000):
    for i in range(len(a)):
        row = a[i]
I know that it mixes creation with indexing cost, but as the indexing cost of numarray is only a bit slower (perhaps 40%) than Numeric's, while array creation is 5 to 10 times slower, I think this benchmark provides a good starting point to see what's going on.
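As an aside, the two costs can also be separated directly with timeit; a sketch using numpy as a stand-in, since numarray and Numeric are no longer installable on modern systems:

```python
import timeit

# Time raw small-array creation and row indexing separately, so the
# benchmark's two components can be compared in isolation.
creation = timeit.timeit("numpy.empty(2)",
                         setup="import numpy",
                         number=100000)
indexing = timeit.timeit("a[500]",
                         setup="import numpy; "
                               "a = numpy.arange(2000).reshape(1000, 2)",
                         number=100000)

print("creation: %.4fs  indexing: %.4fs" % (creation, indexing))
```

Comparing the two numbers shows how much of each a[i] in the loop is creation overhead rather than actual index arithmetic.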
For numarray, I've got the next results:
samples  %       image name          symbol name
902      7.3238  python              PyEval_EvalFrame
835      6.7798  python              lookdict_string
408      3.3128  python              PyObject_GenericGetAttr
384      3.1179  python              PyDict_GetItem
383      3.1098  libc-2.3.2.so       memcpy
358      2.9068  libpthread-0.10.so  __pthread_alt_unlock
293      2.3790  python              _PyString_Eq
273      2.2166  libnumarray.so      NA_updateStatus
273      2.2166  python              PyType_IsSubtype
271      2.2004  python              countformat
252      2.0461  libc-2.3.2.so       memset
249      2.0218  python              string_hash
248      2.0136  _ndarray.so         _universalIndexing
while for Numeric I've got this:
samples  %        image name          symbol name
279      15.6478  libpthread-0.10.so  __pthread_alt_unlock
216      12.1144  libc-2.3.2.so       memmove
187      10.4879  python              lookdict_string
162       9.0858  python              PyEval_EvalFrame
144       8.0763  libpthread-0.10.so  __pthread_alt_lock
126       7.0667  libpthread-0.10.so  __pthread_alt_trylock
56        3.1408  python              PyDict_SetItem
53        2.9725  libpthread-0.10.so  __GI___pthread_mutex_unlock
45        2.5238  _numpy.so           PyArray_FromDimsAndDataAndDescr
39        2.1873  libc-2.3.2.so       __malloc
36        2.0191  libc-2.3.2.so       __cfree
One preliminary result is that numarray spends a lot more time in Python space than Numeric does, as Todd already said here. The problem is that, as I have not yet patched my kernel, I can't get the call tree, so I can't track down the ultimate culprit.
So, I've tried running the profile module included in the standard library in order to see which the hot spots in Python are:
$ time ~/python.nobackup/Python-2.4/python -m profile -s time create-numarray.py
         1016105 function calls (1016064 primitive calls) in 25.290 CPU seconds
Ordered by: internal time
   ncalls  tottime  percall  cumtime  percall  filename:lineno(function)
        1   19.220   19.220   25.290   25.290  create-numarray.py:1(?)
   999999    5.530    0.000    5.530    0.000  numarraycore.py:514(__del__)
     1753    0.160    0.000    0.160    0.000  :0(eval)
        1    0.060    0.060    0.340    0.340  numarraycore.py:3(?)
        1    0.050    0.050    0.390    0.390  generic.py:8(?)
        1    0.040    0.040    0.490    0.490  numarrayall.py:1(?)
     3455    0.040    0.000    0.040    0.000  :0(len)
        1    0.030    0.030    0.190    0.190  ufunc.py:1504(_makeCUFuncDict)
       51    0.030    0.001    0.070    0.001  ufunc.py:184(_nIOArgs)
     3572    0.030    0.000    0.030    0.000  :0(has_key)
     2582    0.020    0.000    0.020    0.000  :0(append)
     1000    0.020    0.000    0.020    0.000  :0(range)
        1    0.010    0.010    0.010    0.010  generic.py:510(_stridesFromShape)
     42/1    0.010    0.000   25.290   25.290  <string>:1(?)
but, to tell the truth, I can't really see where the time is actually spent. Perhaps somebody with more experience can shed more light on this?
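Lacking a kernel call tree, one way to see "who calls whom" is the standard profiler's caller statistics. A sketch using cProfile and pstats (the modern modules; the 2005-era profile module works similarly), with a toy workload standing in for create-numarray.py:

```python
import cProfile
import pstats
import io

def hot():
    # stand-in for a hot spot such as numarray's __del__
    return sum(range(100))

def workload():
    # stand-in for the create-numarray.py benchmark loop
    for _ in range(1000):
        hot()

prof = cProfile.Profile()
prof.runcall(workload)

# Restrict the report to functions matching "hot" and list their callers,
# which recovers the call-tree information oprofile can't give us here.
out = io.StringIO()
stats = pstats.Stats(prof, stream=out)
stats.sort_stats("cumulative").print_callers("hot")

report = out.getvalue()
assert "workload" in report   # the caller of hot() shows up in the report
```

Applied to the profile above, print_callers("__del__") would show which code paths are responsible for the 999999 destructor calls.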
Another thing I find intriguing has to do with Numeric and the oprofile output. Let me recall it:
samples  %        image name          symbol name
279      15.6478  libpthread-0.10.so  __pthread_alt_unlock
216      12.1144  libc-2.3.2.so       memmove
187      10.4879  python              lookdict_string
162       9.0858  python              PyEval_EvalFrame
144       8.0763  libpthread-0.10.so  __pthread_alt_lock
126       7.0667  libpthread-0.10.so  __pthread_alt_trylock
56        3.1408  python              PyDict_SetItem
53        2.9725  libpthread-0.10.so  __GI___pthread_mutex_unlock
45        2.5238  _numpy.so           PyArray_FromDimsAndDataAndDescr
39        2.1873  libc-2.3.2.so       __malloc
36        2.0191  libc-2.3.2.so       __cfree
We can see that a lot of the time in the Numeric benchmark is spent in libc space (37% or so). However, only about 16% goes to memory-related tasks (memmove, malloc and free), while the rest seems to go to thread issues (??). Again, can anyone explain why the pthread* routines take so much time, or why they appear here at all? Perhaps getting rid of these calls would improve Numeric performance even further.
Cheers,
_______________________________________________
Numpy-discussion mailing list
Numpy-discussion@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/numpy-discussion
--
>qo<  Francesc Altet     http://www.carabos.com/
V  V  Cárabos Coop. V.   Enjoy Data
 ""