Hi Todd,

Nice to see that you have achieved a good speed-up with your optimization patch. With the following code:

    import numarray
    a = numarray.arange(2000)
    a.shape = (1000, 2)
    for j in xrange(1000):
        for i in range(len(a)):
            row = a[i]

and the original numarray 1.1.1, it took 11.254s (Pentium 4 @ 2 GHz). With your patch, this time is reduced to 7.816s. Now, following your suggestion to push NumArray.__del__ down into C, I've got a good speed-up as well: 5.332s. That is more than twice as fast as the unpatched numarray 1.1.1. There is still a long way to go before we can catch Numeric (1.123s), but it is a first step :)

Here is the patch. Please review it, as I'm not very experienced with pure C extensions (I'm just a Pyrex user):

Index: Lib/numarraycore.py
===================================================================
RCS file: /cvsroot/numpy/numarray/Lib/numarraycore.py,v
retrieving revision 1.101
diff -r1.101 numarraycore.py
696,699c696,699
<     def __del__(self):
<         if self._shadows != None:
<             self._shadows._copyFrom(self)
<             self._shadows = None
---
>     def __del__(self):
>         if self._shadows != None:
>             self._shadows._copyFrom(self)
>             self._shadows = None
Index: Src/_numarraymodule.c
===================================================================
RCS file: /cvsroot/numpy/numarray/Src/_numarraymodule.c,v
retrieving revision 1.65
diff -r1.65 _numarraymodule.c
399a400,411
> static void
> _numarray_dealloc(PyObject *self)
> {
>     PyArrayObject *selfa = (PyArrayObject *) self;
>
>     if (selfa->_shadows != NULL) {
>         _copyFrom(selfa->_shadows, self);
>         selfa->_shadows = NULL;
>     }
>
>     self->ob_type->tp_free(self);
> }
421c433
<     0,                          /* tp_dealloc */
---
>     _numarray_dealloc,          /* tp_dealloc */
The profile with the new optimizations now looks like this:

samples  %        image name       symbol name
453       8.6319  python           PyEval_EvalFrame
372       7.0884  python           lookdict_string
349       6.6502  python           string_hash
271       5.1639  libc-2.3.2.so    _wordcopy_bwd_aligned
210       4.0015  libnumarray.so   NA_updateStatus
194       3.6966  python           _PyString_Eq
185       3.5252  libc-2.3.2.so    __GI___strcasecmp
162       3.0869  python           subtype_dealloc
158       3.0107  libc-2.3.2.so    _int_malloc
147       2.8011  libnumarray.so   isBufferWriteable
145       2.7630  python           PyDict_SetItem
135       2.5724  _ndarray.so      _view
131       2.4962  python           PyObject_GenericGetAttr
122       2.3247  python           PyDict_GetItem
100       1.9055  python           PyString_InternInPlace
94        1.7912  libnumarray.so   getReadBufferDataPtr
77        1.4672  _ndarray.so      _simpleIndexingCore

i.e. the time spent in libc and libnumarray is moving up the list, as it should. Now we have to concentrate on other optimization targets. Perhaps it is a good time to try recompiling the kernel and getting the call tree...

Cheers,

On Friday 28 January 2005 12:48, Todd Miller wrote:
I got some insight into what I think is the tall pole in the profile: sub-array creation is implemented using views. The generic indexing code does a view() Python callback because object arrays override view(). Faster view() creation for numerical arrays can be achieved by avoiding the callback, like this:
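Todd's point is that every a[i] in the benchmark creates a sub-array as a view onto the parent's buffer. A minimal sketch of that behavior, using numpy as a modern stand-in for numarray (whose API this thread predates):

```python
import numpy as np  # stand-in for the long-unmaintained numarray

# Build the benchmark's array; .copy() makes `a` own its buffer.
a = np.arange(2000).reshape(1000, 2).copy()

# A partial subscript creates the sub-array as a *view*, not a copy:
row = a[0]
assert row.base is a        # row borrows a's data buffer

# Writing through the view mutates the parent array.
row[0] = -1
assert a[0, 0] == -1
```

In numarray this view creation went through a Python-level view() callback for every index operation, which is exactly the overhead the patch below short-circuits for plain numerical arrays.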
Index: Src/_ndarraymodule.c
===================================================================
RCS file: /cvsroot/numpy/numarray/Src/_ndarraymodule.c,v
retrieving revision 1.75
diff -c -r1.75 _ndarraymodule.c
*** Src/_ndarraymodule.c    14 Jan 2005 14:13:22 -0000    1.75
--- Src/_ndarraymodule.c    28 Jan 2005 11:15:50 -0000
***************
*** 453,460 ****
          }
      } else {  /* partially subscripted --> subarray */
          long i;
!         result = (PyArrayObject *)
!             PyObject_CallMethod((PyObject *) self, "view", NULL);
          if (!result) goto _exit;

          result->nd = result->nstrides = self->nd - nindices;
--- 453,463 ----
          }
      } else {  /* partially subscripted --> subarray */
          long i;
!         if (NA_NumArrayCheck((PyObject *) self))
!             result = _view(self);
!         else
!             result = (PyArrayObject *) PyObject_CallMethod(
!                 (PyObject *) self, "view", NULL);
          if (!result) goto _exit;

          result->nd = result->nstrides = self->nd - nindices;
I committed the patch above to CVS for now. This optimization makes view() "non-overridable" for NumArray subclasses so there is probably a better way of doing this.
One other thing that struck me looking at your profile, and it has been discussed before, is that NumArray.__del__() needs to be pushed (back) down into C. Getting rid of __del__ would also synergize well with making an object freelist, one aspect of which is capturing unneeded objects rather than destroying them.
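The freelist idea can be sketched in a few lines of pure Python; all names below are illustrative, not numarray's actual implementation:

```python
# A minimal sketch of an object freelist: instead of destroying
# short-lived objects, dead instances are captured and recycled,
# avoiding repeated allocation/initialization cost.

class _Pool:
    def __init__(self, factory, maxsize=128):
        self._factory = factory
        self._free = []            # captured, reusable instances
        self._maxsize = maxsize

    def acquire(self):
        # reuse a captured object when possible, else build a new one
        return self._free.pop() if self._free else self._factory()

    def release(self, obj):
        # capture the object instead of letting it be destroyed
        if len(self._free) < self._maxsize:
            self._free.append(obj)

pool = _Pool(factory=list)
a = pool.acquire()
pool.release(a)
b = pool.acquire()
assert b is a      # the "destroyed" object was recycled, not rebuilt
```

CPython itself uses this trick internally (e.g. for small ints and frames); the win is largest exactly when, as in this benchmark, huge numbers of identical small objects are created and dropped in a tight loop.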
Thanks for the profile.
Regards, Todd
On Thu, 2005-01-27 at 21:36 +0100, Francesc Altet wrote:
Hi,
After waiting a while for some free time, I've been playing with the excellent oprofile myself, trying to help reduce numarray creation time.
For that goal, I selected the following small benchmark:
import numarray
a = numarray.arange(2000)
a.shape = (1000, 2)
for j in xrange(1000):
    for i in range(len(a)):
        row = a[i]
I know that it mixes creation with indexing cost, but as the indexing cost of numarray is only a bit slower (perhaps 40%) than Numeric's, while array creation is 5 to 10 times slower, I think this benchmark provides a good starting point to see what's going on.
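As an aside, the two costs can also be separated directly with timeit; a sketch using numpy as a stand-in, since numarray and Numeric are no longer installable on modern systems:

```python
import timeit

# Time raw small-array creation and row indexing separately, so the
# benchmark's two components can be compared in isolation.
creation = timeit.timeit("numpy.empty(2)",
                         setup="import numpy",
                         number=100000)
indexing = timeit.timeit("a[500]",
                         setup="import numpy; "
                               "a = numpy.arange(2000).reshape(1000, 2)",
                         number=100000)

print("creation: %.4fs  indexing: %.4fs" % (creation, indexing))
```

Comparing the two numbers shows how much of each a[i] in the loop is creation overhead rather than actual index arithmetic.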
For numarray, I've got the next results:
samples  %       image name          symbol name
902      7.3238  python              PyEval_EvalFrame
835      6.7798  python              lookdict_string
408      3.3128  python              PyObject_GenericGetAttr
384      3.1179  python              PyDict_GetItem
383      3.1098  libc-2.3.2.so       memcpy
358      2.9068  libpthread-0.10.so  __pthread_alt_unlock
293      2.3790  python              _PyString_Eq
273      2.2166  libnumarray.so      NA_updateStatus
273      2.2166  python              PyType_IsSubtype
271      2.2004  python              countformat
252      2.0461  libc-2.3.2.so       memset
249      2.0218  python              string_hash
248      2.0136  _ndarray.so         _universalIndexing
while for Numeric I've got this:
samples  %        image name          symbol name
279      15.6478  libpthread-0.10.so  __pthread_alt_unlock
216      12.1144  libc-2.3.2.so       memmove
187      10.4879  python              lookdict_string
162       9.0858  python              PyEval_EvalFrame
144       8.0763  libpthread-0.10.so  __pthread_alt_lock
126       7.0667  libpthread-0.10.so  __pthread_alt_trylock
56        3.1408  python              PyDict_SetItem
53        2.9725  libpthread-0.10.so  __GI___pthread_mutex_unlock
45        2.5238  _numpy.so           PyArray_FromDimsAndDataAndDescr
39        2.1873  libc-2.3.2.so       __malloc
36        2.0191  libc-2.3.2.so       __cfree
One preliminary result is that numarray spends a lot more time in Python space than Numeric does, as Todd already said here. The problem is that, as I have not yet patched my kernel, I can't get the call tree, so I can't track down the ultimate culprit.
So, I've tried running the profile module included in the standard library in order to see which the hot spots in Python are:
$ time ~/python.nobackup/Python-2.4/python -m profile -s time create-numarray.py
         1016105 function calls (1016064 primitive calls) in 25.290 CPU seconds
Ordered by: internal time
   ncalls  tottime  percall  cumtime  percall  filename:lineno(function)
        1   19.220   19.220   25.290   25.290  create-numarray.py:1(?)
   999999    5.530    0.000    5.530    0.000  numarraycore.py:514(__del__)
     1753    0.160    0.000    0.160    0.000  :0(eval)
        1    0.060    0.060    0.340    0.340  numarraycore.py:3(?)
        1    0.050    0.050    0.390    0.390  generic.py:8(?)
        1    0.040    0.040    0.490    0.490  numarrayall.py:1(?)
     3455    0.040    0.000    0.040    0.000  :0(len)
        1    0.030    0.030    0.190    0.190  ufunc.py:1504(_makeCUFuncDict)
       51    0.030    0.001    0.070    0.001  ufunc.py:184(_nIOArgs)
     3572    0.030    0.000    0.030    0.000  :0(has_key)
     2582    0.020    0.000    0.020    0.000  :0(append)
     1000    0.020    0.000    0.020    0.000  :0(range)
        1    0.010    0.010    0.010    0.010  generic.py:510(_stridesFromShape)
     42/1    0.010    0.000   25.290   25.290  <string>:1(?)
but, to tell the truth, I can't really see where the time is actually spent. Perhaps somebody with more experience can shed more light on this?
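Lacking a kernel call tree, one way to see "who calls whom" is the standard profiler's caller statistics. A sketch using cProfile and pstats (the modern modules; the 2005-era profile module works similarly), with a toy workload standing in for create-numarray.py:

```python
import cProfile
import pstats
import io

def hot():
    # stand-in for a hot spot such as numarray's __del__
    return sum(range(100))

def workload():
    # stand-in for the create-numarray.py benchmark loop
    for _ in range(1000):
        hot()

prof = cProfile.Profile()
prof.runcall(workload)

# Restrict the report to functions matching "hot" and list their callers,
# which recovers the call-tree information oprofile can't give us here.
out = io.StringIO()
stats = pstats.Stats(prof, stream=out)
stats.sort_stats("cumulative").print_callers("hot")

report = out.getvalue()
assert "workload" in report   # the caller of hot() shows up in the report
```

Applied to the profile above, print_callers("__del__") would show which code paths are responsible for the 999999 destructor calls.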
Another thing I find intriguing has to do with Numeric and the oprofile output. Let me recall it:
samples  %        image name          symbol name
279      15.6478  libpthread-0.10.so  __pthread_alt_unlock
216      12.1144  libc-2.3.2.so       memmove
187      10.4879  python              lookdict_string
162       9.0858  python              PyEval_EvalFrame
144       8.0763  libpthread-0.10.so  __pthread_alt_lock
126       7.0667  libpthread-0.10.so  __pthread_alt_trylock
56        3.1408  python              PyDict_SetItem
53        2.9725  libpthread-0.10.so  __GI___pthread_mutex_unlock
45        2.5238  _numpy.so           PyArray_FromDimsAndDataAndDescr
39        2.1873  libc-2.3.2.so       __malloc
36        2.0191  libc-2.3.2.so       __cfree
We can see that a lot of the time in the Numeric benchmark is spent in libc space (37% or so). However, only about 16% goes to memory-related tasks (memmove, malloc and free), while the rest seems to go to thread issues (??). Again, can anyone explain why the pthread* routines take so much time, or why they appear here at all? Perhaps getting rid of these calls would improve Numeric performance even further.
Cheers,
_______________________________________________
Numpy-discussion mailing list
Numpy-discussion@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/numpy-discussion
--
>qo<  Francesc Altet     http://www.carabos.com/
V  V  Cárabos Coop. V.   Enjoy Data
 ""