[Numpy-discussion] Speeding up Numeric

Fri Jan 28 03:48:05 EST 2005

I got some insight into what I think is the tall pole in the profile:
sub-array creation is implemented using views.  The generic indexing
code calls view() through a Python-level callback because object arrays
override view().  For numerical arrays, view() creation can be made
faster by avoiding the callback, like this:

Index: Src/_ndarraymodule.c
===================================================================
RCS file: /cvsroot/numpy/numarray/Src/_ndarraymodule.c,v
retrieving revision 1.75
diff -c -r1.75 _ndarraymodule.c
*** Src/_ndarraymodule.c        14 Jan 2005 14:13:22 -0000      1.75
--- Src/_ndarraymodule.c        28 Jan 2005 11:15:50 -0000
***************
*** 453,460 ****
                }
        } else {  /* partially subscripted --> subarray */
                long i;
!               result = (PyArrayObject *)
!                       PyObject_CallMethod((PyObject *) self,"view",NULL);
                if (!result) goto _exit;

                result->nd = result->nstrides = self->nd - nindices;
--- 453,463 ----
                }
        } else {  /* partially subscripted --> subarray */
                long i;
!               if (NA_NumArrayCheck((PyObject *)self))
!                       result = _view(self);
!               else
!                       result = (PyArrayObject *) PyObject_CallMethod(
!                               (PyObject *) self,"view",NULL);
                if (!result) goto _exit;

                result->nd = result->nstrides = self->nd - nindices;

I committed the patch above to CVS for now.  This optimization makes
view() "non-overridable" for NumArray subclasses, so there is probably a
better way of doing this.
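
One possible direction, shown here only as a Python-level sketch and not
as the committed C code: take the fast path only when the type has not
overridden view().  The names NDArray, ObjectArray, _fast_view and
make_view are hypothetical stand-ins, not numarray's API; in C the
equivalent test would compare the type's method against the base one.

```python
class NDArray:
    """Hypothetical stand-in for the base array type (not numarray's class)."""

    def __init__(self, data, shape):
        self.data = data
        self.shape = tuple(shape)

    def view(self):
        # Base implementation: a new object sharing the same buffer.
        return _fast_view(self)


def _fast_view(arr):
    """Stand-in for a C-level _view(): build the view without method dispatch."""
    new = object.__new__(type(arr))
    new.data = arr.data      # share the buffer, do not copy it
    new.shape = arr.shape
    return new


def make_view(arr):
    """Take the fast path only when view() has not been overridden."""
    if type(arr).view is NDArray.view:
        return _fast_view(arr)
    return arr.view()        # honour the subclass override


class ObjectArray(NDArray):
    """Hypothetical subclass that customises view()."""

    def view(self):
        v = _fast_view(self)
        v.tag = "overridden"
        return v
```

With this check, subclasses that leave view() alone still skip the
callback, while a subclass that overrides it keeps its own behaviour.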

One other thing that struck me looking at your profile, and it has been
discussed before, is that NumArray.__del__() needs to be pushed (back)
down into C.  Getting rid of __del__ would also synergize well with
making an object freelist, one aspect of which is capturing unneeded
objects rather than destroying them.
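
To illustrate the freelist idea, here is a minimal Python sketch.  It is
an assumption about the general technique, not numarray code: Buffer and
FreeList are hypothetical names, and release() is called explicitly here,
whereas a C implementation would park objects from the deallocator.

```python
class Buffer:
    """Hypothetical object that is expensive to construct."""
    __slots__ = ("data", "shape")

    def __init__(self):
        self.data = None
        self.shape = ()


class FreeList:
    """Keep up to max_size released objects for reuse instead of destroying them."""

    def __init__(self, factory, max_size=64):
        self._factory = factory
        self._max_size = max_size
        self._free = []

    def acquire(self):
        # Reuse a parked object if one is available, else build a new one.
        if self._free:
            return self._free.pop()
        return self._factory()

    def release(self, obj):
        # Reset state so no stale references are kept alive, then park it.
        obj.data = None
        obj.shape = ()
        if len(self._free) < self._max_size:
            self._free.append(obj)
        # Beyond max_size the object is dropped and collected normally.
```

The bounded size keeps the pool from holding on to memory indefinitely
after a burst of allocations.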

Thanks for the profile.

Regards,
Todd

On Thu, 2005-01-27 at 21:36 +0100, Francesc Altet wrote:
> Hi,
> 
> After a while of waiting for some free time, I've been playing with
> the excellent oprofile to try to help reduce numarray's array creation
> time.
> 
> For that goal, I selected the following small benchmark:
> 
> import numarray
> a = numarray.arange(2000)
> a.shape=(1000,2)
> for j in xrange(1000):
>     for i in range(len(a)):
>         row=a[i]
> 
> I know that it mixes creation cost with indexing cost, but since
> indexing in numarray is only a bit slower (perhaps 40%) than in Numeric,
> while array creation is 5 to 10 times slower, I think this
> benchmark provides a good starting point to see what's going on.
> 
> For numarray, I got the following results:
> 
> samples  %        image name               symbol name
> 902       7.3238  python                   PyEval_EvalFrame
> 835       6.7798  python                   lookdict_string
> 408       3.3128  python                   PyObject_GenericGetAttr
> 384       3.1179  python                   PyDict_GetItem
> 383       3.1098  libc-2.3.2.so            memcpy
> 358       2.9068  libpthread-0.10.so       __pthread_alt_unlock
> 293       2.3790  python                   _PyString_Eq
> 273       2.2166  libnumarray.so           NA_updateStatus
> 273       2.2166  python                   PyType_IsSubtype
> 271       2.2004  python                   countformat
> 252       2.0461  libc-2.3.2.so            memset
> 249       2.0218  python                   string_hash
> 248       2.0136  _ndarray.so              _universalIndexing
> 
> while for Numeric I got this:
> 
> samples  %        image name               symbol name
> 279      15.6478  libpthread-0.10.so       __pthread_alt_unlock
> 216      12.1144  libc-2.3.2.so            memmove
> 187      10.4879  python                   lookdict_string
> 162       9.0858  python                   PyEval_EvalFrame
> 144       8.0763  libpthread-0.10.so       __pthread_alt_lock
> 126       7.0667  libpthread-0.10.so       __pthread_alt_trylock
> 56        3.1408  python                   PyDict_SetItem
> 53        2.9725  libpthread-0.10.so       __GI___pthread_mutex_unlock
> 45        2.5238  _numpy.so                PyArray_FromDimsAndDataAndDescr
> 39        2.1873  libc-2.3.2.so            __malloc
> 36        2.0191  libc-2.3.2.so            __cfree
> 
> One preliminary result is that numarray spends a lot more time in
> Python space than Numeric does, as Todd already said here. The problem
> is that, as I have not yet patched my kernel, I can't get the call
> tree, so I can't track down what is ultimately responsible for that.
> 
> So, I tried running the profile module included in the standard
> library in order to see where the hot spots in Python are:
> 
> $ time ~/python.nobackup/Python-2.4/python -m profile -s time create-numarray.py
>          1016105 function calls (1016064 primitive calls) in 25.290 CPU seconds
> 
>    Ordered by: internal time
> 
>    ncalls  tottime  percall  cumtime  percall filename:lineno(function)
>         1   19.220   19.220   25.290   25.290 create-numarray.py:1(?)
>    999999    5.530    0.000    5.530    0.000 numarraycore.py:514(__del__)
>      1753    0.160    0.000    0.160    0.000 :0(eval)
>         1    0.060    0.060    0.340    0.340 numarraycore.py:3(?)
>         1    0.050    0.050    0.390    0.390 generic.py:8(?)
>         1    0.040    0.040    0.490    0.490 numarrayall.py:1(?)
>      3455    0.040    0.000    0.040    0.000 :0(len)
>         1    0.030    0.030    0.190    0.190 ufunc.py:1504(_makeCUFuncDict)
>        51    0.030    0.001    0.070    0.001 ufunc.py:184(_nIOArgs)
>      3572    0.030    0.000    0.030    0.000 :0(has_key)
>      2582    0.020    0.000    0.020    0.000 :0(append)
>      1000    0.020    0.000    0.020    0.000 :0(range)
>         1    0.010    0.010    0.010    0.010 generic.py:510(_stridesFromShape)
>      42/1    0.010    0.000   25.290   25.290 <string>:1(?)
> 
> but, to tell the truth, I can't really see where exactly the time is
> being spent. Perhaps somebody with more experience can shed more light
> on this?
> 
> Another thing that I find intriguing has to do with Numeric and
> the oprofile output. Let me repeat it:
> 
> samples  %        image name               symbol name
> 279      15.6478  libpthread-0.10.so       __pthread_alt_unlock
> 216      12.1144  libc-2.3.2.so            memmove
> 187      10.4879  python                   lookdict_string
> 162       9.0858  python                   PyEval_EvalFrame
> 144       8.0763  libpthread-0.10.so       __pthread_alt_lock
> 126       7.0667  libpthread-0.10.so       __pthread_alt_trylock
> 56        3.1408  python                   PyDict_SetItem
> 53        2.9725  libpthread-0.10.so       __GI___pthread_mutex_unlock
> 45        2.5238  _numpy.so                PyArray_FromDimsAndDataAndDescr
> 39        2.1873  libc-2.3.2.so            __malloc
> 36        2.0191  libc-2.3.2.so            __cfree
> 
> We can see that a lot of the time in the benchmark using Numeric is
> spent in libc space (37% or so). However, only 16% is used in
> memory-related tasks (memmove, malloc and free) while the rest seems
> to be used in thread issues (??). Again, can anyone explain why the
> pthread* routines take so much time, or why they appear here at all?
> Perhaps getting rid of these calls might improve Numeric's
> performance even further.
> 
> Cheers,
>
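
The shares discussed in the quoted message can be tallied directly from
the Numeric oprofile rows listed there.  The following is a small sketch
that assumes only those eleven rows (anything below the 2% cutoff is not
in the listing and so is not counted); share_by_image and
share_of_symbols are hypothetical helper names.

```python
from collections import defaultdict

# Rows copied from the Numeric oprofile listing: (percent, image, symbol).
NUMERIC_PROFILE = [
    (15.6478, "libpthread-0.10.so", "__pthread_alt_unlock"),
    (12.1144, "libc-2.3.2.so", "memmove"),
    (10.4879, "python", "lookdict_string"),
    (9.0858, "python", "PyEval_EvalFrame"),
    (8.0763, "libpthread-0.10.so", "__pthread_alt_lock"),
    (7.0667, "libpthread-0.10.so", "__pthread_alt_trylock"),
    (3.1408, "python", "PyDict_SetItem"),
    (2.9725, "libpthread-0.10.so", "__GI___pthread_mutex_unlock"),
    (2.5238, "_numpy.so", "PyArray_FromDimsAndDataAndDescr"),
    (2.1873, "libc-2.3.2.so", "__malloc"),
    (2.0191, "libc-2.3.2.so", "__cfree"),
]


def share_by_image(rows):
    """Sum the sample percentages per image (shared object or binary)."""
    totals = defaultdict(float)
    for percent, image, _symbol in rows:
        totals[image] += percent
    return dict(totals)


def share_of_symbols(rows, symbols):
    """Sum the sample percentages for a chosen set of symbol names."""
    return sum(percent for percent, _image, sym in rows if sym in symbols)


if __name__ == "__main__":
    by_image = share_by_image(NUMERIC_PROFILE)
    for image, pct in sorted(by_image.items(), key=lambda kv: -kv[1]):
        print(f"{image:22s} {pct:8.4f}%")
    memory = share_of_symbols(NUMERIC_PROFILE, {"memmove", "__malloc", "__cfree"})
    print(f"memory-related symbols {memory:8.4f}%")
```

Grouping by image makes it easy to compare the libpthread share with the
memory-related share without adding the rows by hand.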