[Numpy-discussion] Profiling numpy ? (parts written in C)

Francesc Altet faltet at carabos.com
Wed Dec 20 07:08:34 EST 2006


On Wednesday 20 December 2006 03:36, David Cournapeau wrote:
> Francesc Altet wrote:
> > On Tuesday 19 December 2006 08:12, David Cournapeau wrote:
> >> Hi,
> >>
> >>    Following the discussion on clip and other functions which *may* be
> >> slow in numpy, I would like to know if there is a way to easily
> >> profile numpy, i.e. the functions which are written in C.
> >>    For example, I am not sure I understand why a function like
> >> take(a, b), with a a double 256x4 array and b an 8000x256 int array,
> >> takes almost 200 ms on a fairly fast CPU; in the source code, I can
> >> see that numpy uses memmove, and I know memmove to be slower than
> >> memcpy. Is there an easy way to check that this is coming from
> >> memmove (in which case nothing much can be done to improve the
> >> situation, I guess), and not from something else?
>
> Concerning memmove vs. memcpy: the problem I was speaking about is in
> another function (numpy's take), where the speed impact is much bigger.

Oops. Yes, you are right. Out of curiosity, I've looked into this as
well and created a small script to benchmark take(), which is listed
at the end of this message.

I've used a = double 256x4 and b = int 80x256 (my b is quite a bit
smaller than yours, mainly because of the time it takes to generate
with my current approach; by the way, if anybody knows an easy way to
do a cartesian product with numpy, please tell me; a rough idea is
sketched below), but I don't think this will influence the timings
much.
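
For the record, here is one numpy-only way that occurs to me. It is
untested for speed, and cartesian2 is just an illustrative name, not
an existing numpy function:

import numpy

def cartesian2(x, y):
    # cartesian product of two 1-d arrays, as a (len(x)*len(y), 2) array
    nx, ny = len(x), len(y)
    out = numpy.empty((nx * ny, 2), dtype=x.dtype)
    out[:, 0] = numpy.repeat(x, ny)        # each x value repeated ny times
    out[:, 1] = numpy.resize(y, nx * ny)   # y repeated cyclically nx times
    return out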

With it, here is the cProfile output for take(a, b) (1000 iterations):

         2862 function calls in 6.907 CPU seconds

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     1000    6.679    0.007    6.679    0.007 {method 'take' of 'numpy.ndarray' objects}
        1    0.138    0.138    0.138    0.138 {numpy.core.multiarray.array}
        1    0.059    0.059    6.907    6.907 prova.py:30(bench_take)

and here is the output from oprofile for the same benchmark:

Profiling through timer interrupt
samples  %        image name               symbol name
1360     83.0281  libc-2.3.6.so            memmove
167      10.1954  multiarray.so            PyArray_TakeFrom
11        0.6716  multiarray.so            .plt

and, if we replace memmove by memcpy:

Profiling through timer interrupt
samples  %        image name               symbol name
1307     82.0980  libc-2.3.6.so            memcpy
178      11.1809  multiarray.so            PyArray_TakeFrom
13        0.8166  multiarray.so            .plt


So, again, it seems like we have the same pattern as with the .clip()
method: the bottleneck is calling memmove (and the same would go for
memcpy) for every element in the array a that has to be taken.

A workaround for speeding this up (as suggested by Travis in his
excellent book) is to replace:

c = take(a, b)

by

c = a.flat[b]
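
As a sanity check that the two spellings agree (using plain flat
indices here, instead of the (i, j) pairs that my script generates):

import numpy
a = numpy.random.randn(256, 4)
b = numpy.random.randint(0, 256 * 4, (80, 256))  # flat indices into a
# take() without an axis works on the flattened array, just like a.flat
assert numpy.alltrue(numpy.take(a, b) == a.flat[b])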

Here are the cProfile results for the version using fancy indexing:

         862 function calls in 3.455 CPU seconds

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    3.300    3.300    3.455    3.455 prova.py:30(bench_take)
        1    0.132    0.132    0.132    0.132 {numpy.core.multiarray.array}
      257    0.018    0.000    0.018    0.000 {map}

which is 2x faster than the take approach.
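
(For a quick check outside the profiler, the same comparison can be
done with timeit; shapes as above, and the absolute numbers will of
course depend on the machine:)

import timeit

setup = '''
import numpy
a = numpy.random.randn(256, 4)
b = numpy.random.randint(0, 256 * 4, (80, 256))
'''
print timeit.Timer('numpy.take(a, b)', setup).timeit(1000)
print timeit.Timer('a.flat[b]', setup).timeit(1000)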

The oprofile output for the fancy indexing approach:

samples  %        image name               symbol name
426      53.2500  multiarray.so            iter_subscript
277      34.6250  multiarray.so            DOUBLE_copyswap
9         1.1250  python2.5                PyString_FromFormatV

seems to tell us that memmove/memcpy is not called at all, and that
the DOUBLE_copyswap function is used instead. However, this is only an
appearance, because if we look at the code of DOUBLE_copyswap (found
in arraytypes.inc.src):

@fname@_copyswap (void *dst, void *src, int swap, void *arr)
{

        if (src != NULL) /* copy first if needed */
                memcpy(dst, src, sizeof(@type@));

[where the numpy code generator replaces @fname@ with DOUBLE and
@type@ with double]

we see that memcpy is called under the hood. (I don't know for certain
why oprofile no longer shows this call; most likely the compiler
inlines the fixed-size memcpy, so it never reaches the libc symbol at
all.)

After looking at the function, and remembering what Charles Harris
said in a previous message about the convenience of using a simple
type-specific assignment, I've ended up replacing the memcpy. Here is
the patch:

--- numpy/core/src/arraytypes.inc.src   (revision 3487)
+++ numpy/core/src/arraytypes.inc.src   (working copy)
@@ -997,11 +997,11 @@
 }

 static void
-@fname@_copyswap (void *dst, void *src, int swap, void *arr)
+@fname@_copyswap (@type@ *dst, @type@ *src, int swap, void *arr)
 {

         if (src != NULL) /* copy first if needed */
-                memcpy(dst, src, sizeof(@type@));
+                *dst = *src;

         if (swap) {
                 register char *a, *b, c;

and after this, timings seem to improve a bit. With cProfile:

         862 function calls in 3.251 CPU seconds

   Ordered by: internal time, call count

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    3.092    3.092    3.251    3.251 prova.py:31(bench_take)
        1    0.135    0.135    0.135    0.135 {numpy.core.multiarray.array}
      257    0.018    0.000    0.018    0.000 {map}

which is around 6% faster. With oprofile:

samples  %        image name               symbol name
525      64.7349  multiarray.so            iter_subscript
186      22.9346  multiarray.so            DOUBLE_copyswap
8         0.9864  python2.5                PyString_FromFormatV

so DOUBLE_copyswap now seems around 50% faster (186 samples vs. 277)
thanks to the type-specific assignment trick.

It seems to me that the above patch is safe, and besides, the complete
numpy test suite passes (in fact, it runs around 6% faster), so
perhaps it would be nice to apply it. Along the same lines, it would
be good to do an overhaul of the NumPy code to discover other places
where this trick can be applied.

There is still a long way to go before we have an optimal replacement
for the take() method (can iter_subscript be optimised further?), but
this may help a bit.

Cheers,

--------------------------------------------------------------------------
import numpy

niter = 1000
d0, d1 = (256, 4)
#d0, d1 = (6, 4)

def generate_data_2d(d0, d1):
    return numpy.random.randn(d0, d1)

def cartesian(Lists):
    # cartesian product of a list of lists, returned as a list of tuples
    import operator
    if Lists:
        result = map(lambda I: (I,), Lists[0])
        for lst in Lists[1:]:
            curr = []
            for item in lst:
                # append `item` to every tuple accumulated so far
                curr.extend(map(operator.add, result, [(item,)] * len(result)))
            result = curr
    else:
        result = []

    return result

def generate_data_indices(d0, d1, N1, N2):
    iaxis = numpy.random.randint(0, d0, N1)
    jaxis = numpy.random.randint(0, d1, N2)
    return numpy.array(cartesian([iaxis.tolist(), jaxis.tolist()]))

def bench_take():
    a = generate_data_2d(d0, d1)
    b = generate_data_indices(d0, d1, 80, 256)
    #b = generate_data_indices(d0, d1, 8, 2)

    for i in range(niter):
        #c = numpy.take(a, b)   # toggle to benchmark numpy.take instead
        c = a.flat[b]           # equivalent version using fancy indexing

if __name__ == '__main__':
    # test take
    import pstats
    import cProfile as prof

    profile_wanted = False
    if not profile_wanted:
        bench_take()
    else:
        profile_file = 'take.prof'
        prof.run('bench_take()', profile_file)
        stats = pstats.Stats(profile_file)
        stats.strip_dirs()
        stats.sort_stats('time', 'calls')
        stats.print_stats()

--------------------------------------------------------------------------

-- 
>0,0<   Francesc Altet     http://www.carabos.com/
V   V   Cárabos Coop. V.   Enjoy Data
 "-"



