[Numpy-discussion] numpy.concatenate slower than slice copying

Zbyszek Szmek zbyszek at in.waw.pl
Wed Aug 18 17:31:56 EDT 2010


On Wed, Aug 18, 2010 at 02:06:51AM +0100, Francesc Alted wrote:
> Hey Zbyszek,
> 
> 2010/8/17, Zbyszek Szmek <zbyszek at in.waw.pl>:
> > Hi,
> > this is a problem which came up when trying to replace a hand-written
> > array concatenation with a call to numpy.vstack:
> > for some array sizes,
> >
> >    numpy.vstack(data)
> >
> > runs > 20% longer than a loop like
> >
> >    alldata = numpy.empty((tlen, dim))
> >    pos = 0
> >    for x in data:
> >         step = x.shape[0]
> >         alldata[pos:pos+step] = x
> >         pos += step
> >
> > (example script attached)
> [clip]
> 
> I was curious about what is happening here, so after some profiling with
> cachegrind, I've come to the conclusion that `numpy.concatenate` is
> using the `memcpy` library function to copy data from the sources to
> the recipient.  On the other hand, your `concat` function is making use
> of the `__setitem__` method of ndarray, which does not use `memcpy`
> (probably because it has to deal with strides).

Hey Francesc,

thank you for your detailed answer. It seems that memcpy, which should always
be at least as fast as memmove, is sometimes slower! What happens is that
the slice assignment calls memmove(), which calls _wordcopy_fwd_aligned() [1],
which is apparently faster than memcpy() [2].

[1] http://www.eglibc.org/cgi-bin/viewcvs.cgi/trunk/libc/string/wordcopy.c?rev=77&view=auto
[2] http://www.eglibc.org/cgi-bin/viewcvs.cgi/trunk/libc/sysdeps/x86_64/memcpy.S?rev=11186&view=markup

I guess that you're not seeing the difference because I'm using an
amd64-specific memcpy written in assembly, and you're using an i586
implementation.  I've tried to reproduce the problem in a C program,
but there memcpy is always much faster than memmove, as it should be.
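
The two libc routines can also be timed directly from Python through ctypes,
without writing any C. A rough sketch (the buffer size is an arbitrary choice,
and the numbers will of course depend on the machine and glibc version):

    import ctypes
    import ctypes.util
    import time
    import numpy

    # load libc and grab the two routines we want to compare
    libc = ctypes.CDLL(ctypes.util.find_library("c"))

    nbytes = 10**8
    src = numpy.zeros(nbytes, dtype=numpy.uint8)
    dst = numpy.empty_like(src)
    src_p = src.ctypes.data_as(ctypes.c_void_p)
    dst_p = dst.ctypes.data_as(ctypes.c_void_p)

    for name in ("memcpy", "memmove"):
        func = getattr(libc, name)
        t0 = time.time()
        func(dst_p, src_p, ctypes.c_size_t(nbytes))
        print("%s: %.3fs" % (name, time.time() - t0))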

I've verified that the difference between memcpy and memmove is the
problem by patching PyArray_Concatenate to always use memmove:
diff --git a/numpy/core/src/multiarray/multiarraymodule.c b/numpy/core/src/multiarray/multiarraymodule.c
index de63f33..e7f8643 100644
--- a/numpy/core/src/multiarray/multiarraymodule.c
+++ b/numpy/core/src/multiarray/multiarraymodule.c
@@ -437,7 +437,7 @@ PyArray_Concatenate(PyObject *op, int axis)
     data = ret->data;
     for (i = 0; i < n; i++) {
         numbytes = PyArray_NBYTES(mps[i]);
-        memcpy(data, mps[i]->data, numbytes);
+        memmove(data, mps[i]->data, numbytes);
         data += numbytes;
     }

which gives the same speedup as using the slice assignment:
zbyszek@ameba ~/mdp/tmp % python2.6 del_cum3.py numpy 10000 1000 10 10
problem size: (10000x1000) x 10 = 10^8
0.814s  <----- without the patch

zbyszek@ameba ~/mdp/tmp % PYTHONPATH=/var/tmp/install/lib/python2.6/site-packages python2.6 del_cum3.py numpy 10000 1000 10 10
problem size: (10000x1000) x 10 = 10^8
0.637s  <----- with the stupid patch

> Now, it turns out that `memcpy` may not be optimal for every platform,
> and a direct fetch-and-assign approach can sometimes be faster.  My
> guess is that this is what is happening in your case.  On my machine,
> running the latest Ubuntu Linux, I'm not seeing this difference though:
> 
> faltet@ubuntu:~/carray$ python bench/concat.py numpy 1000 1000 10 3
> problem size: (1000x1000) x 10 = 10^7
> 0.247s
> faltet@ubuntu:~/carray$ python bench/concat.py concat 1000 1000 10 3
> problem size: (1000x1000) x 10 = 10^7
> 0.246s
> 
> and neither when running Windows (XP):
> 
> C:\tmp>python del_cum3.py numpy 10000 1000 1 10
> problem size: (10000x1000) x 1 = 10^7
> 0.227s
> 
> C:\tmp>python del_cum3.py concat 10000 1000 1 10
> problem size: (10000x1000) x 1 = 10^7
> 0.223s
Probably the architecture (and thus the glibc implementation) is more
important than the operating system. But the problem is very much
dependent on the size of the arrays, and so probably on alignment and
other details.

N.B. with the parameters above I get 0.081s vs. 0.066s.
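
For completeness, the core of the comparison is essentially the following
(just a sketch, not the attached del_cum3.py; the sizes correspond to the
10^7 case above):

    import time
    import numpy

    nrows, dim, nchunks = 1000, 1000, 10
    data = [numpy.random.rand(nrows, dim) for i in range(nchunks)]
    tlen = nrows * nchunks

    def concat(data):
        # hand-written concatenation using slice assignment
        alldata = numpy.empty((tlen, dim))
        pos = 0
        for x in data:
            step = x.shape[0]
            alldata[pos:pos+step] = x
            pos += step
        return alldata

    for name, func in (("numpy", numpy.vstack), ("concat", concat)):
        t0 = time.time()
        func(data)
        print("%s: %.3fs" % (name, time.time() - t0))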



> Now the new method (carray) with compression level 1 (note the new
> parameter at the end of the command line):
> 
> faltet@ubuntu:~/carray$ PYTHONPATH=. python bench/concat.py carray 1000000 10 3 1
> problem size: (1000000) x 10 = 10^7
> time for concat: 0.186s
> size of the final container: 5.076 MB

This looks very interesting! Do you think it would be possible to
automatically 'guess' whether such compression makes sense and just use it
behind the scenes, compressing transparently on write? I'll try to do some
benchmarking tomorrow...
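
I was thinking of something very naive, e.g. compressing a small sample of
the raw bytes and checking the ratio before deciding (purely hypothetical,
nothing that carray actually does; the sample size and the threshold below
are arbitrary):

    import zlib
    import numpy

    def compression_looks_useful(arr, sample_bytes=1 << 16, threshold=0.8):
        # compress a small sample of the raw data with zlib level 1 (the same
        # level as in the carray run above) and check whether the ratio beats
        # the (arbitrary) threshold
        sample = arr.ravel()[:sample_bytes // arr.itemsize].tostring()
        return len(zlib.compress(sample, 1)) < threshold * len(sample)

    a = numpy.arange(10**6, dtype=numpy.float64)  # regular data, compresses well
    b = numpy.random.rand(10**6)                  # random mantissas, hardly compresses
    print("%s %s" % (compression_looks_useful(a), compression_looks_useful(b)))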

Thanks,
Zbyszek


