[Numpy-discussion] numpy.concatenate slower than slice copying

Francesc Alted faltet at pytables.org
Tue Aug 17 21:06:51 EDT 2010


Hey Zbyszek,

2010/8/17, Zbyszek Szmek <zbyszek at in.waw.pl>:
> Hi,
> this is a problem which came up when trying to replace a hand-written
> array concatenation with a call to numpy.vstack:
> for some array sizes,
>
>    numpy.vstack(data)
>
> runs > 20% longer than a loop like
>
>    alldata = numpy.empty((tlen, dim))
>    pos = 0
>    for x in data:
>         step = x.shape[0]
>         alldata[pos:pos+step] = x
>         pos += step
>
> (example script attached)
[clip]
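
For the record, the comparison boils down to something like the sketch
below (my own quick reconstruction, not your attached script; the sizes
and the timing harness are just placeholders):

import time
import numpy

def build_data(nrows, dim, nchunks):
    # `nchunks` identical blocks of shape (nrows, dim); a repeated
    # arange makes the data very compressible, which matters later on.
    block = numpy.arange(nrows * dim, dtype='f8').reshape(nrows, dim)
    return [block.copy() for _ in range(nchunks)]

def concat_numpy(data):
    # One call; numpy copies every source block into a fresh array.
    return numpy.concatenate(data)

def concat_slices(data):
    # Preallocate the destination and copy block by block via slicing
    # (i.e. ndarray.__setitem__).
    tlen = sum(x.shape[0] for x in data)
    alldata = numpy.empty((tlen, data[0].shape[1]))
    pos = 0
    for x in data:
        step = x.shape[0]
        alldata[pos:pos+step] = x
        pos += step
    return alldata

if __name__ == '__main__':
    data = build_data(1000, 1000, 10)
    for name, func in [('numpy', concat_numpy), ('concat', concat_slices)]:
        t0 = time.time()
        func(data)
        print('%s: %.3fs' % (name, time.time() - t0))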

I was curious about what is happening here, so after some profiling
with cachegrind, I've come to the conclusion that `numpy.concatenate`
is using the `memcpy` C library call to copy the data from the sources
to the recipient.  On the other hand, your `concat` function makes use
of the `__setitem__` method of ndarray, which does not use `memcpy`
(probably because it has to deal with strides).

Now, it turns out that `memcpy` may not be optimal on every platform,
and a direct fetch-and-assign approach can sometimes be faster.  My
guess is that this is what is happening in your case.  On my machine,
running the latest Ubuntu Linux, I'm not seeing this difference though:

faltet at ubuntu:~/carray$ python bench/concat.py numpy 1000 1000 10 3
problem size: (1000x1000) x 10 = 10^7
0.247s
faltet at ubuntu:~/carray$ python bench/concat.py concat 1000 1000 10 3
problem size: (1000x1000) x 10 = 10^7
0.246s

and I don't see it when running Windows (XP) either:

C:\tmp>python del_cum3.py numpy 10000 1000 1 10
problem size: (10000x1000) x 1 = 10^7
0.227s

C:\tmp>python del_cum3.py concat 10000 1000 1 10
problem size: (10000x1000) x 1 = 10^7
0.223s

Coincidentally, I've lately been working on a proof of concept for an
array container that can hold its data in memory in a compressed state
(using the high-performance Blosc compressor under the hood).  This
object (I'm calling it ``carray`` for the time being) also supports
`append`-ing additional data, so it can be used in this concatenation
use case.
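
In case it helps to picture it, using it for this kind of concatenation
looks roughly like the sketch below.  Bear in mind that this is written
against my current development tree, so the exact constructor arguments
(e.g. the `clevel` keyword for the compression level) and the slicing
behaviour are assumptions that may well change before any release:

import numpy
import carray as ca   # the pre-alpha package linked to below

# The same kind of compressible blocks as in the benchmark.
blocks = [numpy.arange(1000000, dtype='f8') for _ in range(10)]

# Start from the first block and keep appending the rest; the data
# stays compressed in memory the whole time (clevel=0 would disable
# compression, as in the last runs shown below).
c = ca.carray(blocks[0], clevel=1)
for block in blocks[1:]:
    c.append(block)

print(len(c))    # total number of elements
print(c[:10])    # reading back decompresses into a regular numpy array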

So, I've set up a new benchmark based on your script (I called it
concat.py) and tried it out with your problem.  Here are the results
for my netbook, which sports a humble Intel Atom processor.  First, the
figures for the original `numpy.concatenate` and `concat` styles:

faltet at ubuntu:~/carray$ PYTHONPATH=. python bench/concat.py numpy 1000000 10 3
problem size: (1000000) x 10 = 10^7
time for concat: 0.228s
size of the final container: 76.294 MB

faltet at ubuntu:~/carray$ PYTHONPATH=. python bench/concat.py concat 1000000 10 3
problem size: (1000000) x 10 = 10^7
time for concat: 0.230s
size of the final container: 76.294 MB

Now the new method (carray) with compression level 1 (note the new
parameter at the end of the command line):

faltet at ubuntu:~/carray$ PYTHONPATH=. python bench/concat.py carray
1000000 10 3 1
problem size: (1000000) x 10 = 10^7
time for concat: 0.186s
size of the final container: 5.076 MB

which is more than 20% faster than `numpy.concatenate` or your
`concat` method, while the space taken in memory is significantly
lower (5.1 MB vs 76.3 MB; of course, I've chosen a very compressible
dataset for this example ;-).

Even if you tell Blosc not to use compression (Blosc level 0), I can
still see a win here:

faltet at ubuntu:~/carray$ PYTHONPATH=. python bench/concat.py carray
1000000 10 3 0
problem size: (1000000) x 10 = 10^7
time for concat: 0.200s
size of the final container: 77.001 MB

which is 15% faster than the initial cases.  However, note how the
space grows from the original 76.3 MB to 77.0 MB.  This is because
carray has to keep an internal buffer to accelerate the appending of
small arrays, and this buffer is mainly responsible for the space
overhead.
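
If it helps to see what I mean by that buffer, the general pattern is
something like the toy class below.  This is only a pure-NumPy
illustration of the idea, not carray's actual code (which, among other
things, hands the full chunks over to Blosc for compression):

import numpy

class BufferedAppend(object):
    """Toy container that accumulates small appends in a fixed-size
    buffer and flushes it in whole chunks, instead of growing (or
    recompressing) the container on every append."""

    def __init__(self, dtype, chunklen=16*1024):
        self.buffer = numpy.empty(chunklen, dtype=dtype)
        self.nbuffered = 0     # valid elements currently in the buffer
        self.chunks = []       # in carray these would be compressed

    def append(self, arr):
        arr = numpy.asarray(arr, dtype=self.buffer.dtype)
        while len(arr):
            room = len(self.buffer) - self.nbuffered
            n = min(room, len(arr))
            self.buffer[self.nbuffered:self.nbuffered+n] = arr[:n]
            self.nbuffered += n
            arr = arr[n:]
            if self.nbuffered == len(self.buffer):
                # Buffer full: flush it as a new chunk and start over.
                self.chunks.append(self.buffer.copy())
                self.nbuffered = 0

    def toarray(self):
        return numpy.concatenate(self.chunks + [self.buffer[:self.nbuffered]])

The buffer is always allocated in full even when only partially used,
which is essentially where the extra space above comes from.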

Finally, it is interesting to see the effect of forcing the use of a
single thread instead of two (the Atom supports hyper-threading):

faltet at ubuntu:~/carray$ PYTHONPATH=. python bench/concat.py carray
1000000 10 3 0
problem size: (1000000) x 10 = 10^7
time for concat: 0.210s
size of the final container: 77.001 MB

which is still 10% faster than plain `numpy.concatenate` (remember,
based on `memcpy`).  Why carray/Blosc is faster in this case is rather
a mystery to me, but the effect is there.

I have not yet released carray publicly, but in case you want to play
with it, I've uploaded my current git repository to:

http://www.pytables.org/download/preliminary/carray-0.1.dev.tar.gz

Of course, carray is still pre-alpha: it does not support
multidimensional arrays, and you cannot modify its contents (other
than appending new data).  But still, it can be a lot of fun.

Cheers!

-- 
Francesc Alted


