[Numpy-discussion] Alternative to record array

Francesc Alted faltet at pytables.org
Tue Dec 30 10:34:27 EST 2008


On Tuesday 30 December 2008, Francesc Alted wrote:
> On Monday 29 December 2008, Jean-Baptiste Rudant wrote:
[snip]
>
> The difference between the two approaches is that the row-wise
> arrangement is more efficient when data is iterated by row, while the
> column-wise one is more efficient when data is iterated by column.
> This is why you are seeing the 4x increase in performance.
> Incidentally, by looking at both data arrangements, I'd expect an
> increase of just 2x (the stride count is 2 in this case), but I
> suspect that there are hidden copies during the increment operation
> for the record array case.

As I was mystified about this difference in speed, I kept investigating
and I think I have an explanation for the gap between the expected and
the observed speed-up of the in-place increment operator (+=) on a
recarray field.  After looking at the numpy code, it turns out that the
following statement:

data.ages += 1

is more or less equivalent to:

a = data.ages
a[:] = a + 1

i.e. a temporary is created (to hold the result of 'a + 1') and then
assigned to the 'ages' column.  Since memory copies are the bottleneck
in this sort of operation, the creation of the temporary introduces a
slowdown of 2x (due to the strided column) and the assignment accounts
for the additional 2x (4x in total).
However, the following idiom:

a = data.ages
a += 1

effectively removes the need for the temporary copy and is 2x faster
than the original "data.ages += 1".  This can be seen in the following
simple benchmark:

---------------------------
import numpy, timeit

# An integer size: recent NumPy versions reject a float such as 10e6
count = 10**7
ages  = numpy.random.randint(0,100,count)
weights = numpy.random.randint(1,200,count)
data = numpy.rec.fromarrays((ages,weights),names='ages,weights')

# v0: in-place add through the recarray attribute
timer = timeit.Timer('data.ages += 1','from __main__ import data')
print("v0-->", timer.timeit(number=10))
# v1: explicit temporary, then assignment back to the column
timer = timeit.Timer('a=data.ages; a[:] = a + 1','from __main__ import data')
print("v1-->", timer.timeit(number=10))
# v2: in-place add on the column view (no temporary)
timer = timeit.Timer('a=data.ages; a += 1','from __main__ import data')
print("v2-->", timer.timeit(number=10))
# v3: in-place add on the plain, contiguous array
timer = timeit.Timer('ages += 1','from __main__ import ages')
print("v3-->", timer.timeit(number=10))
---------------------------

which produces the following output on my laptop:

v0--> 2.98340201378
v1--> 3.22748112679
v2--> 1.5474319458
v3--> 0.809724807739
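
For anyone wanting to check the memory layout behind the v2 and v3
figures, here is a minimal sketch (not part of the benchmark above).
It assumes that both columns share the same integer dtype, so that each
record is exactly two items wide, and it uses a small array because
only the layout matters here, not the timings:

```python
import numpy as np

count = 1000  # small on purpose: we only inspect the layout
ages = np.random.randint(0, 100, count)
weights = np.random.randint(1, 200, count)
data = np.rec.fromarrays((ages, weights), names='ages,weights')

a = data.ages                    # a view on the 'ages' field, not a copy
field_is_view = np.shares_memory(a, data)
field_stride = a.strides[0]      # bytes between consecutive 'ages' values
plain_stride = ages.strides[0]   # bytes between values in the plain array

before = a.copy()
a += 1                           # in-place add on the strided view
updated_in_place = bool(np.array_equal(data.ages, before + 1))

print(field_is_view, field_stride, plain_stride, updated_in_place)
```

The field stride should come out as twice the stride of the plain
array, which is the "stride count of 2" mentioned above, and the
increment on the view should be visible through data.ages.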

As a final comment, I suppose that in-place operators (+=, -=...) could
be optimized in the context of recarray columns in numpy, but I don't
think it is worth the effort:  when really high performance is needed
for operating with columns in the context of recarrays, a column-wise
approach is best.
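
As a rough illustration of what a column-wise approach could look like,
here is a minimal sketch.  The 'columns' dictionary is just a
hypothetical container chosen for this example (it is not a numpy
facility): each column lives in its own contiguous array, and a record
array is built only when row-wise access is really needed:

```python
import numpy as np

count = 1000
# Hypothetical column-wise container: one contiguous array per column
columns = {
    'ages': np.random.randint(0, 100, count),
    'weights': np.random.randint(1, 200, count),
}

expected = columns['ages'] + 1
columns['ages'] += 1   # in-place add on contiguous memory (the v3 case)
ages_contiguous = bool(columns['ages'].flags['C_CONTIGUOUS'])

# Build a record array (this copies the data) only for row-wise access
data = np.rec.fromarrays((columns['ages'], columns['weights']),
                         names='ages,weights')
first_row = data[0]
print(ages_contiguous, first_row)
```

The price is that the row-wise view has to be materialized with a copy,
so this only pays off when most of the work is done column by column.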

Cheers,

-- 
Francesc Alted


