[Numpy-discussion] Alternative to record array
Francesc Alted
faltet at pytables.org
Tue Dec 30 10:34:27 EST 2008
On Tuesday 30 December 2008, Francesc Alted wrote:
> On Monday 29 December 2008, Jean-Baptiste Rudant wrote:
[snip]
>
> The difference between the two approaches is that the row-wise
> arrangement is more efficient when data is iterated by row, while the
> column-wise one is more efficient when data is iterated by column.
> This is why you are seeing the increase of 4x in performance
> --incidentally, by looking at both data arrangements, I'd expect an
> increase of just 2x (the stride count is 2 in this case), but I
> suspect that there are hidden copies during the increment operation
> for the record array case.
As I was mystified by this difference in speed, I kept investigating
and I think I have an explanation for the gap between the expected and
the observed speed-up of the in-place increment (+=) on a recarray
field. After looking at the numpy code, it turns out that the following
statement:
data.ages += 1
is more or less equivalent to:
a = data.ages
a[:] = a + 1
i.e. a temporary is created (to hold the result of 'a + 1') and then
assigned to the 'ages' column. Since memory copies are the bottleneck
in this sort of operation, creating the temporary introduces a slowdown
of 2x (due to the strided column) and the assignment accounts for the
additional 2x (4x in total).
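The view-versus-temporary distinction described above can be checked on
a tiny record array. This is only a sketch: the sample values and the
names 'tmp', 'view_shares' and 'tmp_shares' are illustrative, and it
shows the explicit 'a[:] = a + 1' form rather than what numpy does
internally for 'data.ages += 1'.

```python
import numpy as np

# Small illustrative columns; dtype is fixed so the layout is predictable.
ages = np.array([30, 41, 25], dtype=np.int64)
weights = np.array([70, 82, 65], dtype=np.int64)
data = np.rec.fromarrays((ages, weights), names='ages,weights')

a = data.ages   # a strided view into the record buffer, no copy
tmp = a + 1     # a freshly allocated array holding the result

view_shares = np.shares_memory(a, data)    # the view aliases the records
tmp_shares = np.shares_memory(tmp, data)   # the temporary does not

a[:] = tmp      # second pass: copy the temporary back into the column
result = data.ages.tolist()
```

The two passes over memory (building 'tmp', then copying it back) are
the two copies counted in the paragraph above.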
However, the following idiom:
a = data.ages
a += 1
effectively removes the need for the temporary copy and is 2x faster
than the original "data.ages += 1". This can be seen in the following
simple benchmark:
---------------------------
import timeit
import numpy

count = int(10e6)  # randint needs an integer size
ages = numpy.random.randint(0, 100, count)
weights = numpy.random.randint(1, 200, count)
data = numpy.rec.fromarrays((ages, weights), names='ages,weights')

# v0: in-place increment directly on the recarray field
timer = timeit.Timer('data.ages += 1', 'from __main__ import data')
print("v0-->", timer.timeit(number=10))
# v1: explicit temporary followed by assignment to the column
timer = timeit.Timer('a=data.ages; a[:] = a + 1', 'from __main__ import data')
print("v1-->", timer.timeit(number=10))
# v2: in-place increment on a view of the column
timer = timeit.Timer('a=data.ages; a += 1', 'from __main__ import data')
print("v2-->", timer.timeit(number=10))
# v3: in-place increment on the contiguous (column-wise) array
timer = timeit.Timer('ages += 1', 'from __main__ import ages')
print("v3-->", timer.timeit(number=10))
---------------------------
which produces the following output on my laptop:
v0--> 2.98340201378
v1--> 3.22748112679
v2--> 1.5474319458
v3--> 0.809724807739
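To relate these timings to the 4x and 2x factors discussed above, the
ratios can be computed directly from the reported numbers. The
dictionary and variable names below are just for illustration; the
values are the ones printed above.

```python
# Timings (seconds for 10 runs) copied from the benchmark output above.
timings = {
    "v0": 2.98340201378,   # data.ages += 1
    "v1": 3.22748112679,   # a = data.ages; a[:] = a + 1
    "v2": 1.5474319458,    # a = data.ages; a += 1
    "v3": 0.809724807739,  # ages += 1 on the contiguous array
}

# Slowdown of each variant relative to the contiguous column (v3).
slowdown = {name: t / timings["v3"] for name, t in timings.items()}

# Gain from incrementing through a view instead of the attribute.
view_gain = timings["v0"] / timings["v2"]

for name in sorted(slowdown):
    print(name, round(slowdown[name], 2))
print("v0/v2", round(view_gain, 2))
```

The recarray variants v0 and v1 come out between 3.5x and 4x slower
than the contiguous column, and v2 just under 2x slower.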
As a final comment, I suppose that in-place operators (+=, -=...) could
be optimized for recarray columns in numpy, but I don't think it is
worth the effort: when really high performance is needed for operating
on columns, a column-wise approach is better than a recarray.
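The layout difference behind this recommendation can be inspected
through the strides of each column. A minimal sketch, assuming two
64-bit integer fields as in the benchmark (the names 'row_stride' and
'col_stride' are illustrative):

```python
import numpy as np

n = 5
ages = np.arange(n, dtype=np.int64)
weights = np.arange(n, dtype=np.int64)

# Row-wise layout: each record stores (ages, weights) side by side.
data = np.rec.fromarrays((ages, weights), names='ages,weights')

# Bytes to step from one 'ages' value to the next in each layout.
row_stride = data.ages.strides[0]   # skips over the 'weights' field
col_stride = ages.strides[0]        # equals the item size

stride_ratio = row_stride // col_stride
row_contiguous = data.ages.flags['C_CONTIGUOUS']
col_contiguous = ages.flags['C_CONTIGUOUS']

print(row_stride, col_stride, stride_ratio)
```

With two equally sized fields, the recarray column is read with a
stride twice the item size, which is the "stride count of 2" mentioned
in the quoted message.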
Cheers,
--
Francesc Alted