[Numpy-discussion] Re : Alternative to record array

2 Jan 2009

Thank you for everything, it works fine and it is very helpful.

Regards,

Jean-Baptiste Rudant

________________________________
From: Francesc Alted <faltet@pytables.org>
To: Discussion of Numerical Python <numpy-discussion@scipy.org>
Sent: Tuesday, 30 December 2008, 16:34:27
Subject: Re: [Numpy-discussion] Alternative to record array

On Tuesday 30 December 2008, Francesc Alted wrote:
...
On Monday 29 December 2008, Jean-Baptiste Rudant wrote:
[snip]
The difference between the two approaches is that the row-wise
arrangement is more efficient when data is iterated by row (record),
while the column-wise one is more efficient when data is iterated by
column.  This is why you are seeing the 4x increase in performance.
Incidentally, by looking at both data arrangements, I'd expect an
increase of just 2x (the stride count is 2 in this case), but I
suspect that there are hidden copies during the increment operation
for the record array case.

As I was mystified by this difference in speed, I kept investigating 
and I think I have an explanation for why the speed-up differs from 
the expected one for the in-place increment operator on a recarray 
field.  After looking at the numpy code, it turns out that the 
following statement:

data.ages += 1

is more or less equivalent to:

a = data.ages
a[:] = a + 1

i.e. a temporary is created (to hold the result of 'a + 1') and then 
assigned to the 'ages' column.  Since memory copies are the bottleneck 
in this sort of operation, the creation of the temporary introduces a 
slowdown of 2x (due to the strided column) and the assignment accounts 
for the additional 2x (4x in total).  However, the following idiom:

a = data.ages
a += 1

effectively removes the need for the temporary copy and is 2x faster 
than the original "data.ages += 1".  This can be seen in the following 
simple benchmark:

---------------------------
import numpy, timeit

count = 10**7  # must be an integer: randint() rejects a float size
ages  = numpy.random.randint(0,100,count)
weights = numpy.random.randint(1,200,count)
data = numpy.rec.fromarrays((ages,weights),names='ages,weights')

# v0: in-place increment written directly on the recarray field
timer = timeit.Timer('data.ages += 1','from __main__ import data')
print("v0--> %s" % timer.timeit(number=10))
# v1: explicit temporary, then assignment back into the column
timer = timeit.Timer('a=data.ages; a[:] = a + 1','from __main__ import data')
print("v1--> %s" % timer.timeit(number=10))
# v2: in-place increment on a view of the column
timer = timeit.Timer('a=data.ages; a += 1','from __main__ import data')
print("v2--> %s" % timer.timeit(number=10))
# v3: in-place increment on the original contiguous array
timer = timeit.Timer('ages += 1','from __main__ import ages')
print("v3--> %s" % timer.timeit(number=10))
---------------------------

which produces the following output on my laptop:

v0--> 2.98340201378
v1--> 3.22748112679
v2--> 1.5474319458
v3--> 0.809724807739
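
To make the stride argument concrete, here is a small sketch (separate 
from the benchmark) that inspects the memory layout of a recarray 
field.  The small size n and the explicit int64 dtype are assumptions 
chosen so that the byte counts are predictable; the integer width on 
the machine that produced the timings above may have been different.

```python
import numpy as np

n = 1000  # small on purpose: this only inspects layout, it does not time anything
ages = np.arange(n, dtype=np.int64)
weights = np.arange(n, dtype=np.int64)
data = np.rec.fromarrays((ages, weights), names='ages,weights')

col = data.ages  # a view onto the record buffer, not a copy

# Each record holds two 8-byte fields, so consecutive ages are 16 bytes apart,
# while the standalone array is contiguous (8 bytes apart).
print(col.strides, ages.strides)

# The view shares memory with the recarray, so an in-place operation on it
# updates the records without any assignment back to data.ages.
print(np.shares_memory(col, data))
col += 1
print(data.ages[:3])
```

The ratio between the two strides is the "stride count of 2" mentioned 
earlier: walking the 'ages' column touches twice as much memory as 
walking a contiguous array of the same length.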

As a final comment, I suppose that in-place operators (+=, -=...) could 
be optimized for recarray columns in numpy, but I don't think it is 
worth the effort:  when really high performance is needed for operating 
on columns of a recarray, a column-wise approach is best.
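
As an illustration only, a column-wise layout could be as simple as a 
dictionary of contiguous arrays, one per column.  The names 'columns' 
and 'names' below are hypothetical and this is just one possible 
arrangement, not something taken from numpy itself; the recarray is 
built at the end only to show that a row-wise view can still be 
produced when it is needed.

```python
import numpy as np

n = 1000  # illustrative size (assumption)
rng = np.random.RandomState(0)

# Column-wise storage: every column is its own contiguous array.
names = ('ages', 'weights')
columns = {
    'ages': rng.randint(0, 100, n),
    'weights': rng.randint(1, 200, n),
}

before = columns['ages'].copy()

# The in-place increment works on contiguous memory; the dict item is
# rebound to the very same array object, so no column copy is made.
columns['ages'] += 1

# If a row-wise view is needed later, build it once from the columns
# (this step does copy the data into the record layout).
rec = np.rec.fromarrays([columns[k] for k in names], names=','.join(names))
print(rec.dtype.names, rec.shape)
```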

Cheers,

-- 
Francesc Alted