Thank you for everything, it works fine and it is very helpful.

Regards,
Jean-Baptiste Rudant

________________________________
From: Francesc Alted <faltet@pytables.org>
To: Discussion of Numerical Python <numpy-discussion@scipy.org>
Sent: Tuesday, 30 December 2008, 16:34:27
Subject: Re: [Numpy-discussion] Alternative to record array

On Tuesday 30 December 2008, Francesc Alted wrote:
On Monday 29 December 2008, Jean-Baptiste Rudant wrote: [snip]
The difference between the two approaches is that the row-wise arrangement is more efficient when data is iterated by row (record by record), while the column-wise one is more efficient when data is iterated by column. This is why you are seeing the 4x increase in performance. Incidentally, looking at both data arrangements, I'd expect an increase of just 2x (the stride count is 2 in this case), but I suspect that there are hidden copies during the increment operation in the record array case.
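To make the "stride count is 2" remark concrete, here is a minimal sketch, assuming two int64 columns packed with numpy.rec.fromarrays (the array length and the dtype are illustrative choices, not taken from the thread). It only inspects the memory layout; it does not measure anything.

```python
import numpy as np

n = 1000  # illustrative size, not the one used in the thread
ages = np.arange(n, dtype=np.int64)
weights = np.arange(n, dtype=np.int64)

# Row-wise arrangement: each record stores (ages, weights) side by side.
data = np.rec.fromarrays((ages, weights), names='ages,weights')

# One record is two 8-byte integers.
print(data.dtype.itemsize)   # 16

# Walking the 'ages' field jumps over the 'weights' value of every record,
# so consecutive elements are 16 bytes apart (a stride of 2 items).
print(data.ages.strides)     # (16,)

# Column-wise arrangement: a plain array is contiguous, 8 bytes per step.
print(ages.strides)          # (8,)
```

With this layout, a pass over data.ages touches twice as much memory as a pass over the standalone ages array, which is where the expected 2x comes from.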
As I was mystified by this difference in speed, I kept investigating, and I think I have an answer for the gap between the expected and the observed speed-up of the in-place increment operator on a recarray field. After looking at the numpy code, it turns out that the following statement:

    data.ages += 1

is more or less equivalent to:

    a = data.ages
    a[:] = a + 1

i.e. a temporary is created (to hold the result of 'a + 1') and then assigned to the 'ages' column. Since memory copies are the bottleneck in this sort of operation, the creation of the temporary introduces a slowdown of 2x (due to the strided column) and the assignment accounts for the additional 2x (4x in total).

However, the following idiom:

    a = data.ages
    a += 1

effectively removes the need for the temporary copy and is 2x faster than the original "data.ages += 1".

This can be seen in the following simple benchmark:

---------------------------
import numpy, timeit

count = int(10e6)  # randint needs an integer size
ages = numpy.random.randint(0, 100, count)
weights = numpy.random.randint(1, 200, count)
data = numpy.rec.fromarrays((ages, weights), names='ages,weights')

# v0: in-place increment directly on the recarray field
timer = timeit.Timer('data.ages += 1', 'from __main__ import data')
print("v0-->", timer.timeit(number=10))
# v1: explicit temporary plus assignment
timer = timeit.Timer('a=data.ages; a[:] = a + 1', 'from __main__ import data')
print("v1-->", timer.timeit(number=10))
# v2: in-place increment on a view of the field
timer = timeit.Timer('a=data.ages; a += 1', 'from __main__ import data')
print("v2-->", timer.timeit(number=10))
# v3: in-place increment on a contiguous (column-wise) array
timer = timeit.Timer('ages += 1', 'from __main__ import ages')
print("v3-->", timer.timeit(number=10))
---------------------------

which produces the following output on my laptop:

v0--> 2.98340201378
v1--> 3.22748112679
v2--> 1.5474319458
v3--> 0.809724807739

As a final comment, I suppose that in-place operators (+=, -=, ...) could be optimized for recarray columns in numpy, but I don't think it is worth the effort: when really high performance is needed for operating on columns in the context of recarrays, a column-wise approach is best.
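As a small sanity check of the idiom above, the following sketch (with made-up three-element columns, chosen only for illustration) shows that the field access returns a view, that 'a + 1' allocates a separate temporary, and that 'a += 1' updates the record array in place. It illustrates the semantics only and says nothing about timings.

```python
import numpy as np

# Tiny made-up columns, just for illustration.
ages = np.array([30, 41, 52], dtype=np.int64)
weights = np.array([60, 70, 80], dtype=np.int64)
data = np.rec.fromarrays((ages, weights), names='ages,weights')

a = data.ages                      # a view on the 'ages' field, no copy
assert np.shares_memory(a, data)

tmp = a + 1                        # a new array: the temporary
assert not np.shares_memory(tmp, data)

a += 1                             # in place on the view, no temporary array
print(data.ages)                   # [31 42 53]
print(ages)                        # [30 41 52] (fromarrays copied the input)
```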
Cheers,

-- Francesc Alted