Mailman 3 Re: [Numpy-discussion] Advice please on efficient subtotal function - NumPy-Discussion

30 Dec 2006


      Hi Stephen,
...
I'm looking for efficient ways to subtotal a 1-d array onto a 2-D
grid. This
is more easily explained in code that words, thus:
for n in xrange(len(data)):
    totals[ i[n], j[n] ] += data[n]
data comes from a series of PyTables files with ~200m rows. Each row
has ~20
cols, and I use the first three columns (which are 1-3 char strings)
to form
the indexing functions i[] and j[], then want to calc averages of the
remaining 17 numerical cols.
I have tried various indirect ways of doing this with searchsorted and
bincount, but intuitively they feel overly complex solutions to what
is
essentially a very simple problem.
My work involved comparing the subtotals for various different
segmentation
strategies (the i[] and j[] indexing functions). Efficient solutions
are
important because I need to make many passes through the 200m rows of
data.
Memory usage is the easiest thing for me to adjust by changing how
many rows
of data to read in for each pass and then reusing the same array data
buffers.
It looks as if the values in your i and j columns come from a limited 
range, so you may consider encoding pairs of (i,j) values into one int 
using a suitable encoding function (e.g. ij = i+K*j if both i and j are
non-negative and K=max(i)+1). You could then use bincount(ij, data) to
get the sums per encoded (i,j) pair. This should be efficient, and the
complexity is only in the encoding/decoding steps.

Best regards,
Andreas
----
Dr. Andreas Eisele                     Senior Researcher
DFKI GmbH, Language Technology Lab        eisele@dfki.de
Stuhlsatzenhausweg 3               tel: +49-681-302-5285
D-66123 Saarbrücken                fax: +49-681-302-5338

Re: [Numpy-discussion] Advice please on efficient subtotal function

Andreas Eisele

tags

participants (1)