Re: [Numpydiscussion] Advice please on efficient subtotal function
Hi Stephen,
I'm looking for efficient ways to subtotal a 1d array onto a 2D grid. This is more easily explained in code that words, thus:
for n in xrange(len(data)): totals[ i[n], j[n] ] += data[n]
data comes from a series of PyTables files with ~200m rows. Each row has ~20 cols, and I use the first three columns (which are 13 char strings) to form the indexing functions i[] and j[], then want to calc averages of the remaining 17 numerical cols.
I have tried various indirect ways of doing this with searchsorted and bincount, but intuitively they feel overly complex solutions to what is essentially a very simple problem.
My work involved comparing the subtotals for various different segmentation strategies (the i[] and j[] indexing functions). Efficient solutions are important because I need to make many passes through the 200m rows of data. Memory usage is the easiest thing for me to adjust by changing how many rows of data to read in for each pass and then reusing the same array data buffers.
It looks as if the values in your i and j columns come from a limited range, so you may consider encoding pairs of (i,j) values into one int using a suitable encoding function (e.g. ij = i+K*j if both i and j are nonnegative and K=max(i)+1). You could then use bincount(ij, data) to get the sums per encoded (i,j) pair. This should be efficient, and the complexity is only in the encoding/decoding steps. Best regards, Andreas  Dr. Andreas Eisele Senior Researcher DFKI GmbH, Language Technology Lab eisele@dfki.de Stuhlsatzenhausweg 3 tel: +496813025285 D66123 Saarbrücken fax: +496813025338
participants (1)

Andreas Eisele