Hi,

I'm looking for efficient ways to subtotal a 1d array onto a 2D grid. This is more easily explained in code than words, thus:

for n in xrange(len(data)):
    totals[ i[n], j[n] ] += data[n]

data comes from a series of PyTables files with ~200m rows. Each row has ~20 cols, and I use the first three columns (which are 1-3 char strings) to form the indexing functions i[] and j[], then want to calc averages of the remaining 17 numerical cols.

I have tried various indirect ways of doing this with searchsorted and bincount, but intuitively they feel like overly complex solutions to what is essentially a very simple problem.

My work involves comparing the subtotals for various different segmentation strategies (the i[] and j[] indexing functions). Efficient solutions are important because I need to make many passes through the 200m rows of data. Memory usage is the easiest thing for me to adjust, by changing how many rows of data to read in for each pass and then reusing the same array data buffers.

Thanks in advance for any suggestions!

Stephen
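[Editorial note: the scatter-add loop above can be written as a single vectorized call with numpy's np.add.at, a ufunc method that postdates this 2006 thread. A minimal sketch; the grid shape and index values are made up for illustration:]

```python
import numpy as np

# Hypothetical small example: subtotal a 1d data array onto a 2D grid.
data = np.array([10.0, 20.0, 30.0, 40.0])
i = np.array([0, 1, 0, 1])   # row index on the grid for each data element
j = np.array([2, 0, 2, 0])   # column index on the grid for each data element

totals = np.zeros((2, 3))
# np.add.at performs unbuffered in-place addition, so repeated (i, j)
# pairs accumulate correctly (plain fancy-indexed += would not).
np.add.at(totals, (i, j), data)
```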
Hi Stephen,
If you want to sum/average down a column or across a row you can use sum().
The optional axis={0,1} parameter determines whether you are summing down a
column (default or axis=0) or across a row (axis=1).
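For instance, a minimal sketch of the axis behaviour on a made-up 2x2 array:

```python
import numpy as np

a = np.array([[1, 2],
              [3, 4]])
col_sums = a.sum(axis=0)   # sum down each column
row_sums = a.sum(axis=1)   # sum across each row
```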
Greg
On 12/29/06, Stephen Simmons
Hi,
I'm looking for efficient ways to subtotal a 1d array onto a 2D grid. This is more easily explained in code than words, thus:

for n in xrange(len(data)):
    totals[ i[n], j[n] ] += data[n]

data comes from a series of PyTables files with ~200m rows. Each row has ~20 cols, and I use the first three columns (which are 1-3 char strings) to form the indexing functions i[] and j[], then want to calc averages of the remaining 17 numerical cols.

I have tried various indirect ways of doing this with searchsorted and bincount, but intuitively they feel like overly complex solutions to what is essentially a very simple problem.

My work involves comparing the subtotals for various different segmentation strategies (the i[] and j[] indexing functions). Efficient solutions are important because I need to make many passes through the 200m rows of data. Memory usage is the easiest thing for me to adjust, by changing how many rows of data to read in for each pass and then reusing the same array data buffers.
Thanks in advance for any suggestions!
Stephen
_______________________________________________
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion
 Linux. Because rebooting is for adding hardware.
On Friday 29 December 2006 10:05, Stephen Simmons wrote:
Hi,
I'm looking for efficient ways to subtotal a 1d array onto a 2D grid. This is more easily explained in code than words, thus:

for n in xrange(len(data)):
    totals[ i[n], j[n] ] += data[n]

data comes from a series of PyTables files with ~200m rows. Each row has ~20 cols, and I use the first three columns (which are 1-3 char strings) to form the indexing functions i[] and j[], then want to calc averages of the remaining 17 numerical cols.

I have tried various indirect ways of doing this with searchsorted and bincount, but intuitively they feel like overly complex solutions to what is essentially a very simple problem.

My work involves comparing the subtotals for various different segmentation strategies (the i[] and j[] indexing functions). Efficient solutions are important because I need to make many passes through the 200m rows of data. Memory usage is the easiest thing for me to adjust, by changing how many rows of data to read in for each pass and then reusing the same array data buffers.
Well, from your words I guess you have already tested this, but just in case: as PyTables saves data in tables row-wise, it is always faster to use the complete row for computations in each iteration than just a single column. This is shown in the small benchmark that I'm attaching at the end of the message. Here is its output for a table with 1m rows:

time for creating the file> 12.044
time for using column reads > 46.407
time for using the row wise iterator> 73.036
time for using block reads (row wise)> 5.156

So, using block reads (in case you can use them) is your best bet.

HTH,

import tables
import numpy
from time import time

nrows = 1000*1000

# Create a table definition with 17 double cols and 3 string cols
coltypes = numpy.dtype("f8,"*17 + "S3,"*3)

t1 = time()
# Create a file with an empty table. Use compression to minimize file size.
f = tables.openFile("/tmp/prova.h5", 'w')
table = f.createTable(f.root, 'table', numpy.empty(0, coltypes),
                      filters=tables.Filters(complevel=1, complib='lzo'))
# Fill the table with default values (empty strings and zeros)
row = table.row
for nrow in xrange(nrows):
    row.append()
f.close()
print "time for creating the file>", round(time()-t1, 3)

# *********** Start benchmarks **************************
f = tables.openFile("/tmp/prova.h5", 'r')
table = f.root.table
colnames = table.colnames[:-3]   # exclude the string cols

# Loop over the table using column reads
t1 = time(); cum = numpy.zeros(17)
for ncol, colname in enumerate(colnames):
    col = table.read(0, nrows, field=colname)
    cum[ncol] += col.sum()
print "time for using column reads >", round(time()-t1, 3)

# Loop over the table using its row iterator
t1 = time(); cum = numpy.zeros(17)
for row in table:
    for ncol, colname in enumerate(colnames):
        cum[ncol] += row[colname]
print "time for using the row iterator>", round(time()-t1, 3)

# Loop over the table using block reads (row wise)
t1 = time(); cum = numpy.zeros(17)
step = 10000
for nrow in xrange(0, nrows, step):
    ra = table[nrow:nrow+step]
    for ncol, colname in enumerate(colnames):
        cum[ncol] += ra[colname].sum()
print "time for using block reads (row wise)>", round(time()-t1, 3)

f.close()
Francesc Altet  --  Cárabos Coop. V.  --  http://www.carabos.com/  --  Enjoy Data
Thanks Francesc, but I am already planning to read the data in block-wise as
you suggest. My question is rather how best to update the subtotals for each
block in a parallel way using numpy efficiently, rather than with a simplistic
and slow element-by-element loop.
I can't use a simple sum(), as in your benchmark example or Greg's reply,
because I need to do:
for n in xrange(len(data)):
    totals[ i[n], j[n] ] += data[n]

and not

for n in xrange(len(data)):
    totals[n] += data[n]
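[Editorial note: one way to get this grouped accumulation without an explicit Python loop is np.bincount on a flattened (i, j) index; bincount's weights argument turns the count into a sum. A sketch with made-up grid dimensions and toy data:]

```python
import numpy as np

data = np.array([1.0, 2.0, 3.0, 4.0])
i = np.array([0, 1, 0, 1])
j = np.array([2, 0, 2, 0])
nrows_grid, ncols_grid = 2, 3   # hypothetical grid shape

# Flatten each (i, j) pair to a single linear index, then let bincount
# do the grouped summation in one C-level pass.
flat = i * ncols_grid + j
totals = np.bincount(flat, weights=data,
                     minlength=nrows_grid * ncols_grid)
totals = totals.reshape(nrows_grid, ncols_grid)
```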
My best solution so far is roughly like this:

- read in next block of 100k or so rows (taking into account the PyTables
  table's _v_maxTuples and _v_chunksize)
- calculate the subtotal index arrays i and j
- do a lexsort() on [i, j, n]
- partition the sorted [i, j, n] into subsets where the i and j arrays change
  values. The k_th such subset is thus
      s_k = [ i_k, j_k, [n_k0, ..., n_kN] ]
- update the subtotals for each subset in the block
      totals[i_k, j_k] += sum(data[n_k0, ..., n_kN])
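[Editorial note: the lexsort/partition scheme above can be sketched as follows, with np.add.reduceat doing the per-segment summation; the data values and grid shape are made up for illustration:]

```python
import numpy as np

data = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
i = np.array([0, 1, 0, 1, 0])
j = np.array([2, 0, 2, 0, 1])

# Sort so equal (i, j) pairs are adjacent (np.lexsort: last key is primary).
order = np.lexsort((j, i))
si, sj, sd = i[order], j[order], data[order]

# Segment starts: positions where the (i, j) pair changes.
change = np.empty(len(sd), dtype=bool)
change[0] = True
change[1:] = (si[1:] != si[:-1]) | (sj[1:] != sj[:-1])
starts = np.flatnonzero(change)

# One reduction per segment, then a single scatter into the grid
# (the indices are now unique, so fancy-indexed += is safe).
sums = np.add.reduceat(sd, starts)
totals = np.zeros((2, 3))
totals[si[starts], sj[starts]] += sums
```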
This should be reasonably efficient, but it's messy, and I'm not familiar
enough with numpy's indexing tricks to get this right first time.
Maybe instead I'll have a go at writing a Pyrex function that implements the
simple loop at C speed:
subtotal2d(data_array, idx_array, out=None, dtype=None)
where data_array is Nx1, idx_array is NxM and out is M-dimensional.
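[Editorial note: a pure-NumPy stand-in for the proposed subtotal2d can be built from ravel_multi_index plus bincount; the shape parameter below is an addition, since the output grid size must be supplied somewhere. A sketch, not the Pyrex implementation Stephen describes:]

```python
import numpy as np

def subtotal2d(data, idx, shape, out=None, dtype=np.float64):
    # Scatter-add data[n] into out[idx[n, 0], idx[n, 1], ...].
    if out is None:
        out = np.zeros(shape, dtype=dtype)
    # Collapse each row of the N x M index array into one linear index.
    flat = np.ravel_multi_index(tuple(idx.T), shape)
    # Grouped sum in one pass, then add into the output grid.
    out += np.bincount(flat, weights=data, minlength=out.size).reshape(shape)
    return out

# Toy usage: three values scattered onto a 2x2 grid.
totals = subtotal2d(np.array([1.0, 2.0, 3.0]),
                    np.array([[0, 0], [1, 1], [0, 0]]), (2, 2))
```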
Incidentally, there's one other function I'd find useful here in forming the
index arrays i[] and j[]: a fast translate-from-dict function:

arr2 = fromiter((d[a] for a in arr1), dtype=int)
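[Editorial note: a vectorized alternative to the fromiter loop, assuming every element of arr1 appears in the dict: sort the keys once, then translate the whole array with a single searchsorted. The mapping below is made up for illustration:]

```python
import numpy as np

# Hypothetical mapping from 3-char codes to integer bucket indices.
d = {'AAA': 0, 'BBB': 1, 'CCC': 2}
arr1 = np.array(['BBB', 'AAA', 'CCC', 'AAA'])

# Sort the keys once; searchsorted then finds each element's position
# in the sorted key array, which indexes into the matching values.
keys = np.array(sorted(d))
vals = np.array([d[k] for k in keys])
arr2 = vals[np.searchsorted(keys, arr1)]
```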
My initial impression is that a C version would be substantially faster; maybe
I should do some benchmarking to see whether a pure Python/numpy approach is
actually faster than I expect.
Cheers, and thanks for any further suggestions,
Stephen
Francesc Altet
On Friday 29 December 2006 10:05, Stephen Simmons wrote:
Hi,
I'm looking for efficient ways to subtotal a 1d array onto a 2D grid. This is more easily explained in code than words, thus:

for n in xrange(len(data)):
    totals[ i[n], j[n] ] += data[n]

data comes from a series of PyTables files with ~200m rows. Each row has ~20 cols, and I use the first three columns (which are 1-3 char strings) to form the indexing functions i[] and j[], then want to calc averages of the remaining 17 numerical cols.

I have tried various indirect ways of doing this with searchsorted and bincount, but intuitively they feel like overly complex solutions to what is essentially a very simple problem.

My work involves comparing the subtotals for various different segmentation strategies (the i[] and j[] indexing functions). Efficient solutions are important because I need to make many passes through the 200m rows of data. Memory usage is the easiest thing for me to adjust, by changing how many rows of data to read in for each pass and then reusing the same array data buffers.
Well, from your words I guess you have already tested this, but just in case.
As PyTables saves data in tables row-wise, it is always faster to use the
complete row for computations in each iteration than just a single
column. This is shown in the small benchmark that I'm attaching at the end of
the message. Here is its output for a table with 1m rows:
time for creating the file> 12.044
time for using column reads > 46.407
time for using the row wise iterator> 73.036
time for using block reads (row wise)> 5.156
So, using block reads (in case you can use them) is your best bet.
HTH,

import tables
import numpy
from time import time

nrows = 1000*1000

# Create a table definition with 17 double cols and 3 string cols
coltypes = numpy.dtype("f8,"*17 + "S3,"*3)

t1 = time()
# Create a file with an empty table. Use compression to minimize file size.
f = tables.openFile("/tmp/prova.h5", 'w')
table = f.createTable(f.root, 'table', numpy.empty(0, coltypes),
                      filters=tables.Filters(complevel=1, complib='lzo'))
# Fill the table with default values (empty strings and zeros)
row = table.row
for nrow in xrange(nrows):
    row.append()
f.close()
print "time for creating the file>", round(time()-t1, 3)

# *********** Start benchmarks **************************
f = tables.openFile("/tmp/prova.h5", 'r')
table = f.root.table
colnames = table.colnames[:-3]   # exclude the string cols

# Loop over the table using column reads
t1 = time(); cum = numpy.zeros(17)
for ncol, colname in enumerate(colnames):
    col = table.read(0, nrows, field=colname)
    cum[ncol] += col.sum()
print "time for using column reads >", round(time()-t1, 3)

# Loop over the table using its row iterator
t1 = time(); cum = numpy.zeros(17)
for row in table:
    for ncol, colname in enumerate(colnames):
        cum[ncol] += row[colname]
print "time for using the row iterator>", round(time()-t1, 3)

# Loop over the table using block reads (row wise)
t1 = time(); cum = numpy.zeros(17)
step = 10000
for nrow in xrange(0, nrows, step):
    ra = table[nrow:nrow+step]
    for ncol, colname in enumerate(colnames):
        cum[ncol] += ra[colname].sum()
print "time for using block reads (row wise)>", round(time()-t1, 3)

f.close()
participants (3):
- Francesc Altet
- Greg Willden
- Stephen Simmons