[Numpy-discussion] Advice please on efficient subtotal function

Francesc Altet faltet at carabos.com
Fri Dec 29 10:05:28 EST 2006


On Friday 29 December 2006 10:05, Stephen Simmons wrote:
> Hi,
>
> I'm looking for efficient ways to subtotal a 1-D array onto a 2-D grid.
> This is more easily explained in code than words, thus:
>
> for n in xrange(len(data)):
>     totals[ i[n], j[n] ] += data[n]
>
> data comes from a series of PyTables files with ~200m rows. Each row has
> ~20 cols, and I use the first three columns (which are 1-3 char strings) to
> form the indexing functions i[] and j[], then want to calculate averages of
> the remaining 17 numerical cols.
>
> I have tried various indirect ways of doing this with searchsorted and
> bincount, but intuitively they feel like overly complex solutions to what
> is essentially a very simple problem.
>
> My work involves comparing the subtotals for various segmentation
> strategies (the i[] and j[] indexing functions). Efficient solutions are
> important because I need to make many passes through the 200m rows of data.
> Memory usage is the easiest thing for me to adjust by changing how many
> rows of data to read in for each pass and then reusing the same array data
> buffers.
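
For the accumulation loop itself, a vectorized route is to collapse each
(i, j) pair into a single index and let numpy.bincount do the summing.
A minimal sketch (the helper name is mine, and it assumes i and j are
already non-negative integer arrays):

import numpy

def subtotal_2d(totals, i, j, data):
    # Accumulate data[n] into totals[i[n], j[n]] without a Python loop.
    # totals: 2-D accumulator; i, j: non-negative integer index arrays.
    nj = totals.shape[1]
    flat = i*nj + j                  # collapse each (i, j) pair to one index
    s = numpy.bincount(flat, weights=data)
    totals.flat[:len(s)] += s        # bincount's result stops at flat.max()

Accumulating numpy.bincount(flat) the same way gives the per-cell counts,
so the averages at the end are just the ratio of the two accumulators.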

As for the I/O side, from your words I guess you have already tested this,
but just in case: because PyTables stores table data row-wise, reading
complete rows in big blocks is generally much faster than reading the table
column by column or iterating over it row by row. This is shown in the small
benchmark that I'm attaching at the end of the message. Here is its output
for a table with 1m rows:

time for creating the file--> 12.044
time for using column reads --> 46.407
time for using the row wise iterator--> 73.036
time for using block reads (row wise)--> 5.156

So, using block reads (if you can use them) is your best bet.
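
To tie this back to your subtotal problem, block reads combine naturally
with the bincount sketch above: searchsorted can turn the string key
columns of each block into grid positions. Another rough sketch, which
reuses subtotal_2d() from earlier; the 'key1'/'key2' column names are made
up, and every key is assumed to occur in the sorted key tables:

import numpy

def subtotal_pass(table, ikeys, jkeys, valcols, step=10000):
    # ikeys, jkeys: *sorted* numpy arrays of the possible key strings
    # valcols: names of the numeric columns to subtotal
    ni, nj = len(ikeys), len(jkeys)
    totals = numpy.zeros((len(valcols), ni, nj))
    counts = numpy.zeros((ni, nj))
    for nrow in xrange(0, table.nrows, step):
        ra = table[nrow:nrow+step]            # one block, read row wise
        i = ikeys.searchsorted(ra['key1'])    # string key -> grid position
        j = jkeys.searchsorted(ra['key2'])    # ('key1'/'key2' are made up)
        c = numpy.bincount(i*nj + j)
        counts.flat[:len(c)] += c             # one count per data row
        for ncol, colname in enumerate(valcols):
            subtotal_2d(totals[ncol], i, j, ra[colname])
    return totals, counts

Different segmentation strategies then only need different ikeys/jkeys,
and the averages fall out as totals/counts wherever counts is nonzero.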

HTH,

--------------------------------------------------------------------------------------
import tables
import numpy
from time import time

nrows = 1000*1000

# Create a table definition with 17 double cols and 3 string cols
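# (numpy accepts the trailing comma in comma-separated dtype strings)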
coltypes = numpy.dtype("f8,"*17 + "S3,"*3)

t1 = time()
# Create a file with an empty table. Use compression to minimize file size.
f = tables.openFile("/tmp/prova.h5", 'w')
table = f.createTable(f.root, 'table', numpy.empty(0, coltypes),
                      filters=tables.Filters(complevel=1, complib='lzo'))
# Fill the table with default values (empty strings and zeros)
row = table.row
for nrow in xrange(nrows):
    row.append()
f.close()
print "time for creating the file-->", round(time()-t1, 3)

# *********** Start benchmarks **************************
f = tables.openFile("/tmp/prova.h5", 'r')
table = f.root.table
colnames = table.colnames[:-3]  # exclude the string cols

# Loop over the table using column reads
t1 = time(); cum = numpy.zeros(17)
for ncol, colname in enumerate(colnames):
    col = table.read(0, nrows, field=colname)
    cum[ncol] += col.sum()
print "time for using column reads -->", round(time()-t1, 3)

# Loop over the table using its row iterator
t1 = time(); cum = numpy.zeros(17)
for row in table:
    for ncol, colname in enumerate(colnames):
        cum[ncol] += row[colname]
print "time for using the row iterator-->", round(time()-t1, 3)

# Loop over the table using block reads (row wise)
t1 = time(); cum = numpy.zeros(17)
step = 10000
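# 10k-row blocks: large enough to amortize the per-read overhead,
# small enough that each block comfortably fits in memory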
for nrow in xrange(0, nrows, step):
    ra = table[nrow:nrow+step]
    for ncol, colname in enumerate(colnames):
        cum[ncol] += ra[colname].sum()
print "time for using block reads (row wise)-->", round(time()-t1, 3)

f.close()

-- 
>0,0<   Francesc Altet     http://www.carabos.com/
V   V   Cárabos Coop. V.   Enjoy Data
 "-"


