[Numpy-discussion] Advice on converting iterator into array efficiently

Fri Aug 29 06:19:39 EDT 2008

On Friday 29 August 2008, Francesc Alted wrote:
> On Friday 29 August 2008, Alan Jackson wrote:
> > Looking for advice on a good way to handle this problem.
> >
> > I'm dealing with large tables (Gigabyte large). I would like to
> > efficiently subset values from one column based on the values in
> > another column, and get arrays out of the operation. For example,
> > say I have 2 columns, "energy" and "collection". Collection is
> > basically an index that flags values that go together, so all the
> > energy values with a collection value of 18 belong together. I'd
> > like to be able to set up an iterator on collection that would
> > hand me an array of energy on each iteration :
> >
> > if table is all my data, then something like
> >
> > for c in table['collection'] :
> >     e = c['energy']
> >     ... do array operations on e
> >
> > I've been playing with pytables, and they help, but I can't quite
> > seem to get there. I can get an iterator for energy within a
> > collection, but I can't figure out an efficient way to get an array
> > out.
> >
> > What I have so far is
> >
> > for c in np.unique(table.col('collection')) :
> >     rows = table.where('collection == c')
> >     for row in rows :
> >         print c,' : ', row['energy']
> >
> > but I really want to convert rows['energy'] to an array.
>
> You may use a list to keep the values and then convert it to an
> array. Also, you can use a dictionary for keeping the unique
> collections.  The following should do the trick:
>
> energies = {}
> for row in table:
>     c = row['collection']
>     e = row['energy']
>     if c in energies:
>         energies[c].append(e)
>     else:
>         energies[c] = [e]
>
> # Convert the lists into numpy arrays
> for key in energies:
>     energies[key] = numpy.array(energies[key])

Er... I completely forgot about the fine ``Table.readWhere()`` method.
By using it, you can do:

for c in np.unique(table.col('collection')) :
    print c,' : ', table.readWhere('collection == c', field='energy')

which is what you want.  However, this solution requires reading the
entire table once for each value of the collection (and once more in
order to read the column of collections at the beginning).  Run your
own timings and choose whichever approach you prefer.
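
If the two columns fit in memory, one possible middle ground is to read
each column once and split the energies by collection with a sort.  What
follows is only a sketch under that assumption: the small structured
array is a made-up stand-in for the table, and ``group_energies`` is a
hypothetical helper name, not part of PyTables or NumPy.  With a real
table, the two columns would presumably come from
``table.col('collection')`` and ``table.col('energy')``.

```python
import numpy as np

# Made-up stand-in for the table: two columns, as in the thread.
table = np.array(
    [(18, 1.5), (7, 0.2), (18, 2.5), (7, 0.4), (3, 9.0)],
    dtype=[("collection", "i4"), ("energy", "f8")],
)


def group_energies(collection, energy):
    """Return {collection value: array of energies}, using each column once."""
    if collection.size == 0:
        return {}
    # A stable sort keeps the original row order inside each group.
    order = np.argsort(collection, kind="stable")
    sorted_coll = collection[order]
    sorted_energy = energy[order]
    # Positions where the sorted collection value changes are group starts.
    boundaries = np.flatnonzero(sorted_coll[1:] != sorted_coll[:-1]) + 1
    keys = sorted_coll[np.r_[0, boundaries]]
    return dict(zip(keys.tolist(), np.split(sorted_energy, boundaries)))


energies = group_energies(table["collection"], table["energy"])
for c in sorted(energies):
    print("%s : %s" % (c, energies[c]))
```

This trades memory for I/O: nothing is re-read per collection value, but
both columns and the sort index have to be held in memory at once, which
may not be acceptable for gigabyte-sized tables.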

Cheers,

-- 
Francesc Alted