[Numpy-discussion] Advice on converting iterator into array efficiently

Sun Aug 31 08:30:47 EDT 2008

On Saturday 30 August 2008, Alan Jackson wrote:
> I tested all three offered solutions :
>
> t = table[:] # convert to structured array
> collections = np.unique(t['collection'])
> for collection in collections:
>     cond = t['collection'] == collection
>     energy_this_collection = t['energy'][cond]
> ----------------------------------
>
> energies = {}
> for row in table:
>     c = row['collection']
>     e = row['energy']
>     if c in energies:
>         energies[c].append(e)
>     else:
>         energies[c] = [e]
>
> # Convert the lists in numpy arrays
> for key in energies:
>     energies[key] = np.array(energies[key])
> ---------------------------------
>
> for c in np.unique(table.col('collection')) :
>     print c,' : ', table.readWhere('collection == c', field='energy')
>
> and the timing results were rather dramatic :
>
> time 1 =  0.79
> time 2 =  0.08
> time 3 =  10.35
>
> This was a test on a relatively small table. I'll have to try it out
> on something really big next and see how the memory usage works out.

Solution 1 loads the entire table into memory (notice the ``[:]``
operator), so if your table is large this is probably not what you
want.  With solution 2, you end up loading only the energies into
memory (first as lists and then as NumPy arrays), which can be a big
win if your table has many other fields.  Finally, solution 3 is the
one that takes the least memory, as it only needs to load the
collection column into memory (I'm assuming that this column has a
lighter datatype than the energy one).
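
As a rough illustration of the single-pass idea behind solution 2, here
is a minimal sketch in current Python 3 syntax.  The ``rows`` list is a
hypothetical stand-in for iterating over the table, and
``group_energies`` is a name made up for this example.  It also swaps
the plain lists for ``array('d')`` buffers, which store the energies as
packed doubles instead of Python float objects; that is a variation on
the original, not what was timed above.

```python
from array import array
from collections import defaultdict

# Hypothetical stand-in for iterating over the table: each row is a dict
# exposing the 'collection' and 'energy' fields.
rows = [
    {"collection": 2, "energy": 1.5},
    {"collection": 1, "energy": 0.5},
    {"collection": 2, "energy": 2.5},
    {"collection": 1, "energy": 4.0},
]


def group_energies(rows):
    """Group energies by collection in a single pass over the rows."""
    # One packed buffer of C doubles per collection.
    energies = defaultdict(lambda: array("d"))
    for row in rows:
        energies[row["collection"]].append(row["energy"])
    return dict(energies)


grouped = group_energies(rows)
for c in sorted(grouped):
    print(c, " : ", list(grouped[c]))
```

If NumPy arrays are needed at the end, each buffer can be handed to
``np.frombuffer(buf, dtype=np.float64)``, which views the data without
copying it.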

For what it's worth, the most conservative solution in terms of memory
usage would be a combination of 2 and 3:

# Find the unique collections
collections = set()
for row in table:
    collections.add(row['collection'])
# Get the energies for each collection
for c in sorted(collections):
    e = table.readWhere('collection == c', field='energy')
    print c,' : ', e

Of course, and as I said in a previous message, this implies many reads 
of the table (this is why it is the slowest one).
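
To make the cost of those repeated reads concrete, here is a small
sketch in current Python 3 syntax.  ``CountingTable`` and its
``read_where`` method are hypothetical stand-ins written for this
example (they are not PyTables classes); they assume that, without an
index, each query has to walk the whole table once.

```python
class CountingTable:
    """Hypothetical in-memory stand-in for a table that counts full scans."""

    def __init__(self, rows):
        self._rows = rows
        self.scans = 0

    def __iter__(self):
        # Every iteration over the table counts as one full scan.
        self.scans += 1
        return iter(self._rows)

    def read_where(self, collection):
        # Stand-in for table.readWhere('collection == c', field='energy'):
        # assumed to cost one more pass over all the rows.
        return [r["energy"] for r in self if r["collection"] == collection]


table = CountingTable([
    {"collection": "a", "energy": 1.0},
    {"collection": "b", "energy": 2.0},
    {"collection": "a", "energy": 3.0},
    {"collection": "c", "energy": 4.0},
])

# First pass: find the unique collections.
unique = set()
for row in table:
    unique.add(row["collection"])

# Then one query (one extra scan) per collection.
result = {c: table.read_where(c) for c in sorted(unique)}

print("scans:", table.scans, "for", len(unique), "collections")
```

Under that assumption the number of scans is one plus the number of
distinct collections, so the approach saves memory at the price of
time that grows with the number of collections.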

If you need both speed and minimal memory consumption, indexing the
collection column with PyTables Pro could help, but as the bottleneck
is probably in the sparse reads of the table (performed by the
``readWhere`` method), the gain shouldn't be large in this case.

Cheers,

-- 
Francesc Alted