[Numpy-discussion] Advice on converting iterator into array efficiently
Francesc Alted
faltet at pytables.org
Sun Aug 31 08:30:47 EDT 2008
On Saturday 30 August 2008, Alan Jackson wrote:
> I tested all three offered solutions:
>
> t = table[:]  # convert to structured array
> collections = np.unique(t['collection'])
> for collection in collections:
>     cond = t['collection'] == collection
>     energy_this_collection = t['energy'][cond]
> ----------------------------------
>
> energies = {}
> for row in table:
>     c = row['collection']
>     e = row['energy']
>     if c in energies:
>         energies[c].append(e)
>     else:
>         energies[c] = [e]
>
> # Convert the lists into numpy arrays
> for key in energies:
>     energies[key] = numpy.array(energies[key])
> ---------------------------------
>
> for c in np.unique(table.col('collection')):
>     print c, ' : ', table.readWhere('collection == c', field='energy')
>
> and the timing results were rather dramatic:
>
> time 1 = 0.79
> time 2 = 0.08
> time 3 = 10.35
>
> This was a test on a relatively small table. I'll have to try it out
> on something really big next and see how the memory usage works out.
Solution 1 loads the entire table into memory (notice the ``[:]``
operator), so if your table is large this is probably not what you
want. With solution 2, you end up loading only the energies into
memory (first as lists and then as NumPy arrays), which can be a big
win if your table has many other fields. Finally, solution 3 is the
one that takes the least memory, as it only requires loading the
collection column into memory (I'm assuming that this column has a
lighter datatype than the energy one).
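To make the solution 1 trade-off concrete, here is a minimal
self-contained sketch of the same boolean-mask grouping on a small
synthetic structured array (the field names match the thread, but the
data and dtypes are made up for illustration; a real PyTables table
would be materialized with ``table[:]``):

```python
import numpy as np

# Synthetic stand-in for ``table[:]``: a structured array with a light
# 'collection' field and a heavier 'energy' field (made-up data).
t = np.array([(1, 10.0), (2, 20.0), (1, 30.0), (2, 40.0), (1, 50.0)],
             dtype=[('collection', 'i4'), ('energy', 'f8')])

# Group energies per collection with a boolean mask, as in solution 1.
# The whole array ``t`` lives in memory; each mask then selects only
# the rows belonging to one collection.
energies = {}
for collection in np.unique(t['collection']):
    cond = t['collection'] == collection
    energies[collection] = t['energy'][cond]
```

After the loop, ``energies`` maps each collection value to a NumPy
array of its energies; the cost is that the full table is resident in
memory for the duration.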
For what it's worth, the most conservative solution in terms of memory
usage would be a combination of 2 and 3:
# Find the unique collections
collections = []
for row in table:
    c = row['collection']
    if c not in collections:
        collections.append(c)

# Get the energy collections
for c in sorted(collections):
    e = table.readWhere('collection == c', field='energy')
    print c, ' : ', e
Of course, as I said in a previous message, this implies many reads
of the table (which is why it is the slowest one).
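Incidentally, on the thread's original subject of turning an iterator
into an array efficiently, ``numpy.fromiter`` can build each energy
array straight from a generator, skipping the intermediate Python list
of solution 2. A minimal sketch, with made-up in-memory dicts standing
in for iteration over a real table:

```python
import numpy as np

# Made-up rows standing in for iterating over a PyTables table.
rows = [{'collection': 1, 'energy': 10.0},
        {'collection': 2, 'energy': 20.0},
        {'collection': 1, 'energy': 30.0}]

# np.fromiter consumes the generator lazily, so only the energies of
# the selected collection are ever materialized -- no temporary list.
e1 = np.fromiter((row['energy'] for row in rows
                  if row['collection'] == 1), dtype='f8')
```

With a ``count`` argument (when the length is known in advance),
``np.fromiter`` can also preallocate the output array in one shot.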
In case you need both speed and minimum memory consumption,
indexing the collection column with PyTables Pro could help, but as the
bottleneck is probably in the sparse reads of the table (performed by
the ``readWhere`` method), the gain shouldn't be too large in this case.
Cheers,
--
Francesc Alted