[Numpy-discussion] Memory usage of numpy-arrays

Hannes Bretschneider hannes.bretschneider at wiwi.hu-berlin.de
Thu Jul 8 18:21:43 EDT 2010


Sebastian Haase <seb.haase <at> gmail.com> writes:


> 
> I would expect a 700MB text file to translate into less than 200MB of
> data - assuming that you are talking about decimal numbers (maybe a
> total of 10 digits each plus spaces) and saving as float32 binary.
> So the problem would "only" be the loading - or rather, going through
> all lines of text from start to end - without choking.
> This might be better done "by hand", i.e. in standard (non-numpy) Python:
> 
> nums = []
> for line in open("myTextFile.txt"):
>     fields = line.split()
>     nums.extend(map(float, fields))
> 
> The last line converts to Python floats, which are float64.
> Using lists adds extra bytes of overhead behind the scenes.
> So one would have to read the file in blocks and convert each block
> to a float32 numpy array (a sketch follows after this quote).
> There is not much more to say unless we know more about the format of
> the text file.
> 
> Regards,
> Sebastian Haase
> 
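
A minimal sketch of the blockwise reading Sebastian describes - assuming
whitespace-separated numbers and an arbitrary block size - might look
like this:

import numpy as np

blocksize = 100000            # values per block; an arbitrary choice
block = []
arrays = []

with open("myTextFile.txt") as f:
    for line in f:
        block.extend(float(x) for x in line.split())
        if len(block) >= blocksize:
            arrays.append(np.array(block, dtype=np.float32))
            block = []
if block:                     # convert whatever is left over
    arrays.append(np.array(block, dtype=np.float32))

data = np.concatenate(arrays)   # one float32 array for the whole file

Each block is converted to float32 as soon as it is full, so only one
block's worth of Python floats is alive at any time.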

I actually spent the better part of the afternoon battling with the
HDF5 libraries to install PyTables. But then I tried the easy route
and just looped over the file object, collecting the columns in lists
and then writing everything at once into a tabarray (which is a
subclass of numpy.array). The result: memory usage never goes above
50% and the loading is much faster too. Of course this method will
also fail once the data gets much larger, but for my needs this
pattern seems vastly more efficient than using numpy directly. Maybe
this could be optimized in a future numpy version.
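
Roughly, the pattern looks like the sketch below (the file name and the
three-column layout are made up for illustration; in my case the last
step builds a tabarray from the column lists instead of a plain array):

import numpy as np

# Hypothetical example: a whitespace-separated file with three
# numeric columns.
col1, col2, col3 = [], [], []

with open("data.txt") as f:
    for line in f:
        a, b, c = line.split()
        col1.append(float(a))
        col2.append(float(b))
        col3.append(float(c))

# Convert each column list to a float32 array in one step.
arr = np.column_stack([np.array(c, dtype=np.float32)
                       for c in (col1, col2, col3)])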

So thanks, Sebastian...





