On Thu, Jul 8, 2010 at 4:46 PM, Bruce Southey <bsouthey@gmail.com> wrote:
On 07/08/2010 08:52 AM, Wes McKinney wrote:
On Thu, Jul 8, 2010 at 9:26 AM, Hannes Bretschneider <hannes.bretschneider@wiwi.hu-berlin.de> wrote:
Dear NumPy developers,
I have to process some big data files with high-frequency financial data. I am trying to load a delimited text file of ~700 MB with ~10 million lines using numpy.genfromtxt(). The machine is a 32-bit Debian Lenny server with 3 GB of memory. Since the file is just 700 MB, I am naively assuming that it should fit into memory as a whole. However, when I attempt to load it, Python fills the entire available memory and then fails with:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.6/site-packages/numpy/lib/io.py", line 1318, in genfromtxt
    errmsg = "\n".join(errmsg)
MemoryError
Is there a way to load this file without crashing?
Thanks, Hannes
From my experience, I might suggest using PyTables (HDF5) as intermediate storage for the data, which can be populated iteratively (you'll have to parse the data yourself; marking missing data could be a problem). This of course requires that you know the column schema ahead of time, which is one thing that np.genfromtxt handles automatically. Particularly if you have a large static data set, this can be worthwhile, as reading the data back out of HDF5 will be many times faster than parsing the text file.
I believe you can also append rows to the PyTables Table structure in chunks which would be faster than appending one row at a time.
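A rough sketch of that chunked-append pattern, using the current PyTables API; the column layout (timestamp/price/volume), file names, and chunk size below are only placeholders for whatever the real schema is:

import tables

class Tick(tables.IsDescription):
    # hypothetical schema -- replace with the real column layout
    timestamp = tables.Float64Col(pos=0)
    price = tables.Float32Col(pos=1)
    volume = tables.Int32Col(pos=2)

h5 = tables.open_file("ticks.h5", mode="w")
table = h5.create_table("/", "ticks", Tick, "high-frequency data")

chunk = []
with open("myTextFile.txt") as fh:
    for line in fh:
        fields = line.split()
        chunk.append((float(fields[0]), float(fields[1]), int(fields[2])))
        if len(chunk) == 100000:
            table.append(chunk)   # one append per chunk, not per row
            chunk = []
if chunk:
    table.append(chunk)
table.flush()
h5.close()

Slicing the table (or table.read()) afterwards gives you NumPy arrays directly without re-parsing the text.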
hth, Wes
There have been past discussions on this. NumPy needs contiguous memory, so you are running out because loading the original text data and building the NumPy array together exhaust the contiguous memory you have available. Note that a file of ~700 MB does not translate into ~700 MB of memory, since the footprint depends on the dtypes. Also, a system with 3 GB of memory probably has only about 1.5 GB of free memory available (you might get closer to 2 GB on a very lean system).
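As a rough back-of-the-envelope (the 10-million-line by 5-column float layout below is only a guess; substitute the real number of columns):

import numpy as np

rows, cols = 10 ** 7, 5   # hypothetical shape of the parsed data
print(rows * cols * np.dtype(np.float64).itemsize / 1e6)  # 400.0 MB as float64
print(rows * cols * np.dtype(np.float32).itemsize / 1e6)  # 200.0 MB as float32

And that is only the final array; the temporary Python objects created while parsing come on top of it.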
If you know your data, then you have to do all the hard work yourself to minimize memory usage, or use something like HDF5 or PyTables.
Bruce
I would expect a 700 MB text file to translate into less than 200 MB of data, assuming you are talking about decimal numbers (maybe 10 digits each plus a space) saved as float32 binary. So the problem would "only" be loading in, or rather going through, all the lines of text from start to end without choking. This might be better done "by hand", i.e. in standard (non-NumPy) Python:

nums = []
for line in open("myTextFile.txt"):
    fields = line.split()
    nums.extend(map(float, fields))

The last line converts to Python floats, which are float64, and the list adds extra bytes per element behind the scenes. So one would have to read in blocks and convert each block to a float32 NumPy array (a sketch of that follows below). There is not much more to say unless we know more about the format of the text file.

Regards,
Sebastian Haase
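A minimal sketch of that block-wise approach, assuming every line holds the same number of whitespace-separated numeric fields with no missing values (the block size and file name are placeholders):

import numpy as np

def load_in_blocks(path, block_lines=500000, dtype=np.float32):
    # Parse the file block by block so that only one block's worth of
    # temporary Python objects is alive at a time; each block is
    # converted straight to a compact float32 array.
    blocks, buf = [], []
    with open(path) as fh:
        for line in fh:
            buf.append(line.split())
            if len(buf) == block_lines:
                blocks.append(np.asarray(buf, dtype=dtype))
                buf = []
    if buf:
        blocks.append(np.asarray(buf, dtype=dtype))
    return np.concatenate(blocks)

data = load_in_blocks("myTextFile.txt")

The final concatenate still needs room for the complete float32 array plus one parsed block, but it avoids keeping millions of Python float objects alive at once.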