On Thu, Jul 8, 2010 at 4:46 PM, Bruce Southey <bsouthey@gmail.com> wrote:
On 07/08/2010 08:52 AM, Wes McKinney wrote:
On Thu, Jul 8, 2010 at 9:26 AM, Hannes Bretschneider <hannes.bretschneider@wiwi.hu-berlin.de> wrote:
Dear NumPy developers,
I have to process some big data files with high-frequency financial data. I am trying to load a delimited text file of ~700 MB with ~10 million lines using numpy.genfromtxt(). The machine is a 32-bit Debian Lenny server with 3 GB of memory. Since the file is just 700 MB, I am naively assuming that it should fit into memory as a whole. However, when I attempt to load it, Python fills the entire available memory and then fails with:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.6/site-packages/numpy/lib/io.py", line 1318, in genfromtxt
    errmsg = "\n".join(errmsg)
MemoryError
Is there a way to load this file without crashing?
Thanks, Hannes
From my experience, I might suggest using PyTables (HDF5) as intermediate storage for the data, which can be populated iteratively (you'll have to parse the data yourself; marking missing data could be a problem). This of course requires that you know the column schema ahead of time, which is one thing that np.genfromtxt handles automatically. Particularly if you have a large static data set, this can be worthwhile, as reading the data back out of HDF5 will be many times faster than parsing the text file.
I believe you can also append rows to the PyTables Table structure in chunks which would be faster than appending one row at a time.
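A rough sketch of that chunked-append pattern, using the current PyTables API; the column layout (timestamp/price/volume), file names, and chunk size below are only placeholders for whatever the real schema is:

import tables

class Tick(tables.IsDescription):
    # hypothetical schema -- replace with the real column layout
    timestamp = tables.Float64Col(pos=0)
    price = tables.Float32Col(pos=1)
    volume = tables.Int32Col(pos=2)

h5 = tables.open_file("ticks.h5", mode="w")
table = h5.create_table("/", "ticks", Tick, "high-frequency data")

chunk = []
with open("myTextFile.txt") as fh:
    for line in fh:
        fields = line.split()
        chunk.append((float(fields[0]), float(fields[1]), int(fields[2])))
        if len(chunk) == 100000:
            table.append(chunk)   # one append per chunk, not per row
            chunk = []
if chunk:
    table.append(chunk)
table.flush()
h5.close()

Slicing the table (or table.read()) afterwards gives you NumPy arrays directly without re-parsing the text.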
hth, Wes
There have been past discussions on this. NumPy needs contiguous memory, so you are running out because loading the original text data and building the NumPy array together exhaust the contiguous memory you have available. Note that a file of ~700 MB does not translate into ~700 MB of memory, since the footprint depends on the dtypes. Also, a system with 3 GB of memory probably has only about 1.5 GB of free memory available (you might get closer to 2 GB on a very lean system).
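As a rough back-of-the-envelope (the 10-million-line by 5-column float layout below is only a guess; substitute the real number of columns):

import numpy as np

rows, cols = 10 ** 7, 5   # hypothetical shape of the parsed data
print(rows * cols * np.dtype(np.float64).itemsize / 1e6)  # 400.0 MB as float64
print(rows * cols * np.dtype(np.float32).itemsize / 1e6)  # 200.0 MB as float32

And that is only the final array; the temporary Python objects created while parsing come on top of it.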
If you know your data, then you have to do all the hard work yourself to minimize memory usage, or use something like HDF5 or PyTables.
Bruce
I would expect a 700 MB text file to translate into less than 200 MB of data, assuming you are talking about decimal numbers (maybe 10 digits each plus a space) saved as float32 binary. So the problem would "only" be loading in, or rather going through, all the lines of text from start to end without choking. This might be better done "by hand", i.e. in standard (non-NumPy) Python:

nums = []
for line in open("myTextFile.txt"):
    fields = line.split()
    nums.extend(map(float, fields))

The last line converts to Python floats, which are float64, and the list adds extra bytes per element behind the scenes. So one would have to read in blocks and convert each block to a float32 NumPy array (a sketch of that follows below). There is not much more to say unless we know more about the format of the text file.

Regards,
Sebastian Haase
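A minimal sketch of that block-wise approach, assuming every line holds the same number of whitespace-separated numeric fields with no missing values (the block size and file name are placeholders):

import numpy as np

def load_in_blocks(path, block_lines=500000, dtype=np.float32):
    # Parse the file block by block so that only one block's worth of
    # temporary Python objects is alive at a time; each block is
    # converted straight to a compact float32 array.
    blocks, buf = [], []
    with open(path) as fh:
        for line in fh:
            buf.append(line.split())
            if len(buf) == block_lines:
                blocks.append(np.asarray(buf, dtype=dtype))
                buf = []
    if buf:
        blocks.append(np.asarray(buf, dtype=dtype))
    return np.concatenate(blocks)

data = load_in_blocks("myTextFile.txt")

The final concatenate still needs room for the complete float32 array plus one parsed block, but it avoids keeping millions of Python float objects alive at once.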