Dear NumPy developers,

I have to process some big data files with high-frequency financial data. I am trying to load a delimited text file of ~700 MB with ~10 million lines using numpy.genfromtxt(). The machine is a 32-bit Debian Lenny server with 3 GB of memory. Since the file is only 700 MB, I naively assumed it would fit into memory as a whole. However, when I attempt to load it, Python fills all available memory and then fails with:

    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/usr/local/lib/python2.6/site-packages/numpy/lib/io.py", line 1318, in genfromtxt
        errmsg = "\n".join(errmsg)
    MemoryError

Is there a way to load this file without crashing?

Thanks,
Hannes
On Thu, Jul 8, 2010 at 9:26 AM, Hannes Bretschneider <hannes.bretschneider@wiwi.hu-berlin.de> wrote:
From my experience I would suggest using PyTables (HDF5) as intermediate storage for the data, which can be populated iteratively (you'll have to parse the data yourself, and marking missing data could be a problem). This of course requires that you know the column schema ahead of time, which is one thing that np.genfromtxt handles automatically. Particularly if you have a large static data set this can be worthwhile, as reading the data out of HDF5 will be many times faster than parsing the text file.

I believe you can also append rows to the PyTables Table structure in chunks, which would be faster than appending one row at a time.

hth,
Wes
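A minimal sketch of the approach Wes describes, using the current PyTables API names; the file names, the three-column schema (timestamp, price, volume), the comma delimiter, and the chunk size are all assumptions, since the real file layout was not posted:

    import numpy as np
    import tables

    # assumed column schema -- adjust to the real file layout
    schema = np.dtype([('timestamp', 'i8'), ('price', 'f8'), ('volume', 'i8')])
    CHUNK = 100000  # rows appended per call; tune to available memory

    h5 = tables.open_file('ticks.h5', mode='w')
    table = h5.create_table('/', 'ticks', description=schema)

    rows = []
    for line in open('ticks.csv'):
        t, p, v = line.split(',')
        rows.append((int(t), float(p), int(v)))
        if len(rows) == CHUNK:
            table.append(rows)   # append a whole chunk at once
            rows = []
    if rows:
        table.append(rows)
    table.flush()
    h5.close()

Reading the table back later (e.g. tables.open_file('ticks.h5').root.ticks[:]) avoids re-parsing the text file entirely, which is where the speedup for a static data set comes from.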
On 07/08/2010 08:52 AM, Wes McKinney wrote:
There have been past discussions on this. NumPy needs contiguous memory, so you are running out because holding the parsed text and the resulting NumPy array at the same time exhausts your available contiguous memory. Note that a file of ~700 MB does not translate into ~700 MB of memory, since the actual size depends on the dtypes. Also, a system with 3 GB of memory probably has only about 1.5 GB of free memory available (you might get closer to 2 GB on a very lean system).

If you know your data, then you either have to do all the hard work yourself to minimize memory usage, or use something like HDF5 or PyTables.

Bruce
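A rough back-of-the-envelope illustration of Bruce's point about dtypes; the five-float64-column layout is an assumption, since the real file format was not posted:

    import numpy as np

    n_rows = 10 * 10**6
    # assumed layout: 5 float64 columns for the ~700 MB text file
    row = np.dtype([('c%d' % i, 'f8') for i in range(5)])
    print("array alone: %.0f MB" % (row.itemsize * n_rows / 1e6))   # ~400 MB
    # genfromtxt additionally holds intermediate Python strings and lists
    # while parsing, which is what pushes a 32-bit process past its limit.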
On Thu, Jul 8, 2010 at 4:46 PM, Bruce Southey <bsouthey@gmail.com> wrote:
I would expect a 700 MB text file to translate into less than 200 MB of data, assuming you are talking about decimal numbers (maybe 10 digits each plus spaces) saved as float32 binary. So the problem would "only" be loading, or rather going through, all lines of text from start to end without choking. This might be better done "by hand", i.e. in standard (non-numpy) Python:

    nums = []
    for line in file("myTextFile.txt"):
        fields = line.split()
        nums.extend(map(float, fields))

The last line converts to Python floats, which are float64, and using lists adds extra bytes behind the scenes. So one would have to read in blocks and convert each block to a float32 numpy array. There is not much more to say unless we know more about the format of the text file.

Regards,
Sebastian Haase
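A minimal sketch of the block-wise reading Sebastian describes, assuming whitespace-delimited float columns with a constant column count; the file name and block size are placeholders:

    import numpy as np

    BLOCK_ROWS = 500000        # rows per block; tune to available memory
    blocks, buf = [], []

    for line in open("myTextFile.txt"):
        buf.append([float(x) for x in line.split()])
        if len(buf) == BLOCK_ROWS:
            # convert the block to float32 right away and drop the Python lists
            blocks.append(np.array(buf, dtype=np.float32))
            buf = []
    if buf:
        blocks.append(np.array(buf, dtype=np.float32))

    data = np.vstack(blocks)   # final 2-D float32 array

Peak memory is then roughly the float32 blocks plus one block's worth of temporary Python objects, rather than Python objects for the whole file at once.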
Sebastian Haase wrote:
If you know how big your array needs to be, you can pre-allocate it with np.empty() or np.ones(), then fill it in as you read the file -- that is about as memory-efficient as you can get.

Another option: I wrote an expandable array class in Python a while back. It turns out not to be faster than using a list, but it should be more memory-efficient -- you might try it (enclosed).

-Chris

--
Christopher Barker, Ph.D.
Oceanographer
Emergency Response Division
NOAA/NOS/OR&R
7600 Sand Point Way NE, Seattle, WA 98115
(206) 526-6959 voice / (206) 526-6329 fax / (206) 526-6317 main reception
Chris.Barker@noaa.gov
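A minimal sketch of the pre-allocation approach Chris suggests, assuming the row count is known in advance and a fixed number of float columns; the shape and file name are placeholders:

    import numpy as np

    n_rows, n_cols = 10 * 10**6, 5                        # assumed shape
    data = np.empty((n_rows, n_cols), dtype=np.float32)   # allocated once, ~200 MB

    for i, line in enumerate(open("myTextFile.txt")):
        data[i] = [float(x) for x in line.split()]        # fill in place

If the row count is unknown, a quick first pass such as sum(1 for _ in open("myTextFile.txt")) gives it, at the cost of reading the file twice.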
Sebastian Haase <seb.haase <at> gmail.com> writes:
I actually spent the better part of the afternoon battling with the HDF5 libraries to install PyTables. But then I tried the easy route and just looped over the file object, collecting the columns in lists and then writing everything at once into a tabarray (a subclass of numpy.ndarray). The result: memory usage never goes above 50% and the loading is much faster too. Of course this method will also fail once the data gets much larger, but for my needs this pattern seems vastly more efficient than using numpy.genfromtxt directly. Maybe this could be optimized in a future numpy version.

So thanks, Sebastian...
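A minimal sketch of the pattern Hannes describes, with a plain structured array standing in for tabarray; the column names, the comma delimiter, and the two-column layout are assumptions:

    import numpy as np

    timestamps, prices = [], []            # one Python list per column
    for line in open("ticks.csv"):
        t, p = line.split(",")[:2]
        timestamps.append(float(t))
        prices.append(float(p))

    # one conversion at the end instead of genfromtxt's per-row machinery
    data = np.empty(len(prices), dtype=[("timestamp", "f8"), ("price", "f8")])
    data["timestamp"] = timestamps
    data["price"] = prices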
participants (5)

- Bruce Southey
- Christopher Barker
- Hannes Bretschneider
- Sebastian Haase
- Wes McKinney