Sebastian Haase <seb.haase <at> gmail.com> writes:
I would expect a 700MB text file to translate into less than 200MB of data, assuming you are talking about decimal numbers (maybe 10 digits each plus whitespace) saved as float32 binary. So the problem would "only" be reading through all the lines of text from start to end without choking. This might be better done "by hand", i.e. in standard (non-numpy) Python:
nums = []
for line in file("myTextFile.txt"):
    fields = line.split()
    nums.extend(map(float, fields))
The last line converts to Python floats, which are float64. Using lists also adds extra bytes per element behind the scenes. So one would have to read the file in blocks and convert each block to a float32 numpy array (sketched below). There is not much more to say unless we know more about the format of the text file.
Regards, Sebastian Haase
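A minimal sketch of the blockwise reading Sebastian describes, assuming a plain whitespace-separated text file of decimal numbers; the file name, block size, and helper function are illustrative, not from the original thread:

import numpy as np

def load_float32(path, block_size=500000):
    # Accumulate raw string tokens and convert them to float32 in blocks,
    # so a full float64 copy of the file never has to live in memory.
    blocks, buf = [], []
    with open(path) as f:
        for line in f:
            buf.extend(line.split())
            if len(buf) >= block_size:
                blocks.append(np.array(buf, dtype=np.float32))
                buf = []
    if buf:
        blocks.append(np.array(buf, dtype=np.float32))
    return np.concatenate(blocks)

data = load_float32("myTextFile.txt")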
I actually spent the better part of the afternoon battling with the HDF5 libraries to install PyTables. But then I tried the easy route: I just looped over the file object, collected the columns in lists, and then wrote everything at once into a tabarray (a numpy array subclass). The result: memory usage never goes above 50% and the loading is much faster too. Of course this method will also fail once the data gets much larger, but for my needs this pattern seems to be vastly more efficient than using numpy directly. Maybe this could be optimized in a future numpy version. So thanks, Sebastian...
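For reference, a rough sketch of that pattern: loop over the file object, collect each column in a plain Python list, and do a single conversion at the end. Plain numpy stands in for tabarray here, and the file name and dtype are assumptions:

import numpy as np

cols = None
with open("myTextFile.txt") as f:
    for line in f:
        fields = line.split()
        if cols is None:
            # One list per column, sized from the first data line.
            cols = [[] for _ in fields]
        for col, value in zip(cols, fields):
            col.append(value)

# Single conversion at the end; the original post wrote this into a
# tabarray rather than a plain float32 array.
data = np.array(cols, dtype=np.float32).T   # shape (n_rows, n_cols)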