A coworker is trying to load a 1Gb text data file into a numpy array using numpy.loadtxt, but he says it is using up all of his machine's 6Gb of RAM. Is there a more efficient way to read such text data files? -- Russell
On 10 Aug 2011, at 19:22, Russell E. Owen wrote:
A coworker is trying to load a 1Gb text data file into a numpy array using numpy.loadtxt, but he says it is using up all of his machine's 6Gb of RAM. Is there a more efficient way to read such text data files?
The npyio routines (loadtxt as well as genfromtxt) first read in the entire data as lists, which of course creates significant overhead, but is not easy to circumvent, since numpy arrays have a fixed size - so you have to first store the numbers in some kind of mutable object. One could write a custom parser that tries to be somewhat more efficient, e.g. first reading in sub-arrays from a smaller buffer. Concatenating those sub-arrays would still require about twice the memory of the final array. I don't know if using the array.array type (which is mutable) is much more efficient than a list...

To really avoid any excess memory usage you'd have to know the total data size in advance - either by reading the file in a first pass to count the rows, or by explicitly specifying it to a custom reader. Basically, assuming a completely regular file without missing values etc., you could then read in the data like

X = np.zeros((n_lines, n_columns), dtype=float)
delimiter = ' '
for n, line in enumerate(file(fname, 'r')):
    X[n] = np.array(line.split(delimiter), dtype=float)

(adjust delimiter and dtype as needed...)

HTH,
Derek
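The first pass to count the rows that Derek mentions could be as simple as the following sketch, using the same hypothetical fname and delimiter as above and assuming the same completely regular file:

with open(fname) as f:
    n_columns = len(f.readline().split(delimiter))   # infer the column count from the first line
    n_lines = 1 + sum(1 for line in f)               # count the remaining rows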
There was also some work on a semi-mutable array type that allowed appending along one axis, then 'freezing' to yield a normal numpy array (unfortunately I'm not sure how to find it in the mailing list archives). One could write such a setup by hand, using mmap() or realloc(), but I'd be inclined to simply write a filter that converted the text file to some sort of binary file on the fly, value by value. Then the file can be loaded in or mmap()ed. A 1 Gb text file is a miserable object anyway, so it might be desirable to convert to (say) HDF5 and then throw away the text file.

Anne

On 10 August 2011 15:43, Derek Homeier <derek@astro.physik.uni-goettingen.de> wrote:
On 10 Aug 2011, at 19:22, Russell E. Owen wrote:
A coworker is trying to load a 1Gb text data file into a numpy array using numpy.loadtxt, but he says it is using up all of his machine's 6Gb of RAM. Is there a more efficient way to read such text data files?
The npyio routines (loadtxt as well as genfromtxt) first read in the entire data as lists, which of course creates significant overhead, but is not easy to circumvent, since numpy arrays have a fixed size - so you have to first store the numbers in some kind of mutable object. One could write a custom parser that tries to be somewhat more efficient, e.g. first reading in sub-arrays from a smaller buffer. Concatenating those sub-arrays would still require about twice the memory of the final array. I don't know if using the array.array type (which is mutable) is much more efficient than a list... To really avoid any excess memory usage you'd have to know the total data size in advance - either by reading the file in a first pass to count the rows, or by explicitly specifying it to a custom reader. Basically, assuming a completely regular file without missing values etc., you could then read in the data like
X = np.zeros((n_lines, n_columns), dtype=float)
delimiter = ' '
for n, line in enumerate(file(fname, 'r')):
    X[n] = np.array(line.split(delimiter), dtype=float)
(adjust delimiter and dtype as needed...)
HTH, Derek
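Anne's "convert on the fly, then mmap" suggestion could look roughly like the following sketch; the file names and the fixed column count are assumptions made for illustration:

import numpy as np

n_cols = 8   # assumed known and constant
n_rows = 0
with open('data.txt') as src, open('data.bin', 'wb') as dst:
    for line in src:
        # convert one text row at a time to binary; nothing large is held in memory
        np.array(line.split(), dtype=np.float64).tofile(dst)
        n_rows += 1

# map the binary file back in without loading it all into RAM
X = np.memmap('data.bin', dtype=np.float64, mode='r', shape=(n_rows, n_cols))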
On 10 Aug 2011, at 22:03, Gael Varoquaux wrote:
On Wed, Aug 10, 2011 at 04:01:37PM -0400, Anne Archibald wrote:
A 1 Gb text file is a miserable object anyway, so it might be desirable to convert to (say) HDF5 and then throw away the text file.
+1
There might be concerns about keeping the data accessible once the text file is thrown away, but converting to HDF5 would be an elegant solution for reading the data in without the memory issues, too (I must confess though, I've regularly read ~1 GB ASCII files into memory - with decent virtual memory management it did not turn out too bad...)

Cheers,
Derek
On 10. aug. 2011, at 21.03, Gael Varoquaux wrote:
On Wed, Aug 10, 2011 at 04:01:37PM -0400, Anne Archibald wrote:
A 1 Gb text file is a miserable object anyway, so it might be desirable to convert to (say) HDF5 and then throw away the text file.
+1
G
+1 and a very warm recommendation of h5py. Paul
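A minimal h5py sketch of the convert-once-then-reuse workflow recommended above; the file names, dataset name and shape are made up for illustration, and the row-by-row write is kept deliberately simple:

import h5py

n_rows, n_cols = 1000000, 8   # assumed known, e.g. from a prescan of the text file
with h5py.File('data.h5', 'w') as h5:
    dset = h5.create_dataset('data', shape=(n_rows, n_cols), dtype='f8')
    for i, line in enumerate(open('data.txt')):
        dset[i] = [float(v) for v in line.split()]

# later: only the slices you ask for are read into memory
with h5py.File('data.h5', 'r') as h5:
    block = h5['data'][:1000]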
In article <CANm_+Zqmsgo8Q+Oz_0RCya-hJv4Q7PqynDb=LzrgvbTxGY3MWQ@mail.gmail.com>, Anne Archibald <aarchiba@physics.mcgill.ca> wrote:
There was also some work on a semi-mutable array type that allowed appending along one axis, then 'freezing' to yield a normal numpy array (unfortunately I'm not sure how to find it in the mailing list archives). One could write such a setup by hand, using mmap() or realloc(), but I'd be inclined to simply write a filter that converted the text file to some sort of binary file on the fly, value by value. Then the file can be loaded in or mmap()ed. A 1 Gb text file is a miserable object anyway, so it might be desirable to convert to (say) HDF5 and then throw away the text file.
Thank you and the others for your help.

It seems a shame that loadtxt has no argument for predicted length, which would allow preallocation and less appending/copying data.

And yes...reading the whole file first to figure out how many elements it has seems sensible to me -- at least as a switchable behavior, and preferably the default. 1Gb isn't that large in modern systems, but loadtxt is filling up all 6Gb of RAM reading it!

I'll suggest the HDF5 solution to my colleague. Meanwhile I think he's hacked around the problem by reading the file through once to figure out the array length, allocating that, and reading the data in with a Python loop. Sounds slow, but it's working.

-- Russell
On 11.08.2011, at 8:50PM, Russell E. Owen wrote:
It seems a shame that loadtxt has no argument for predicted length, which would allow preallocation and less appending/copying data.
And yes...reading the whole file first to figure out how many elements it has seems sensible to me -- at least as a switchable behavior, and preferably the default. 1Gb isn't that large in modern systems, but loadtxt is filling up all 6Gb of RAM reading it!
1 GB is indeed not much in terms of disk space these days, but using text files for such data amounts is nonetheless very much non-state-of-the-art ;-) That said, of course there is no justification to use excessive amounts of memory where it could be avoided!

Implementing the above scheme for npyio is not quite as straightforward as in the example I gave before, mainly for the following reasons: loadtxt also has to deal with more complex data like structured arrays, plus comments, empty lines etc., meaning it has to find and count the actual valid data lines. Ideally, genfromtxt, which offers yet more functionality to deal with missing data, should offer the same options, but they would certainly be more difficult to implement there.

More than 6 GB is still remarkable - from what info I found on the web, lists seem to consume ~24 bytes/element, i.e. 3 times more than the final float64 array. The text representation would typically take 10-20 chars for one float (though with <12 digits, they could usually be read as float32 without loss of precision). Thus a factor >6 seems quite extreme, unless the file is full of (relatively) short integers... But this also means copying the final array would still have a relatively low memory footprint compared to the buffer list, so using some kind of mutable array type for reading should be a reasonable solution as well. Unfortunately fromiter is not of much use here since it only reads 1D arrays.

I haven't tried to use Chris' accumulator class yet, so for now I did go the 2x-read approach with loadtxt; it turned out to add only ~10% to the read-in time. For compressed files this goes up to 30-50%, but once physical memory is exhausted it should probably actually become faster.

I've made a pull request https://github.com/numpy/numpy/pull/144 implementing that option as a switch 'prescan'; could you review it, in particular regarding the following:

Is the option reasonably named and documented?

In the case the allocated array does not match the input data (which really should never happen), right now just a warning is issued, filling any excess buffer with zeros or discarding remaining input data - should this rather raise an IndexError?

No prediction if/when I might be able to provide this for genfromtxt, sorry!

Cheers,
Derek
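As a rough illustration of the extra bookkeeping Derek describes - a sketch only, not the actual npyio code - a prescan that counts only the valid data lines might look like:

n_valid = 0
for line in open(fname):
    line = line.split('#', 1)[0].strip()   # drop comments ('#' is loadtxt's default) and whitespace
    if line:
        n_valid += 1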
In article <781AF0C6-B761-4ABB-9798-9385582536E5@astro.physik.uni-goettingen.de>, Derek Homeier <derek@astro.physik.uni-goettingen.de> wrote:
On 11.08.2011, at 8:50PM, Russell E. Owen wrote:
It seems a shame that loadtxt has no argument for predicted length, which would allow preallocation and less appending/copying data.
And yes...reading the whole file first to figure out how many elements it has seems sensible to me -- at least as a switchable behavior, and preferably the default. 1Gb isn't that large in modern systems, but loadtxt is filling up all 6Gb of RAM reading it!
1 GB is indeed not much in terms of disk space these days, but using text files for such data amounts is nonetheless very much non-state-of-the-art ;-) That said, of course there is no justification to use excessive amounts of memory where it could be avoided! Implementing the above scheme for npyio is not quite as straightforward as in the example I gave before, mainly for the following reasons:
loadtxt also has to deal with more complex data like structured arrays, plus comments, empty lines etc., meaning it has to find and count the actual valid data lines.
Ideally, genfromtxt, which offers yet more functionality to deal with missing data, should offer the same options, but they would certainly be more difficult to implement there.

More than 6 GB is still remarkable - from what info I found on the web, lists seem to consume ~24 bytes/element, i.e. 3 times more than the final float64 array. The text representation would typically take 10-20 chars for one float (though with <12 digits, they could usually be read as float32 without loss of precision). Thus a factor >6 seems quite extreme, unless the file is full of (relatively) short integers... But this also means copying the final array would still have a relatively low memory footprint compared to the buffer list, so using some kind of mutable array type for reading should be a reasonable solution as well. Unfortunately fromiter is not of much use here since it only reads 1D arrays.

I haven't tried to use Chris' accumulator class yet, so for now I did go the 2x-read approach with loadtxt; it turned out to add only ~10% to the read-in time. For compressed files this goes up to 30-50%, but once physical memory is exhausted it should probably actually become faster.
I've made a pull request https://github.com/numpy/numpy/pull/144 implementing that option as a switch 'prescan'; could you review it in particular regarding the following:
Is the option reasonably named and documented?
In the case the allocated array does not match the input data (which really should never happen), right now just a warning is issued, filling any excess buffer with zeros or discarding remaining input data - should this rather raise an IndexError?
No prediction if/when I might be able to provide this for genfromtxt, sorry!
Cheers, Derek
This looks like a great improvement to me! I think the name is well chosen and the help is very clear.

A few comments:

- Might you rename the variable "l"? It is easily confused with the digit 1.

- I don't understand the l < n_valid test, so this may be off base, but I'm surprised that you first massage the data and then raise an exception. Is the massaged data any use after the exception is raised? Naively I would expect you to issue a warning instead of raising an exception if you are going to handle the error by massaging the data.

(It is a pity that your patch duplicates so much parsing code, but I don't see a better way to do it. Putting conditionals in the parsing loop to decide how to handle each line based on prescan would presumably slow things down too much.)

Regards,
-- Russell
On 02.09.2011, at 1:47AM, Russell E. Owen wrote:
I've made a pull request https://github.com/numpy/numpy/pull/144 implementing that option as a switch 'prescan'; could you review it in particular regarding the following:
Is the option reasonably named and documented?
In the case the allocated array does not match the input data (which really should never happen), right now just a warning is issued, filling any excess buffer with zeros or discarding remaining input data - should this rather raise an IndexError?
No prediction if/when I might be able to provide this for genfromtxt, sorry!
Cheers, Derek
This looks like a great improvement to me! I think the name is well chosen and the help is very clear.
Thanks for your feedback, just a few quick comments:
A few comments:

- Might you rename the variable "l"? It is easily confused with the digit 1.

- I don't understand the l < n_valid test, so this may be off base, but I'm surprised that you first massage the data and then raise an exception. Is the massaged data any use after the exception is raised? Naively I would expect you to issue a warning instead of raising an exception if you are going to handle the error by massaging the data.
The exception is currently caught right after the loop, which might seem a bit illogical, but the idea was to handle both cases in one place (if l > n_valid, trying to assign to X[l] will also raise an IndexError, meaning there are data left in the input that could not be stored) - so the present version indeed just issues a warning for both cases, but that could easily be changed...
(It is a pity that your patch duplicates so much parsing code, but I don't see a better way to do it. Putting conditionals in the parsing loop to decide how to handle each line based on prescan would presumably slow things down too much.)
That was my idea behind it - otherwise I would also have considered moving it out into its own functions, but as long as the entire code more or less fits into one editor window, this did not seem like an obstacle to me.

The main update on the issue, however, is that all this is currently on hold because some concerns have been raised about not using dynamic resizing instead (the extra reading pass would break streamed input), and we have been discussing the best way to do this in another thread related to pull request https://github.com/numpy/numpy/pull/143 (which implements similar functionality, plus a lot more, for a genfromtxt-like function). So don't be surprised if the loadtxt patch comes back later, in a completely revised form…

Cheers,
Derek

--
----------------------------------------------------------------
Derek Homeier
Centre de Recherche Astrophysique de Lyon
ENS Lyon
46, Allée d'Italie
69364 Lyon Cedex 07, France
+33 1133 47272-8894
----------------------------------------------------------------
On 8/10/2011 1:01 PM, Anne Archibald wrote:
There was also some work on a semi-mutable array type that allowed appending along one axis, then 'freezing' to yield a normal numpy array (unfortunately I'm not sure how to find it in the mailing list archives).
That was me, and here is the thread -- however, I'm on vacation, and don't have the test code, etc with me, but I found the core class. It's enclosed.
The npyio routines (loadtxt as well as genfromtxt) first read in the entire data as lists, which of course creates significant overhead, but is not easy to circumvent, since numpy arrays have a fixed size - so you have to first store the numbers in some kind of mutable object. One could write a custom parser that tries to be somewhat more efficient, e.g. first reading in sub-arrays from a smaller buffer. Concatenating those sub-arrays would still require about twice the memory of the final array. I don't know if using the array.array type (which is mutable) is much more efficient than a list...
Indeed, and they are holding all the text as well, which is generally going to be bigger than the resulting numbers. Interestingly, when I wrote accumulator, I found that it didn't, for the most part, have any performance advantage over accumulating in lists, then converting to arrays -- but there is a memory advantage, so this may be a good use case.

You could do something like (untested):

If your rows are all one dtype:

X = accumulator(dtype=np.float32, block_shape=(num_cols,))

If they are not, then build a custom dtype to hold the rows, and use that:

# create a dtype that holds a row (num_columns doubles in this case)
dt = np.dtype('%id' % num_columns)
# create an accumulator for that dtype
X = accumulator(dtype=dt)

# loop through the file to build the array:
delimiter = ' '
for line in file(fname, 'r'):
    X.append(np.array(line.split(delimiter), dtype=float))

X = np.array(X)  # gives a regular old array as a copy

I note that converting to a regular array requires a data copy, which, if memory is tight, might not be good. The solution would be to have a way to make a view, so you'd get a regular array from the same data (with maybe the extra buffer space).

I'd like to see this class get more mature, robust, and better performing, but so far it's worked for my use cases. Contributions welcome.

-Chris

--
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

Chris.Barker@noaa.gov
aarrgg! I cleaned up the doc string a bit, but didn't save before sending -- here it is again. Sorry about that.

-Chris

--
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

Chris.Barker@noaa.gov
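The accumulator class itself was sent as an attachment and is not reproduced in this archive. As a stand-in, here is a rough sketch of the append-then-freeze idea Chris describes; the class name, initial size and growth factor are arbitrary, and this is not his code:

import numpy as np

class GrowingArray(object):
    """Minimal append-then-'freeze' row buffer (illustration only)."""
    def __init__(self, n_cols, dtype=np.float64, initial=1024):
        self._data = np.empty((initial, n_cols), dtype=dtype)
        self._n = 0
    def append(self, row):
        if self._n == self._data.shape[0]:
            # grow geometrically so the copies amortize over many appends
            new_shape = (2 * self._data.shape[0], self._data.shape[1])
            self._data = np.resize(self._data, new_shape)
        self._data[self._n] = row
        self._n += 1
    def freeze(self):
        # trim to the rows actually filled; copy() detaches from the oversized buffer
        return self._data[:self._n].copy()

Used like Chris's example: append one parsed line at a time, then call freeze() at the end instead of np.array(X).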
Try the fromiter function; it will allow you to pass an iterator which can read the file line by line and not preload the whole file.

file_iterator = iter(open('filename.txt'))
line_parser = lambda x: map(float, x.split('\t'))
a = np.fromiter(itertools.imap(line_parser, file_iterator), dtype=float)

You also have the option to iterate the file twice and pass the "count" argument.

//Torgil

On Wed, Aug 10, 2011 at 7:22 PM, Russell E. Owen <rowen@uw.edu> wrote:
A coworker is trying to load a 1Gb text data file into a numpy array using numpy.loadtxt, but he says it is using up all of his machine's 6Gb of RAM. Is there a more efficient way to read such text data files?
-- Russell
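As Derek notes elsewhere in the thread, fromiter only builds 1D arrays, so the snippet above works as written only for single-column data. A hedged sketch of a multi-column variant using a structured dtype (the column count, filename and delimiter are assumptions for illustration):

import itertools
import numpy as np

n_cols = 3   # assumed known, e.g. from the first line of the file
row_dtype = np.dtype([('f%d' % i, 'f8') for i in range(n_cols)])
parse = lambda line: tuple(float(v) for v in line.split('\t'))

with open('filename.txt') as f:
    # Python 2 era, as in the thread; on Python 3 the builtin map() replaces itertools.imap
    rec = np.fromiter(itertools.imap(parse, f), dtype=row_dtype)

# reinterpret the homogeneous records as a plain 2-D float array
a = rec.view('f8').reshape(len(rec), n_cols)

Passing count= to fromiter, as Torgil suggests, additionally lets it preallocate the output in one go.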
In article <CA+RwOBWjyY_abjijnxEPkSeRaeom608uiMYwffGaG-6XDgSdPw@mail.gmail.com>, Torgil Svensson <torgil.svensson@gmail.com> wrote:
Try the fromiter function, that will allow you to pass an iterator which can read the file line by line and not preload the whole file.
file_iterator = iter(open('filename.txt'))
line_parser = lambda x: map(float, x.split('\t'))
a = np.fromiter(itertools.imap(line_parser, file_iterator), dtype=float)
You have also the option to iterate the file twice and pass the "count" argument.
Thanks. That sounds great!

-- Russell
participants (7)

- Anne Archibald
- Chris Barker
- Derek Homeier
- Gael Varoquaux
- Paul Anton Letnes
- Russell E. Owen
- Torgil Svensson