Saving numpy arrays incrementally
Hi, I am creating numpy arrays in chunks and I want to save the chunks while my program creates them. I tried to use numpy.save, but it failed (because it is not intended to append data). I'd like to know what is, in your opinion, the best way to go. I will add a few thousand values each time, building up a file of several gigabytes, and I do not want to load all previous data into memory each time. Also, I cannot wait until the program finishes; I must save partial results periodically. Thanks, any help will be appreciated. Juan
On 10-Aug-09, at 11:29 PM, Juan Fiol wrote:
> Hi, I am creating numpy arrays in chunks and I want to save the chunks while my program creates them. [...]
PyTables sounds like a good way to go. If you need to append to the arrays themselves, it can do that too, but it can certainly append new arrays to a file. David
I have had some similar challenges in my work, and appending the numpy arrays to HDF5 files using PyTables has been the solution for me. Used in combination with LZO compression/decompression, it has led to very high read/write performance in my application with low memory consumption.
You may also want to have a look at the h5py package.
Kim
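For the h5py route Kim mentions, a resizable dataset gives the same append-as-you-go behaviour. A minimal sketch, not from the thread: the file and dataset names are made up, the 1000x20 float64 chunks stand in for Juan's data, and h5py's built-in 'lzf' codec is used in place of LZO (which ships with PyTables, not h5py):

```python
import numpy as np
import h5py

with h5py.File('results.h5', 'w') as f:
    # dataset starts empty and may grow without bound along axis 0
    dset = f.create_dataset('results', shape=(0, 20), maxshape=(None, 20),
                            dtype='f8', chunks=True, compression='lzf')
    for _ in range(3):
        chunk = np.random.rand(1000, 20)       # one freshly computed chunk
        dset.resize(dset.shape[0] + len(chunk), axis=0)
        dset[-len(chunk):] = chunk             # write it at the end
        f.flush()                              # partial results hit the disk

with h5py.File('results.h5', 'r') as f:
    final_shape = f['results'].shape
print(final_shape)
```

Because the dataset is chunked, each resize-and-append touches only the new rows; nothing already written is re-read or rewritten.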
_______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
On Mon, Aug 10, 2009 at 22:29, Juan Fiol
As others mentioned, PyTables is an excellent, complete solution. If you still want to write your own, then you can pass an open file object to numpy.save() in order to append. Just open it with the mode 'a+b' and seek to the end.

    f = open('myfile.npy', 'a+b')
    f.seek(0, 2)
    numpy.save(f, chunk)
    f.close()

-- Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth." -- Umberto Eco
On Tue, Aug 11, 2009 at 11:05 AM, Robert Kern
    f = open('myfile.npy', 'a+b')
    f.seek(0, 2)
    numpy.save(f, chunk)
    f.close()
That looks nice. What am I doing wrong?
    x = np.array([1, 2, 3])
    y = np.array([4, 5, 6])
    f = open('myfile.npy', 'a+b')
    np.save(f, x)
    f.seek(0, 2)
    np.save(f, y)
    f.close()
    xy = np.load('myfile.npy')
    xy
    array([1, 2, 3])
I was expecting something like array([1, 2, 3, 4, 5, 6]).
Hi, thanks for all the answers. I am checking how to use PyTables now, though I would probably prefer to do it without further dependencies. I tried opening the file in append mode and then pickling the array (because, looking at numpy.save, that seemed to be what it does), but to retrieve the data I then have to load multiple times and concatenate (numpy.c_[]). I have not tried Robert's suggestion yet, but it will probably behave the same way, and that is what Keith is seeing (though I may be wrong too).
If I do not find a suitable solution with numpy alone, I'll learn how to use PyTables. Thanks and best regards,
Juan
Hi again. I can confirm that you have to load multiple times. Also, I see no difference whether or not I use the f.seek line.
The following snippet gives the expected result. The problem is that this way I have to load as many times as I saved. Apart from that, it works. Thanks, Juan
#-----------------------------------------
import numpy as np
x = np.array([[1,2,3],[4,5,6]])
y = np.array([[7,8,9],[10,11,12]])
f = open('myfile1.npy', 'a+b')
np.save(f, x)
# f.seek(0, 2)
np.save(f, y)
f.close()
fi = open('myfile1.npy', 'rb')
x1 = np.load(fi)
y1 = np.load(fi)
fi.close()
#-----------------------------------------
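The read side of Juan's snippet generalizes to any number of chunks: each np.save call writes a self-contained .npy record, so a reader can keep calling np.load on the same file handle until the data runs out, then concatenate. A sketch (the exact exception raised at end-of-file has varied across numpy versions, hence the broad except clause):

```python
import numpy as np

# write several chunks, one np.save per chunk
chunks = [np.array([1, 2, 3]), np.array([4, 5, 6]), np.array([7, 8, 9])]
with open('myfile2.npy', 'wb') as f:
    for c in chunks:
        np.save(f, c)

# read them all back: np.load returns ONE array per call
parts = []
with open('myfile2.npy', 'rb') as f:
    while True:
        try:
            parts.append(np.load(f))
        except (EOFError, ValueError, OSError):  # end of file reached
            break
xy = np.concatenate(parts)
print(xy)  # [1 2 3 4 5 6 7 8 9]
```

The concatenation does pull everything into memory at the end, but the writer never has to; for files too big for that, the HDF5 options discussed in this thread are the better fit.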
You can do something a bit tricky but possibly working. I made the assumption of a C-ordered 1d vector.

    import numpy as np
    import numpy.lib.format as fmt

    # example of chunks
    chunks = [np.arange(l) for l in range(5, 10)]

    # at the beginning
    fp = open('myfile.npy', 'wb')
    d = dict(
        descr=fmt.dtype_to_descr(chunks[0].dtype),
        fortran_order=False,
        shape=(2**30,),  # note the comma: shape must be a tuple;
                         # some big shape you think you'll never reach
    )
    fp.write(fmt.magic(1, 0))
    fmt.write_array_header_1_0(fp, d)
    h_len = fp.tell()
    l = 0

    # ... for each chunk ...
    for chunk in chunks:
        l += len(chunk)
        fp.write(chunk.tostring('C'))

    # finally, rewrite the header with the real shape, padding with spaces
    # so the data still starts at the same offset (this relies on both
    # headers being padded to the same total length)
    fp.seek(0, 0)
    fp.write(fmt.magic(1, 0))
    d['shape'] = (l,)
    fmt.write_array_header_1_0(fp, d)
    fp.write(' ' * (h_len - fp.tell() - 1))
    fp.close()
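A note on running the snippet above with a recent NumPy: write_array_header_1_0 now writes the magic string itself, so the explicit fmt.magic(1, 0) calls would duplicate it, and the final space-padding only lines up when the two headers happen to pad to the same length. A sketch of the same header-rewrite idea for a current NumPy, patching the header's 2-byte length field so the data offset is preserved (the file name and variable names are my own, not from the thread):

```python
import io
import struct
import numpy as np
import numpy.lib.format as fmt

chunks = [np.arange(n) for n in range(5, 10)]

with open('grow.npy', 'wb') as fp:
    d = {'descr': fmt.dtype_to_descr(chunks[0].dtype),
         'fortran_order': False,
         'shape': (2**30,)}            # placeholder, fixed up at the end
    fmt.write_array_header_1_0(fp, d)  # writes magic + length + header
    data_start = fp.tell()
    n_total = 0
    for chunk in chunks:               # stream the chunks out
        n_total += len(chunk)
        fp.write(chunk.tobytes())
    # rebuild the header with the true shape, pad it out to the
    # placeholder's size, and patch the little-endian length field
    buf = io.BytesIO()
    fmt.write_array_header_1_0(buf, dict(d, shape=(n_total,)))
    hdr = bytearray(buf.getvalue())
    pad = data_start - len(hdr)
    assert pad >= 0, "placeholder header was too short"
    hdr[-1:-1] = b' ' * pad            # spaces before the final newline
    hdr[8:10] = struct.pack('<H', len(hdr) - 10)
    fp.seek(0)
    fp.write(bytes(hdr))

print(np.load('grow.npy').shape)  # (35,)
```

The payoff over repeated np.save calls is that the result is a single ordinary .npy file that np.load (or np.lib.format.open_memmap) reads in one shot.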
Hi, I finally decided on the PyTables approach, because it will be easier to work with the data later. Now, I know this is not the right place, but maybe I can get some quick pointers. I compute a numpy array of about 20 columns and a few thousand rows each time. I'd like to append all the rows without iterating over the numpy array. Does someone know what the "right" approach would be? I am looking for something simple; I do not need to keep the piece of the table after I put it into the h5 file. Thanks in advance and regards, Juan
On 12-Aug-09, at 7:11 PM, Juan Fiol wrote:
> I've calculated a numpy array of about 20 columns and a few thousand rows each time. I'd like to append all the rows without iterating over the numpy array. [...]
You'll probably want the EArray: call createEArray() on a new h5file, then append to it. http://www.pytables.org/docs/manual/ch04.html#EArrayMethodsDescr

If your chunks are always the same size, it might be best to try to do your work in place and not allocate a new NumPy array each time. In theory, del-ing the object when you're done with it should work, but the garbage collector may not act quickly enough for your liking, or the allocation step may start slowing you down. What do I mean? Well, you could clear the array when you're done with it using foo[:] = 0 (or nan, or whatever), and when building it up, use the in-place augmented assignment operators as much as possible (+=, /=, -=, *=, %=, etc.). David
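The EArray approach David describes can be sketched as follows, using the modern snake_case PyTables names (in the 2009 API these were openFile()/createEArray()); the file name, chunk sizes, and the zlib codec in place of Kim's LZO are my own illustrative choices:

```python
import numpy as np
import tables  # PyTables

h5 = tables.open_file('partial.h5', mode='w')
earr = h5.create_earray(h5.root, 'results',
                        atom=tables.Float64Atom(),
                        shape=(0, 20),   # length 0 along the extendable axis
                        filters=tables.Filters(complevel=5, complib='zlib'),
                        expectedrows=100000)
for _ in range(4):
    chunk = np.random.rand(2500, 20)  # "about 20 columns, a few thousand rows"
    earr.append(chunk)                # appends all rows at once, no Python loop
    h5.flush()                        # partial results are safe on disk
h5.close()
```

append() takes the whole chunk in one call, so there is no per-row iteration, and the chunk array itself can be discarded (or reused in place, as suggested above) as soon as the call returns.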
participants (6)
- Citi, Luca
- David Warde-Farley
- Juan Fiol
- Keith Goodman
- Kim Hansen
- Robert Kern