ENH: add NpyAppendArray functionality to numpy.format
My memories reappeared: 3. One could think about allowing variable-sized .npy files without any header modification at all, e.g. by setting the variable-sized shape entry (axis 0) to -1. The length of the array would then be inferred from the file size. However, what I personally dislike about that approach is that, given a .npy file, it would be impossible to determine whether it was actually complete or whether some data got lost, e.g. through an incomplete download. Indeed, the mere length is not as reliable as e.g. a sha256 sum, but it is still better than nothing. Could this be a thing, or is this maybe the preferable solution after all?

On Sun, Nov 7, 2021 at 6:11 PM Michael Siebert <michael.siebert2k@gmail.com> wrote:
Dear all,
I'd like to add the NpyAppendArray functionality, compare
https://github.com/xor2k/npy-append-array (15 Stars so far)
and
https://stackoverflow.com/a/64403144/2755796 (10 Upvotes so far)
I have prepared a pull request and want to "test the waters" as suggested by the message I have received when creating the pull request.
So what is NpyAppendArray about?
I love the .npy file format. It is really great! I cannot appreciate the .npy capabilities mentioned in
https://numpy.org/devdocs/reference/generated/numpy.lib.format.html
enough, especially its simplicity. There is no comparison with the struggles we had with HDF5. However, there is one feature NumPy currently does not provide: a simple, efficient, easy-to-use and safe option to append to .npy files (here is the text I've used in the GitHub repository above):
Appending to an array created by np.save might be possible under certain circumstances, since the .npy total header byte count is required to be evenly divisible by 64. Thus, there might be some spare space to grow the shape entry in the array descriptor. However, this is not guaranteed and might randomly fail. Initialize the array with NpyAppendArray(filename) directly so the header will be created with 64 bytes of spare header space for growth. Will this be enough? It allows for up to 10^64 >= 2^212 array entries or data bits. Indeed, this is less than the number of atoms in the universe. However, fully populating such an array, due to the limits imposed by quantum mechanics, would require more energy than would be needed to boil the oceans, compare
https://hbfs.wordpress.com/2009/02/10/to-boil-the-oceans
Therefore, a wide range of use cases should be coverable with this approach.
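Just to make the header mechanics concrete, here is a rough sketch of mine (not the code from the pull request) of how a version-1.0 .npy header with extra spare space could be written by hand; the function name write_padded_header and the spare parameter are made up for illustration:

import struct
import numpy as np

def write_padded_header(fp, arr, spare=64):
    # Rough sketch, not the actual NpyAppendArray code: write a .npy version-1.0
    # header padded with `spare` extra bytes so the shape entry can grow in place
    # later. Assumes a C-ordered array.
    header = ("{'descr': %r, 'fortran_order': False, 'shape': %r, }"
              % (np.lib.format.dtype_to_descr(arr.dtype), arr.shape))
    magic = b"\x93NUMPY\x01\x00"                 # magic string + format version 1.0
    unpadded = len(magic) + 2 + len(header) + 1  # +2 for the length field, +1 for the trailing '\n'
    padded = -(-(unpadded + spare) // 64) * 64   # round up to a multiple of 64
    header = header + " " * (padded - unpadded) + "\n"
    fp.write(magic)
    fp.write(struct.pack("<H", len(header)))     # little-endian uint16 header length
    fp.write(header.encode("latin1"))

Appending along axis 0 then amounts to writing the new bytes at the end of the file and rewriting the shape entry inside the padded header.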
Who could use that?
I developed and use NpyAppendArray to efficiently create .npy arrays which are larger than the main memory and can be loaded by memory mapping later, e.g. for Deep Learning workflows. Another use case is binary log files, which could be created on low-end embedded devices and later be processed without parsing, optionally again using memory maps.
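For the log-file case, reading could look roughly like this (the file name and record count are made up for illustration):

import numpy as np

# Hypothetical example: open a large .npy log via a memory map and touch only its tail.
log = np.load("sensor_log.npy", mmap_mode="r")  # opens a memory map, no bulk read into RAM
recent = np.asarray(log[-1000:])                # only the last 1000 records are actually read
print(recent.mean(axis=0))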
What does the code look like?
Here is some demo code showing how this looks in practice (taken from the test file):
import os
import numpy as np
from numpy.lib import format
from numpy.testing import assert_array_equal

def test_NpyAppendArray(tmpdir):
    arr1 = np.array([[1, 2], [3, 4]])
    arr2 = np.array([[1, 2], [3, 4], [5, 6]])
    fname = os.path.join(tmpdir, 'npaa.npy')

    # appends happen along axis 0; the header's shape entry is updated in place
    with format.NpyAppendArray(fname) as npaa:
        npaa.append(arr1)
        npaa.append(arr2)
        npaa.append(arr2)

    arr = np.load(fname, mmap_mode="r")
    arr_ref = np.concatenate([arr1, arr2, arr2])
    assert_array_equal(arr, arr_ref)
Some more aspects:
1. Appending efficiently only works along axis=0, at least for C order (probably different for Fortran order); see the small check below this list.
2. One could also add the 64 bytes of spare space right in np.save. However, I cannot really judge how much of an issue that would be for the users of np.save, and it is not really necessary, since users who want to append to .npy files would create them with NpyAppendArray anyway.
3. Probably I have forgotten something here; some time has passed since the initial GitHub commit.
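As a small, self-contained illustration of point 1 (my own example, not from the pull request): for a C-ordered array, appending along axis 0 is just appending raw bytes, so no existing data has to move:

import numpy as np

a = np.arange(6).reshape(2, 3)     # C-ordered by default
b = np.arange(6, 12).reshape(2, 3)
# concatenating along axis 0 yields exactly the bytes of a followed by the bytes of b
assert a.tobytes() + b.tobytes() == np.concatenate([a, b]).tobytes()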
So what do you think? Yes/No/Maybe? It would be really nice to get some feedback on the mailing list here!
Although this might not be perfectly consistent with the protocol, I've created the pull request already, just to force myself to finish this up, and I'm prepared to fail if there is no interest in getting NpyAppendArray directly into NumPy ;)
Best from Berlin, Michael