ENH: add NpyAppendArray functionality to numpy.format
My memories reappeared: 3. One could think about allowing variable-sized .npy files without any header modification at all, e.g. by setting the variable-sized shape entry (axis 0) to -1. The length of the array would then be inferred from the file size. However, what I personally dislike about that approach is that, given a .npy file, it would be impossible to determine whether it was actually complete or whether some data got lost, e.g. through an incomplete download. Indeed, the mere length is not as reliable as e.g. a sha256 sum, but it is still better than nothing. Could this be a thing, or is this maybe the preferable solution after all?

On Sun, Nov 7, 2021 at 6:11 PM Michael Siebert <michael.siebert2k@gmail.com> wrote:
Dear all,
I'd like to add the NpyAppendArray functionality, compare
https://github.com/xor2k/npy-append-array (15 Stars so far)
and
https://stackoverflow.com/a/64403144/2755796 (10 Upvotes so far)
I have prepared a pull request and want to "test the waters" as suggested by the message I have received when creating the pull request.
So what is NpyAppendArray about?
I love the .npy file format. It is really great! I cannot appreciate the .npy capabilities mentioned in
https://numpy.org/devdocs/reference/generated/numpy.lib.format.html
enough, especially its simplicity. There is no comparison with the struggles we had with HDF5. However, there is one feature NumPy currently does not provide: a simple, efficient, easy-to-use and safe option to append to .npy files (here is the text I've used in the GitHub repository above):
Appending to an array created by np.save might be possible under certain circumstances, since the .npy total header byte count is required to be evenly divisible by 64. Thus, there might be some spare space to grow the shape entry in the array descriptor. However, this is not guaranteed and might randomly fail. Initialize the array with NpyAppendArray(filename) directly so the header will be created with 64 bytes of spare header space for growth. Will this be enough? It allows for up to 10^64 >= 2^212 array entries or data bits. Indeed, this is less than the number of atoms in the universe. However, fully populating such an array, due to the limits imposed by quantum mechanics, would require more energy than would be needed to boil the oceans, compare
https://hbfs.wordpress.com/2009/02/10/to-boil-the-oceans
Therefore, a wide range of use cases should be coverable with this approach.
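Just to make the header mechanics concrete, here is a rough sketch of mine (not the code from the pull request) of how a version-1.0 .npy header with extra spare space could be written by hand; the function name write_padded_header and the spare parameter are made up for illustration:

import struct
import numpy as np

def write_padded_header(fp, arr, spare=64):
    # Rough sketch, not the actual NpyAppendArray code: write a .npy version-1.0
    # header padded with `spare` extra bytes so the shape entry can grow in place
    # later. Assumes a C-ordered array.
    header = ("{'descr': %r, 'fortran_order': False, 'shape': %r, }"
              % (np.lib.format.dtype_to_descr(arr.dtype), arr.shape))
    magic = b"\x93NUMPY\x01\x00"                 # magic string + format version 1.0
    unpadded = len(magic) + 2 + len(header) + 1  # +2 for the length field, +1 for the trailing '\n'
    padded = -(-(unpadded + spare) // 64) * 64   # round up to a multiple of 64
    header = header + " " * (padded - unpadded) + "\n"
    fp.write(magic)
    fp.write(struct.pack("<H", len(header)))     # little-endian uint16 header length
    fp.write(header.encode("latin1"))

Appending along axis 0 then amounts to writing the new bytes at the end of the file and rewriting the shape entry inside the padded header.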
Who could use that?
I developed and use NpyAppendArray to efficiently create .npy arrays which are larger than the main memory and can be loaded by memory mapping later, e.g. for Deep Learning workflows. Another use case is binary log files, which could be created on low-end embedded devices and later be processed without parsing, optionally again using memory maps.
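For the log-file case, reading could look roughly like this (the file name and record count are made up for illustration):

import numpy as np

# Hypothetical example: open a large .npy log via a memory map and touch only its tail.
log = np.load("sensor_log.npy", mmap_mode="r")  # opens a memory map, no bulk read into RAM
recent = np.asarray(log[-1000:])                # only the last 1000 records are actually read
print(recent.mean(axis=0))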
What does the code look like?
Here is some demo code showing how this looks in practice (taken from the test file):
import os
import numpy as np
from numpy.lib import format
from numpy.testing import assert_array_equal

def test_NpyAppendArray(tmpdir):
    arr1 = np.array([[1, 2], [3, 4]])
    arr2 = np.array([[1, 2], [3, 4], [5, 6]])
    fname = os.path.join(tmpdir, 'npaa.npy')

    # appends happen along axis 0; the header's shape entry is updated in place
    with format.NpyAppendArray(fname) as npaa:
        npaa.append(arr1)
        npaa.append(arr2)
        npaa.append(arr2)

    arr = np.load(fname, mmap_mode="r")
    arr_ref = np.concatenate([arr1, arr2, arr2])
    assert_array_equal(arr, arr_ref)
Some more aspects:
1. Appending efficiently only works along axis=0, at least for C order (probably different for Fortran order); see the small check below this list.
2. One could also add the 64 bytes of spare space right in np.save. However, I cannot really judge how much of an issue that would be for the users of np.save, and it is not really necessary, since users who want to append to .npy files would create them with NpyAppendArray anyway.
3. Probably I have forgotten something here; some time has passed since the initial GitHub commit.
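As a small, self-contained illustration of point 1 (my own example, not from the pull request): for a C-ordered array, appending along axis 0 is just appending raw bytes, so no existing data has to move:

import numpy as np

a = np.arange(6).reshape(2, 3)     # C-ordered by default
b = np.arange(6, 12).reshape(2, 3)
# concatenating along axis 0 yields exactly the bytes of a followed by the bytes of b
assert a.tobytes() + b.tobytes() == np.concatenate([a, b]).tobytes()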
So what do you think? Yes/No/Maybe? It would be really nice to get some feedback on the mailing list here!
Although this might not be perfectly consistent with the protocol, I've created the pull request already, just to force myself to finish this up, and I'm prepared to fail if there is no interest in getting NpyAppendArray directly into NumPy ;)
Best from Berlin, Michael