Dear all,

originally, I have planned to make an extension of the .npy file format a dedicated follow-up pull request, but I have upgraded my current request instead, since it was not as difficult to implement as I initially thought and probably a more straight-forward solution:

https://github.com/numpy/numpy/pull/20321/

What is this pull request about? It is about appending to Numpy .npy files. Why? I see two main use cases:

creating .npy files larger than the main memory. They can, once finished, be loaded as memory maps
creating binary log files, which can be processed very efficiently without parsing

Are there not other good file formats to do this? Theoretically yes, but practically they can be pretty complex and with very little tweaking .npy could do efficient appending too.

Use case 1 is already covered by the Pip/Conda package npy-append-array I have created and getting the functionality directly into Numpy was the original goal of the pull request. This would have been possible without introducing a new file format version, just by adding some spare space in the header. During the pull request discussion it turned out that rewriting the header after each append would be desirable in case the writing program crashes to minimize data loss.

Use case 2 however would highly profit from a new file format version as it would make rewriting the header unnecessary: since efficient appending can only take place along one axis, setting shape[-1] = -1 in case of Fortran order or shape[0] = -1 otherwise (default) in the .npy header on file creation could indicate that the array size is determined by the file size: when np.load (typically with memory mapping on) gets called, it constructs the ndarray with the actual shape by replacing the -1 in the constructor call. Otherwise, the header is not modified anymore, neither on append nor on file write finish.

Concurrent appends to a single file would not be advisable and should be channeled through a single AppendArray instance. Concurrent reads while writes take place however should work relatively smooth: every time an np.load (ideally with mmap) is called, the ndarray would provide access to all data written until that time.

Currently, my pull request provides:

A definition of .npy version 4.0 that supports -1 in the shape
implementations for fortran order and non-fortran order (default) including test cases
Updated np.load
The AppendArray class that does the actual appending

Although there is a certain hassle with introducing a new .npy version, the changes themselves are very small. I could also implement a fallback mode for older Numpy installations, if someone is interested.

What do you think about such a feature, would it make sense? Anyone available for some more code review?

Best from Berlin, Michael

PS thank you so far, I could improve my npy-append-array module as well and from what I have seen so far the Numpy code readability exceeded my already high expectations.