Hi Matti, hi all,

@Matti: I am not sure what exactly you are referring to (the pull request or the GitHub project; links below), so let me try to clarify ;)

A .npy file created by an appending process is a regular .npy file and does not need to be read in chunks. Processing arrays larger than the system's memory is already possible via memory mapping (numpy.load(..., mmap_mode=...)), so no third-party support is needed for that.
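
For illustration, a minimal sketch ("data.npy" is just a placeholder filename):

    import numpy as np

    # Memory-map the file instead of reading it into RAM; only the
    # pages that are actually accessed get loaded from disk.
    arr = np.load("data.npy", mmap_mode="r")

    # Slicing works as usual and does not load the whole array.
    chunk = arr[1000:2000]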

The idea is not necessarily just to write some known-but-fragmented content to a .npy file in chunks, or only to handle files larger than RAM.

It is more about the ability to append to a .npy file at any time and between program runs. For example, in our case we have a large, database-like file containing all (preprocessed) images of all videos used to train a neural network. When new video data arrives, it is simply appended to the existing .npy file. When training the neural net, the data is memory mapped, which happens basically instantly and does not consume extra space across multiple training processes. We have tried various fancy, advanced data formats for this task, but most of them do not provide memory mapping, which is very handy for keeping the time required to test a code change comfortably low; instead, they have excessive parse/decompress times. Other libraries can also be difficult to handle, see below.

The .npy array format is deliberately limited by design. There is a NEP for it, which summarizes the .npy features and concepts very well:

https://numpy.org/neps/nep-0001-npy-format.html

One of my favorite features (besides memory mapping perhaps) is this one:

“… Be reverse engineered. Datasets often live longer than the programs that created them. A competent developer should be able to create a solution in his preferred programming language to read most NPY files that he has been given without much documentation. …”

This is a big disadvantage of all the fancy formats out there: they require dedicated libraries. Some of these libraries do not come with good, free documentation (in particular, they lack easy-to-use, easy-to-understand code examples for the target language, e.g. C) and/or can be extremely complex, like HDF5. Yes, HDF5 has its users and is totally valid if one operates the world's largest particle accelerator, but we spent weeks searching for a C/C++ library for it that does not expose bugs and is somewhat documented. We actually failed and filed a bug report that was fixed about a year later. This can ruin entire projects; fortunately not ours, but it ate up a lot of time we could have spent more meaningfully.

On the other hand, I do not see how e.g. zarr adds value over .npy if one only needs the .npy features plus some append-data-along-one-axis capability. Yes, maybe there are uses for two or three appendable axes, but a single appendable axis should cover a lot of use cases, since that axis is typically time: video, audio, GPS, signal data in general, binary log data, "binary CSV" (lines in a file) - all of these need only one axis to append to.

The .npy format is so simple that it can be read in a few lines of e.g. C, or accessed easily through NumPy and ctypes via pointers for high-speed custom logic - without requiring any libraries besides NumPy.
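
To illustrate, a sketch of a pure-Python reader for .npy version 1.0, written from the format description alone (placeholder filename, error handling omitted):

    import ast
    import struct
    import numpy as np

    with open("data.npy", "rb") as f:
        assert f.read(6) == b"\x93NUMPY"              # magic string
        major, minor = f.read(2)                      # format version
        header_len = struct.unpack("<H", f.read(2))[0]
        header = ast.literal_eval(f.read(header_len).decode("latin1"))
        # header is a plain Python dict literal, e.g.
        # {'descr': '<f8', 'fortran_order': False, 'shape': (1000, 3)}
        data = np.frombuffer(f.read(), dtype=header["descr"])
        order = "F" if header["fortran_order"] else "C"
        data = data.reshape(header["shape"], order=order)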

Making .npy appendable is easy to implement. Yes, appending along one axis is as limited as the .npy format itself, but I consider that a feature rather than an (actual) limitation, as it allows for fast and simple appends.
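
A rough sketch of how such an append can work with the format as it is today, using numpy.lib.format (it assumes version 1.0, C order, and that the rewritten header still fits into the same number of bytes - exactly the corner case my library below has to handle):

    import numpy as np
    from numpy.lib import format as npf

    def append_along_axis0(filename, arr):
        with open(filename, "rb+") as f:
            assert npf.read_magic(f) == (1, 0)
            shape, fortran_order, dtype = npf.read_array_header_1_0(f)
            assert not fortran_order
            assert arr.dtype == dtype and arr.shape[1:] == shape[1:]
            f.seek(0, 2)                  # jump to the end of the file
            f.write(arr.tobytes("C"))     # append the raw data
            f.seek(0)                     # rewrite magic + header in place
            npf.write_array_header_1_0(f, {
                "descr": npf.dtype_to_descr(dtype),
                "fortran_order": False,
                "shape": (shape[0] + arr.shape[0],) + shape[1:],
            })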

The question is whether there is support in the NumPy community for an append-to-.npy-files-along-one-axis feature and, if so, what the details of the actual implementation should be. I made one suggestion in

https://github.com/numpy/numpy/pull/20321/

and I offer to invest the time to update/modify/finalize the PR. I have also created a library that can already append to .npy files:

https://github.com/xor2k/npy-append-array
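
Usage is roughly as follows (the result is a regular .npy file):

    import numpy as np
    from npy_append_array import NpyAppendArray

    with NpyAppendArray("out.npy") as npaa:
        npaa.append(np.zeros((100, 480, 640), dtype=np.float32))
        npaa.append(np.ones((50, 480, 640), dtype=np.float32))

    # memory map the result as usual
    data = np.load("out.npy", mmap_mode="r")
    assert data.shape == (150, 480, 640)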

However, due to current limitations of the .npy format, the code is more complex than it needs to be (the library initializes and checks spare space in the header), and it has to rewrite the header on every append. Both could be made unnecessary with a very small addition to the .npy file format. Data would stay contiguous (no fragmentation!); there would just need to be a way to indicate that the actual shape of the array should be derived from the file size.
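
To sketch the idea (purely hypothetical - the shape=(-1, ...) convention below is just my illustration, not part of the format):

    import numpy as np

    def infer_shape(header_shape, dtype, data_bytes):
        # data_bytes = file size minus header size; the first axis is
        # derived from how much data is actually present.
        row_bytes = int(np.prod(header_shape[1:], dtype=np.int64)) * dtype.itemsize
        assert data_bytes % row_bytes == 0
        return (data_bytes // row_bytes,) + tuple(header_shape[1:])

    # e.g. 12 frames of 480x640 float32:
    print(infer_shape((-1, 480, 640), np.dtype(np.float32),
                      12 * 480 * 640 * 4))   # -> (12, 480, 640)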

Best, Michael

On 24. Aug 2022, at 19:16, Matti Picus <matti.picus@gmail.com> wrote:

Sorry for the late reply. Adding a new "*.npy" format feature to allow writing to the file in chunks is nice but seems a bit limited. As I understand the proposal, reading the file back can only be done in the chunks that were originally written. I think other libraries like zarr or h5py have solved this problem in a more flexible way. Is there a reason you cannot use a third-party library to solve this? I would think if you have an array too large to write in one chunk you will need third-party support to process it anyway.

Matti
