An extension of the .npy file format
Dear all,

originally I had planned to propose the extension of the .npy file format in a dedicated follow-up pull request, but I have upgraded my current pull request instead, since it was not as difficult to implement as I initially thought and is probably the more straightforward solution: https://github.com/numpy/numpy/pull/20321/

What is this pull request about? It is about appending to Numpy .npy files. Why? I see two main use cases:

1. Creating .npy files larger than main memory. Once finished, they can be loaded as memory maps.
2. Creating binary log files, which can be processed very efficiently without parsing.

Are there not other good file formats for this? Theoretically yes, but in practice they can be pretty complex, and with very little tweaking .npy could do efficient appending too.

Use case 1 is already covered by the pip/conda package npy-append-array I have created, and getting that functionality directly into Numpy was the original goal of the pull request. This would have been possible without introducing a new file format version, just by adding some spare space in the header. During the pull request discussion it turned out that rewriting the header after each append would be desirable to minimize data loss in case the writing program crashes.

Use case 2, however, would profit greatly from a new file format version, as it would make rewriting the header unnecessary: since efficient appending can only take place along one axis, setting shape[-1] = -1 in the case of Fortran order, or shape[0] = -1 otherwise (the default), in the .npy header at file creation could indicate that the array size is determined by the file size. When np.load (typically with memory mapping enabled) is called, it constructs the ndarray with the actual shape by replacing the -1 in the constructor call. Apart from that, the header is never modified again, neither on append nor when writing finishes. Concurrent appends to a single file would not be advisable and should be channeled through a single AppendArray instance. Concurrent reads while writes take place should, however, work relatively smoothly: every time np.load (ideally with mmap) is called, the resulting ndarray provides access to all data written up to that point.

Currently, my pull request provides:

1. A definition of .npy version 4.0 that supports -1 in the shape
2. Implementations for Fortran order and non-Fortran order (default), including test cases
3. An updated np.load
4. The AppendArray class that does the actual appending

Although introducing a new .npy version involves a certain amount of hassle, the changes themselves are very small. I could also implement a fallback mode for older Numpy installations, if someone is interested.

What do you think about such a feature, would it make sense? Anyone available for some more code review?

Best from Berlin, Michael

PS: Thank you so far. I could improve my npy-append-array module along the way, and from what I have seen so far, the readability of the Numpy code exceeded my already high expectations.
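To illustrate the load-side behaviour described above, here is a minimal sketch of how a loader could resolve the proposed -1 shape entry. The helper name, the header_size argument and the example numbers are made up for illustration; this is not the PR's actual code.

import os
from math import prod

import numpy as np

def resolve_appendable_shape(filename, dtype, shape, fortran_order, header_size):
    # Proposed convention: shape[0] == -1 (C order) or shape[-1] == -1 (Fortran
    # order) means the length of that axis is derived from the file size.
    data_bytes = os.path.getsize(filename) - header_size
    fixed_axes = shape[:-1] if fortran_order else shape[1:]
    row_bytes = prod(fixed_axes) * np.dtype(dtype).itemsize
    n = data_bytes // row_bytes  # a trailing partial row from a crashed append is ignored
    if fortran_order:
        return tuple(fixed_axes) + (n,)
    return (n,) + tuple(fixed_axes)

# Example: a header declaring shape (-1, 640, 480) with dtype uint8, in a file of
# 2_457_600_128 bytes with a 128-byte header, resolves to (8000, 640, 480).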
Sorry for the late reply. Adding a new "*.npy" format feature to allow writing to the file in chunks is nice but seems a bit limited. As I understand the proposal, reading the file back can only be done in the chunks that were originally written. I think other libraries like zarr or h5py have solved this problem in a more flexible way. Is there a reason you cannot use a third-party library to solve this? I would think that if you have an array too large to write in one chunk, you will need third-party support to process it anyway.

Matti
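For reference, the chunked-and-appendable workflow Matti alludes to looks roughly like this in zarr. This is only a sketch assuming the zarr v2 API (zarr.open and Array.append); the store name and shapes are made up.

import numpy as np
import zarr

# Create a chunked, growable on-disk array (mode='a' would reopen an existing one).
z = zarr.open('frames.zarr', mode='w', shape=(0, 640, 480),
              chunks=(64, 640, 480), dtype='uint8')

# Append new frames along axis 0; only the affected chunks and metadata are rewritten.
z.append(np.zeros((128, 640, 480), dtype='uint8'))
print(z.shape)  # (128, 640, 480)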
Hi Matti, hi all,

@Matti: I don't know what exactly you are referring to (the pull request or the GitHub project, links below). Maybe some clarification is needed, which I hereby try to provide ;)

A .npy file created by some appending process is a regular .npy file and does not need to be read in chunks. Processing arrays larger than the system's memory can already be done with memory mapping (numpy.load(... mmap_mode=...)), so no third-party support is needed for that.

The idea is not necessarily to write some known-but-fragmented content to a .npy file in chunks, or only to handle files larger than the RAM. It is more about the ability to append to a .npy file at any time and between program runs. For example, in our case we have a large database-like file containing all (preprocessed) images of all videos used to train a neural network. When new video data arrives, it can simply be appended to the existing .npy file. When training the neural net, the data is simply memory mapped, which happens basically instantly and does not use extra space across multiple training processes. We have tried various fancy, advanced data formats for this task, but most of them don't provide the memory mapping feature, which is very handy for keeping the time required to test a code change comfortably low; instead, they have excessive parse/decompress times. Other libraries can also be difficult to handle, see below.

The .npy array format is designed to be limited. There is a NEP for it, which summarizes the .npy features and concepts very well:

https://numpy.org/neps/nep-0001-npy-format.html

One of my favorite features (besides memory mapping perhaps) is this one:

"... Be reverse engineered. Datasets often live longer than the programs that created them. A competent developer should be able to create a solution in his preferred programming language to read most NPY files that he has been given without much documentation. ..."

This is a big disadvantage of all the fancy formats out there: they require dedicated libraries. Some of these libraries don't come with nice and free documentation (especially lacking easy-to-use, easy-to-understand code examples for the target language, e.g. C) and/or can be extremely complex, like HDF5. Yes, HDF5 has its users and is totally valid if one operates the world's largest particle accelerator, but we have spent weeks looking for a C/C++ library for it that does not expose bugs and is at least somewhat documented. We actually failed and reported a bug which was fixed a year later or so. This can ruin entire projects; fortunately not ours, but it ate up a lot of time we could have spent more meaningfully. On the other hand, I don't see how e.g. zarr provides added value over .npy if one only needs the .npy features plus some append-data-along-one-axis feature. Yes, maybe there are some uses for two or three appendable axes, but I think having one axis to append to should cover a lot of use cases. That axis is typically time: video, audio, GPS, signal data in general, binary log data, "binary CSV" (lines in a file) - all of those only need one axis to append to.

The .npy format is so simple that it can be read in C in a few lines, or accessed easily through Numpy and ctypes via pointers for high-speed custom logic, without requiring any libraries besides Numpy.

Making .npy appendable is easy to implement. Yes, appending along one axis is limited, just like the .npy format itself. But I consider that a feature rather than an (actual) limitation, as it allows for fast and simple appends.

The question is whether there is support for an append-to-.npy-files-along-one-axis feature in the Numpy community, and if so, what the details of the actual implementation should be. I made one suggestion in

https://github.com/numpy/numpy/pull/20321/

and I offer to invest time to update/modify/finalize the PR. I've also created a library that can already append to .npy:

https://github.com/xor2k/npy-append-array

However, due to current limitations of the .npy format, the code is more complex than it could be (the library initializes and checks spare space in the header), and it needs to rewrite the header on every append. Both could be made unnecessary with a very small addition to the .npy file format. Data would stay contiguous (no fragmentation!); there just needs to be a way to indicate that the actual shape of the array should be derived from the file size.

Best, Michael
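For reference, appending with the npy-append-array package mentioned above looks roughly like this. This is a sketch based on the project's README; the file name and shapes are made up, and the exact API may differ between versions.

import numpy as np
from npy_append_array import NpyAppendArray

filename = 'samples.npy'

with NpyAppendArray(filename) as naa:
    naa.append(np.random.rand(100, 16))   # creates the file on the first append
    naa.append(np.random.rand(250, 16))   # grows it along axis 0, rewriting the header

data = np.load(filename, mmap_mode='r')   # memory-mapped view of all appended data
print(data.shape)                         # (350, 16)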
I am curious what you and other developers think about adopting JSON/binary JSON as a similarly simple, reverse-engineerable, but universally parsable array exchange format, instead of designing another numpy-specific binary format.

I am interested in this topic (as well as in the thoughts of numpy developers) because I am currently working on a project, NeuroJSON (https://neurojson.org), funded by the US National Institutes of Health. The goal of the NeuroJSON project is to create easy-to-adopt, easy-to-extend, and preferably human-readable data formats to help disseminate and exchange neuroimaging data (and scientific data in general).

Needless to say, numpy is a key toolkit that is widely used in neuroimaging data analysis pipelines. I've seen discussions of potentially adopting npy as a standardized way to share volumetric data (as ndarrays), such as in this thread:

https://github.com/bids-standard/bids-specification/issues/197

However, several limitations were also discussed, for example:

1. npy only supports a single numpy array and does not support other metadata or more complex data records (multiple arrays can only be achieved via multiple files)
2. no internal (i.e. data-level) compression, only file-level compression
3. although the file format is simple, it still requires a parser to read/write, and such a parser is not widely available in other environments, making it mostly limited to exchanging data among Python programs
4. I am not entirely sure, but I suppose it does not support sparse matrices or special matrices (such as diagonal/band/symmetric etc.) - I could be wrong though

In the NeuroJSON project, we primarily use JSON and binary JSON (specifically, the UBJSON-derived BJData format, see https://ubjson.org/ and https://json.nlohmann.me/features/binary_formats/bjdata/) as the underlying data exchange files. Through standardized data annotations (https://github.com/NeuroJSON/jdata/blob/master/JData_specification.md#data-annotation-keywords), we are able to address most of the above limitations: the generated files are universally parsable in nearly all programming environments with existing parsers, support complex hierarchical data and compression, and can readily benefit from the large ecosystem of JSON (JSON Schema, JSONPath, JSON-LD, jq, numerous parsers, web-ready, NoSQL databases ...).

I understand that simplicity is a key design spec here. I want to highlight UBJSON/BJData as a competitive alternative format. It was also designed with simplicity in mind from the start (https://ubjson.org/#why), yet it allows storing hierarchical, strongly typed, complex binary data and is easily extensible.

A UBJSON/BJData parser is not necessarily longer than an npy parser; for example, the Python reader for the full spec takes only about 500 lines of code (including comments), and similarly for the JS parser:

https://github.com/NeuroJSON/pybj/blob/master/bjdata/decoder.py
https://github.com/NeuroJSON/js-bjdata/blob/master/bjdata.js

We actually did a benchmark (https://github.com/neurolabusc/MeshFormatsJS) a few months back. The test workloads were two large 2D numerical arrays (node, face to store surface mesh data), and we compared the parsing speed of various formats in Python, MATLAB, and JS. The uncompressed BJData (BMSHraw) loaded nearly as fast as a raw binary dump, and the internally compressed BJData (BMSHz) gave the best balance between small file size and loading speed; see our results here:

https://pbs.twimg.com/media/FRPEdLGWYAEJe80?format=png&name=large

I want to add two quick points to echo the features you desired in npy:

1. it is not common to use mmap when reading JSON/binary JSON files, but it is certainly possible. I recently wrote a JSON-mmap spec (https://github.com/NeuroJSON/jsonmmap/blob/main/JSON-Mmap_Specification.md) and a MATLAB reference implementation (https://github.com/NeuroJSON/jsonmmap/tree/main/lib)
2. UBJSON/BJData natively supports appendable root-level records; JSON has been used extensively in data streaming with appendable ND-JSON or concatenated JSON (https://en.wikipedia.org/wiki/JSON_streaming)

Just a quick comparison of output file sizes with a 1000x1000 unitary diagonal matrix:

# python3 -m pip install jdata bjdata
import numpy as np
import jdata as jd
x = np.eye(1000)                 # create a large array
y = np.vsplit(x, 5)              # split into smaller chunks
np.save('eye5chunk.npy', y)      # save npy
jd.save(y, 'eye5chunk_bjd_raw.jdb')                           # save as uncompressed bjd
jd.save(y, 'eye5chunk_bjd_zlib.jdb', {'compression': 'zlib'}) # zlib-compressed bjd
jd.save(y, 'eye5chunk_bjd_lzma.jdb', {'compression': 'lzma'}) # lzma-compressed bjd
newy = jd.load('eye5chunk_bjd_zlib.jdb')  # loading/decoding
newx = np.concatenate(newy)               # regroup chunks
newx.dtype

Here are the output file sizes in bytes:

8000128 eye5chunk.npy
5004297 eye5chunk_bjd_raw.jdb
  10338 eye5chunk_bjd_zlib.jdb
   2206 eye5chunk_bjd_lzma.jdb

Qianqian
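For comparison with the sizes above, the .npy side of the same experiment can be checked directly, and memory mapping works out of the box. This assumes the eye5chunk.npy file produced by the snippet above.

import numpy as np

# np.save stacked the five (200, 1000) chunks into one (5, 200, 1000) array.
m = np.load('eye5chunk.npy', mmap_mode='r')
print(m.shape, m.dtype)    # (5, 200, 1000) float64
print(m.nbytes + 128)      # 8000128 bytes: raw payload plus the 128-byte header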
Can you give load times for these?
8000128 eye5chunk.npy
5004297 eye5chunk_bjd_raw.jdb
  10338 eye5chunk_bjd_zlib.jdb
   2206 eye5chunk_bjd_lzma.jdb
For my case, I'd be curious about the time to add one 1T-entries file to another.

Thanks,
Bill
-- Phobrain.com
On Thu, Aug 25, 2022 at 10:45 AM Qianqian Fang <fangqq@gmail.com> wrote:
I am curious what you and other developers think about adopting JSON/binary JSON as a similarly simple, reverse-engineering-able but universally parsable array exchange format instead of designing another numpy-specific binary format.
No one is really proposing another format, just a minor tweak to the existing NPY format. If you are proposing that numpy adopt BJData to underlie `np.save()`, we are not very likely to do that, for a number of reasons. However, if you are addressing the wider community to advertise your work, by all means!
I am interested in this topic (as well as thoughts among numpy developers) because I am currently working on a project - NeuroJSON ( https://neurojson.org) - funded by the US National Institute of Health. The goal of the NeuroJSON project is to create easy-to-adopt, easy-to-extend, and preferably human-readable data formats to help disseminate and exchange neuroimaging data (and scientific data in general).
Needless to say, numpy is a key toolkit that is widely used among neuroimaging data analysis pipelines. I've seen discussions of potentially adopting npy as a standardized way to share volumetric data (as ndarrays), such as in this thread
https://github.com/bids-standard/bids-specification/issues/197
however, several limitations were also discussed, for example
1. npy only support a single numpy array, does not support other metadata or other more complex data records (multiple arrays are only achieved via multiple files) 2. no internal (i.e. data-level) compression, only file-level compression 3. although the file is simple, it still requires a parser to read/write, and such parser is not widely available in other environments, making it mostly limited to exchange data among python programs 4. I am not entirely sure, but I suppose it does not support sparse matrices or special matrices (such as diagonal/band/symmetric etc) - I can be wrong though
In the NeuroJSON project, we primarily use JSON and binary JSON (specifically, UBJSON <https://ubjson.org/> derived BJData <https://json.nlohmann.me/features/binary_formats/bjdata/> format) as the underlying data exchange files. Through standardized data annotations <https://github.com/NeuroJSON/jdata/blob/master/JData_specification.md#data-annotation-keywords>, we are able to address most of the above limitations - the generated files are universally parsable in nearly all programming environments with existing parsers, support complex hierarchical data, compression, and can readily benefit from the large ecosystems of JSON (JSON-schema, JSONPath, JSON-LD, jq, numerous parsers, web-ready, NoSQL db ...).
I don't quite know what this means. My installed version of `jq`, for example, doesn't seem to know what to do with these files.

❯ jq --version
jq-1.6

❯ jq . eye5chunk_bjd_raw.jdb
parse error: Invalid numeric literal at line 1, column 38
I understand that simplicity is a key design spec here. I want to highlight UBJSON/BJData as a competitive alternative format. It is also designed with simplicity considered in the first place <https://ubjson.org/#why>, yet, it allows to store hierarchical strongly-typed complex binary data and is easily extensible.
A UBJSON/BJData parser may not necessarily longer than a npy parser, for example, the python reader of the full spec only takes about 500 lines of codes (including comments), similarly for a JS parser
https://github.com/NeuroJSON/pybj/blob/master/bjdata/decoder.py https://github.com/NeuroJSON/js-bjdata/blob/master/bjdata.js
We actually did a benchmark <https://github.com/neurolabusc/MeshFormatsJS> a few months back - the test workloads are two large 2D numerical arrays (node, face to store surface mesh data), we compared parsing speed of various formats in Python, MATLAB, and JS. The uncompressed BJData (BMSHraw) reported a loading speed that is nearly as fast as reading raw binary dump; and internally compressed BJData (BMSHz) gives the best balance between small file sizes and loading speed, see our results here
https://pbs.twimg.com/media/FRPEdLGWYAEJe80?format=png&name=large
I want to add two quick points to echo the features you desired in npy:
1. it is not common to use mmap in reading JSON/binary JSON files, but it is certainly possible. I recently wrote a JSON-mmap spec <https://github.com/NeuroJSON/jsonmmap/blob/main/JSON-Mmap_Specification.md> and a MATLAB reference implementation <https://github.com/NeuroJSON/jsonmmap/tree/main/lib>
I think a fundamental problem here is that it looks like each element in the array is delimited. I.e. a `float64` value starts with b'D' then the 8 IEEE-754 bytes representing the number. When we're talking about memory-mappability, we are talking about having the on-disk representation being exactly what it looks like in-memory, all of the IEEE-754 floats contiguous with each other, so we can use the `np.memmap` `ndarray` subclass to represent the on-disk data as a first-class array object. This spec lets us mmap the binary JSON file and manipulate its contents in-place efficiently, but that's not what is being asked for here.
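To make the contiguity point concrete, here is a minimal self-contained illustration; the file name is made up, while np.save, np.load with mmap_mode, and np.lib.format.open_memmap are the relevant existing numpy APIs.

import numpy as np

# A plain .npy file is a short header followed by the raw array bytes, so the
# on-disk payload is byte-for-byte the in-memory layout of the array.
a = np.arange(12.0).reshape(3, 4)
np.save('demo.npy', a)

# np.load with mmap_mode returns an np.memmap, an ndarray subclass backed by the
# file: no per-element markers to strip, no copying, no parsing of the payload.
m = np.load('demo.npy', mmap_mode='r')
print(type(m).__name__, m.shape, m[1, 2])   # memmap (3, 4) 6.0

# The same mapping via the lower-level helper:
m2 = np.lib.format.open_memmap('demo.npy', mode='r')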
2. UBJSON/BJData natively support append-able root-level records; JSON has been extensively used in data streaming with appendable nd-json or concatenated JSON (https://en.wikipedia.org/wiki/JSON_streaming)
just a quick comparison of output file sizes with a 1000x1000 unitary diagonal matrix
# python3 -m pip install jdata bjdata
import numpy as np
import jdata as jd
x = np.eye(1000)                 # create a large array
y = np.vsplit(x, 5)              # split into smaller chunks
np.save('eye5chunk.npy', y)      # save npy
jd.save(y, 'eye5chunk_bjd_raw.jdb')                           # save as uncompressed bjd
jd.save(y, 'eye5chunk_bjd_zlib.jdb', {'compression': 'zlib'}) # zlib-compressed bjd
jd.save(y, 'eye5chunk_bjd_lzma.jdb', {'compression': 'lzma'}) # lzma-compressed bjd
newy = jd.load('eye5chunk_bjd_zlib.jdb')  # loading/decoding
newx = np.concatenate(newy)               # regroup chunks
newx.dtype
here are the output file sizes in bytes:
8000128 eye5chunk.npy
5004297 eye5chunk_bjd_raw.jdb
Just a note: This difference is solely due to a special representation of `0` in 5 bytes rather than 8 (essentially, your encoder recognizes 0.0 as a special value and uses the `float32` encoding of it). If you had any other value making up the bulk of the file, this would be larger than the NPY due to the additional delimiter b'D'.
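Following that explanation, a quick back-of-the-envelope check of the two raw sizes (the remaining few kilobytes of the raw .jdb are container overhead, not broken down here):

n = 1000 * 1000            # number of float64 elements in the 1000x1000 array
npy = 128 + 8 * n          # .npy: 128-byte header + raw float64 payload = 8000128
bjd = 5 * n                # 1 marker byte + 4-byte float32 per zero value = 5000000
print(npy, bjd)            # 8000128 5000000; eye5chunk_bjd_raw.jdb is 5004297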
  10338 eye5chunk_bjd_zlib.jdb
   2206 eye5chunk_bjd_lzma.jdb
Qianqian
-- Robert Kern
On 8/25/22 12:25, Robert Kern wrote:
No one is really proposing another format, just a minor tweak to the existing NPY format.
Agreed. I was just following up on the previous comments about alternative formats (such as hdf5) and the pros/cons of npy.
I don't quite know what this means. My installed version of `jq`, for example, doesn't seem to know what to do with these files.
❯ jq --version
jq-1.6

❯ jq . eye5chunk_bjd_raw.jdb
parse error: Invalid numeric literal at line 1, column 38
The .jdb files are binary JSON files (specifically BJData) that jq does not currently support; to save as text-based JSON, you change the suffix to .json or .jdt - it results in a ~33% size increase compared to the binary form due to base64:

jd.save(y, 'eye5chunk_bjd_zlib.jdt', {'compression': 'zlib'})

13694 Aug 25 12:54 eye5chunk_bjd_zlib.jdt
10338 Aug 25 15:41 eye5chunk_bjd_zlib.jdb

jq . eye5chunk_bjd_zlib.jdt
[
  {
    "_ArrayType_": "double",
    "_ArraySize_": [200, 1000],
    "_ArrayZipType_": "zlib",
    "_ArrayZipSize_": [1, 200000],
    "_ArrayZipData_": "..."
  },
  ...
]
I think a fundamental problem here is that it looks like each element in the array is delimited. I.e. a `float64` value starts with b'D' then the 8 IEEE-754 bytes representing the number. When we're talking about memory-mappability, we are talking about having the on-disk representation being exactly what it looks like in-memory, all of the IEEE-754 floats contiguous with each other, so we can use the `np.memmap` `ndarray` subclass to represent the on-disk data as a first-class array object. This spec lets us mmap the binary JSON file and manipulate its contents in-place efficiently, but that's not what is being asked for here.
There are several BJData-compliant forms to store the same binary array losslessly. The most memory-efficient and disk-mmapable (but not necessarily disk-efficient) form is the ND-array container syntax (https://github.com/NeuroJSON/bjdata/blob/Draft_2/Binary_JData_Specification.md#optimized-n-dimensional-array-of-uniform-type) that the BJData spec extended over UBJSON. For example, a 100x200x300 3D float64 ($D) array can be stored as below (numbers are stored in binary form, whitespace should be removed):

[$D #[$u #U 3 100 200 300 value0 value1 ...

where the "value_i"s are the contiguous (row-major) binary stream of the float64 buffer without the per-element marker ('D'), because it is absorbed into the optimized header (https://ubjson.org/type-reference/container-types/#optimized-format) of the array "[" following the type marker "$". The data chunk is mmap-able, and if you want a pre-determined initial offset, you can force the dimension vector (#[$u #U 3 100 200 300) to use an integer type ($u) that is large enough, for example uint32 (m); then the starting offset of the binary stream is entirely predictable.

Multiple ND arrays can be appended directly at the root level, for example

[$D #[$u #U 3 100 200 300 value0 value1 ...
[$D #[$u #U 3 100 200 300 value0 value1 ...
[$D #[$u #U 3 100 200 300 value0 value1 ...
[$D #[$u #U 3 100 200 300 value0 value1 ...

can store 100x200x300 chunks of a 400x200x300 array.

Alternatively, one can use an annotated format (in JSON form: {"_ArrayType_":"double","_ArraySize_":[100,200,300],"_ArrayData_":[value1,value2,...]}) to store everything in one contiguous 1D buffer:

{ U11 _ArrayType_ S U6 double
  U11 _ArraySize_ [$u #U 3 100 200 300
  U11 _ArrayData_ [$D #m 6000000 value1 value2 ...
}

The contiguous buffer in the _ArrayData_ section is also disk-mmap-able; you can likewise add requirements on the array metadata to ensure a predictable initial offset, if desired. Similarly, these annotated chunks can be appended in either JSON or binary JSON form, and the parsers automatically handle both forms and convert them into the desired binary ND array with the expected type and dimensions.
here are the output file sizes in bytes:
8000128 eye5chunk.npy
5004297 eye5chunk_bjd_raw.jdb
Just a note: This difference is solely due to a special representation of `0` in 5 bytes rather than 8 (essentially, your encoder recognizes 0.0 as a special value and uses the `float32` encoding of it). If you had any other value making up the bulk of the file, this would be larger than the NPY due to the additional delimiter b'D'.
The two BJData forms that I mentioned above (ND-array syntax or annotated array) preserve the original precision/shape in round-trips. BJData follows the recommendations of the UBJSON spec and automatically reduces data size (https://ubjson.org/type-reference/value-types/#:~:text=smallest%20numeric%20type) only when there is no precision loss (such as for integers or zeros), and this behavior is optional.
  10338 eye5chunk_bjd_zlib.jdb
   2206 eye5chunk_bjd_lzma.jdb
Qianqian
-- Robert Kern
On Thu, Aug 25, 2022 at 3:47 PM Qianqian Fang <fangqq@gmail.com> wrote:
On 8/25/22 12:25, Robert Kern wrote:
I don't quite know what this means. My installed version of `jq`, for example, doesn't seem to know what to do with these files.
❯ jq --version
jq-1.6

❯ jq . eye5chunk_bjd_raw.jdb
parse error: Invalid numeric literal at line 1, column 38
the .jdb files are binary JSON files (specifically BJData) that jq does not currently support; to save as text-based JSON, you change the suffix to .json or .jdt - it results in ~33% increase compared to the binary due to base64
Okay. Given your wording, it looked like you were claiming that the binary JSON was supported by the whole ecosystem. Rather, it seems like you can either get binary encoding OR the ecosystem support, but not both at the same time.
I think a fundamental problem here is that it looks like each element in the array is delimited. I.e. a `float64` value starts with b'D' then the 8 IEEE-754 bytes representing the number. When we're talking about memory-mappability, we are talking about having the on-disk representation being exactly what it looks like in-memory, all of the IEEE-754 floats contiguous with each other, so we can use the `np.memmap` `ndarray` subclass to represent the on-disk data as a first-class array object. This spec lets us mmap the binary JSON file and manipulate its contents in-place efficiently, but that's not what is being asked for here.
there are several BJData-compliant forms to store the same binary array losslessly. The most memory efficient and disk-mmapable (but not necessarily disk-efficient) form is to use the ND-array container syntax <https://github.com/NeuroJSON/bjdata/blob/Draft_2/Binary_JData_Specification.md#optimized-n-dimensional-array-of-uniform-type> that BJData spec extended over UBJSON.
Are any of them supported by a Python BJData implementation? I didn't see any option to get that done in the `bjdata` package you recommended, for example.
https://github.com/NeuroJSON/pybj/blob/a46355a0b0df0bec1817b04368a5a57335864...

-- Robert Kern
participants (5)
- Bill Ross
- Matti Picus
- Michael Siebert
- Qianqian Fang
- Robert Kern