On 8/27/22 15:18, Stephan Hoyer wrote:
Hi Qianqian, 

I think you might be interested in the Zarr storage format, for exactly this same reason: https://zarr.dev/

Zarr is focused more on "big data" but one of its fundamental strengths is that the format is extremely simple. All the metadata is in JSON, with arrays divided up into smaller "chunks" stored as files on disk or in cloud object stores.


Hi Stephan,

Yes, I am aware of Zarr, and Zarr developers have also made appearances in various neuroimaging data storage discussions.

Zarr and typical binary JSON formats (msgpack/UBJSON) target different applications and tackle different types of challenges.

Zarr is Python-focused and geared toward large ND-array and parallel array processing. Conceptually, it strikes a good balance between the simplicity of JSON and the performance and hierarchical data support of HDF5, attracting HDF5 users as a simpler alternative. Both Zarr and HDF5 datasets are heavily (if not exclusively) oriented around ND-arrays.

JSON and binary JSON came from the other end of the data spectrum, where heterogeneous, lightweight (scalars or short vectors), hierarchical data, such as metadata or web app data packets, has been the primary focus. They are also language- and platform-neutral (like HDF5). Although JSON supports nested arrays, it does not care about the regularity of the dimensions (i.e., whether the nested array forms an ND array).
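
As a minimal illustration of that last point (Python standard library only; the helper name regular_shape is made up for this sketch), a JSON parser accepts a ragged nested array just as readily as a rectangular one, and it is left to the application to check whether the result is a true ND array:

```python
import json

def regular_shape(obj):
    """Return the shape tuple if obj is a rectangular nested list, else None."""
    if not isinstance(obj, list):
        return ()  # a scalar leaf has an empty shape
    shapes = [regular_shape(item) for item in obj]
    # Ragged if any child is ragged or the children disagree on their shape
    if any(s is None for s in shapes) or len(set(shapes)) > 1:
        return None
    return (len(obj),) + (shapes[0] if shapes else ())

# Both strings are valid JSON; only the first one is a regular 2x3 array
regular = json.loads("[[1, 2, 3], [4, 5, 6]]")
ragged = json.loads("[[1, 2, 3], [4, 5]]")

print(regular_shape(regular))
print(regular_shape(ragged))
```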

So, previously, the two types of formats had little in common in terms of targeted data types and applications. But clearly, if you really want, either of them is syntactically capable of representing data from the other side of the spectrum (it is just a matter of efficiency).

I want to mention that ND numerical arrays and lightweight heterogeneous data do not cover everything that scientific data storage/exchange needs. An area that both miss includes other common data structures such as tables, graphs, and lists. CSV/TSV files or databases often fill the table storage need, but they introduce an additional format to handle in the pipeline.

I drew a Venn diagram (see the attachment) to illustrate the scopes and strengths of the various formats.

Zarr or HDF5 developers are absolutely entitled (and welcome) to "invade" the other side of the data type spectrum. However, I decided to go the other way around, i.e., to extend JSON and binary JSON via the JData spec so that they can store strongly-typed binary data, ND-arrays, and even middle-ground data types such as tables and graphs. That decision was largely based on taking advantage of the existing JSON ecosystem.

Regardless of whether Zarr uses standard JSON or something else to store metadata, one still needs to write a Zarr parser (in each needed programming language) to be able to read/write it. No existing parser can automatically open it or knows how to handle it. In comparison, the data type extensions that the JData spec makes are purely in the semantic layer and do not alter the serialization syntax (the UBJSON-to-BJData upgrade was an exception: UBJSON does not support ND arrays, which made it necessary). Therefore, .jdt or .jdb files with JData annotations are backward (and forward) compatible with all existing JSON or BJData parsers. So these files are not only directly readable in an editor, they are also readily parsable without a specialized reader. The closest counterpart I can find in the Python world is JSON tricks (https://github.com/mverleg/pyjson_tricks), but again, JSON tricks is Python-focused, whereas the JData spec focuses on language-independent data exchange (say, between Python and MATLAB or C; it started in MATLAB when I wrote JSONLab).
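
To make the "semantic layer only" point concrete, here is a minimal sketch using only Python's built-in json module. It assumes the JData array annotation keywords _ArrayType_, _ArraySize_ and _ArrayData_, with _ArrayData_ holding the elements in row-major order; the field names "name" and "data" and the helper unflatten are made up for illustration and are not part of any library:

```python
import json

# An annotated ND array is still an ordinary JSON object, so a stock
# JSON parser loads it without any format-specific reader.
text = """
{
  "name": "demo_volume",
  "data": {
    "_ArrayType_": "uint8",
    "_ArraySize_": [2, 3],
    "_ArrayData_": [1, 2, 3, 4, 5, 6]
  }
}
"""

doc = json.loads(text)

def unflatten(flat, dims):
    """Rebuild nested lists from a flat row-major list (hypothetical helper)."""
    if len(dims) == 1:
        return list(flat)
    step = len(flat) // dims[0]  # number of elements per slice along the first axis
    return [unflatten(flat[i * step:(i + 1) * step], dims[1:])
            for i in range(dims[0])]

# Interpreting the annotation is an optional second step on top of plain parsing
arr = doc["data"]
nested = unflatten(arr["_ArrayData_"], arr["_ArraySize_"])
print(arr["_ArrayType_"], nested)
```

A JData-aware reader would turn the annotated object into a native typed array; a reader that knows nothing about JData still sees a valid JSON object with three fields.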

Zarr and HDF5 will likely keep their edge in high-performance binary array data storage/access. However, the data I was tasked to find ways to encode/share/integrate are extremely heterogeneous, containing mixtures of volumetric data (ND arrays as in MRI scans), tables (.csv, .tsv), and metadata (.json) organized in a file/folder tree, as currently standardized by the BIDS project (https://bids.neuroimaging.io/); see example datasets here:

https://github.com/bids-standard/bids-examples

In such a case, I found that JData-annotated JSON/binary JSON in combination with NoSQL databases (MongoDB/CouchDB) offers the most intuitive and scalable approach, and requires the least amount of work, both to store such data locally as human-readable files and to search it in the cloud as document-based databases.
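
A rough sketch of the idea, using only the Python standard library: each record below mixes metadata, a small table, and an annotated array in one JSON document. All field names and values are invented for illustration (they are not taken from a real BIDS dataset), and the find function is only a stand-in for the kind of field-based query a document database such as MongoDB or CouchDB would run on the server side:

```python
import json

# Hypothetical records: metadata, a small table, and an annotated array
# live side by side in one document per subject.
records = [
    {
        "subject": "sub-01",
        "metadata": {"RepetitionTime": 2.0},
        "table": [{"age": 34, "sex": "F"}],
        "scan": {"_ArrayType_": "uint8", "_ArraySize_": [2, 2],
                 "_ArrayData_": [0, 1, 2, 3]},
    },
    {
        "subject": "sub-02",
        "metadata": {"RepetitionTime": 1.5},
        "table": [{"age": 41, "sex": "M"}],
        "scan": {"_ArrayType_": "uint8", "_ArraySize_": [2, 2],
                 "_ArrayData_": [4, 5, 6, 7]},
    },
]

def find(docs, path, value):
    """Return the documents whose dotted field path equals value."""
    def get(doc, keys):
        for key in keys:
            if not isinstance(doc, dict) or key not in doc:
                return None
            doc = doc[key]
        return doc
    return [d for d in docs if get(d, path.split(".")) == value]

# The same documents serve as a human-readable local file ...
text = json.dumps(records, indent=2)
# ... and as searchable records once loaded back
hits = find(json.loads(text), "metadata.RepetitionTime", 2.0)
print([h["subject"] for h in hits])
```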

Qianqian



Cheers,
Stephan



_______________________________________________
NumPy-Discussion mailing list -- numpy-discussion@python.org