
On Thu, Aug 25, 2022 at 10:45 AM Qianqian Fang <fangqq@gmail.com> wrote:
I am curious what you and other developers think about adopting JSON/binary JSON as a similarly simple, reverse-engineering-able but universally parsable array exchange format instead of designing another numpy-specific binary format.
No one is really proposing another format, just a minor tweak to the existing NPY format. If you are proposing that numpy adopt BJData to underlie `np.save()`, we are not very likely to do so, for a number of reasons. However, if you are addressing the wider community to advertise your work, by all means!
I am interested in this topic (as well as thoughts among numpy developers) because I am currently working on a project - NeuroJSON (https://neurojson.org) - funded by the US National Institutes of Health. The goal of the NeuroJSON project is to create easy-to-adopt, easy-to-extend, and preferably human-readable data formats to help disseminate and exchange neuroimaging data (and scientific data in general).
Needless to say, numpy is a key toolkit that is widely used among neuroimaging data analysis pipelines. I've seen discussions of potentially adopting npy as a standardized way to share volumetric data (as ndarrays), such as in this thread
https://github.com/bids-standard/bids-specification/issues/197
however, several limitations were also discussed, for example
1. npy only supports a single numpy array and does not support additional metadata or more complex data records (multiple arrays are only achieved via multiple files)
2. no internal (i.e. data-level) compression, only file-level compression
3. although the file is simple, it still requires a parser to read/write, and such parsers are not widely available in other environments, making it mostly limited to exchanging data among python programs (see the reader sketch after this list)
4. I am not entirely sure, but I suppose it does not support sparse matrices or special matrices (such as diagonal/band/symmetric etc) - I can be wrong though
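To make point 3 concrete, here is a minimal sketch of what an npy reader involves (assuming a version 1.0 header and an ordinary dtype; not a robust or complete parser):

# minimal .npy reader sketch: v1.0 headers and plain dtypes only
import ast
import numpy as np

def read_npy_minimal(path):
    with open(path, 'rb') as f:
        assert f.read(6) == b'\x93NUMPY'            # magic string
        major, minor = f.read(1)[0], f.read(1)[0]   # format version
        assert (major, minor) == (1, 0)             # v1.0 uses a 2-byte header length
        hlen = int.from_bytes(f.read(2), 'little')
        header = ast.literal_eval(f.read(hlen).decode('latin1'))
        arr = np.frombuffer(f.read(), dtype=np.dtype(header['descr']))
        order = 'F' if header['fortran_order'] else 'C'
        return arr.reshape(header['shape'], order=order)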
In the NeuroJSON project, we primarily use JSON and binary JSON (specifically, UBJSON <https://ubjson.org/> derived BJData <https://json.nlohmann.me/features/binary_formats/bjdata/> format) as the underlying data exchange files. Through standardized data annotations <https://github.com/NeuroJSON/jdata/blob/master/JData_specification.md#data-a...>, we are able to address most of the above limitations - the generated files are universally parsable in nearly all programming environments with existing parsers, support complex hierarchical data, compression, and can readily benefit from the large ecosystems of JSON (JSON-schema, JSONPath, JSON-LD, jq, numerous parsers, web-ready, NoSQL db ...).
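For illustration, a hierarchical record mixing metadata and an ndarray can go into a single annotated file using the same jdata calls shown in the example further below (the field names here are made up, and the nested-record handling is assumed from the JData annotation scheme rather than demonstrated in this message):

# sketch: one annotated file holding metadata plus an ndarray
# (field names are hypothetical; assumes jdata encodes nested dicts of arrays)
import numpy as np
import jdata as jd

record = {
    'SubjectID': 'sub-01',
    'Resolution_mm': [1.0, 1.0, 1.0],
    'Volume': np.random.rand(16, 16, 16),     # array nested inside the record
}
jd.save(record, 'record.jdb', {'compression': 'zlib'})   # single compressed file
back = jd.load('record.jdb')                             # round-trip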
I don't quite know what this means. My installed version of `jq`, for example, doesn't seem to know what to do with these files.

❯ jq --version
jq-1.6
❯ jq . eye5chunk_bjd_raw.jdb
parse error: Invalid numeric literal at line 1, column 38
I understand that simplicity is a key design spec here. I want to highlight UBJSON/BJData as a competitive alternative format. It was also designed with simplicity as a primary goal <https://ubjson.org/#why>, yet it allows storing hierarchical, strongly-typed, complex binary data and is easily extensible.
A UBJSON/BJData parser is not necessarily any longer than an npy parser; for example, the Python reader for the full spec takes only about 500 lines of code (including comments), and similarly for a JS parser
https://github.com/NeuroJSON/pybj/blob/master/bjdata/decoder.py https://github.com/NeuroJSON/js-bjdata/blob/master/bjdata.js
We actually did a benchmark <https://github.com/neurolabusc/MeshFormatsJS> a few months back - the test workloads are two large 2D numerical arrays (node and face arrays storing surface mesh data), and we compared the parsing speed of various formats in Python, MATLAB, and JS. The uncompressed BJData (BMSHraw) loaded nearly as fast as reading a raw binary dump, and the internally compressed BJData (BMSHz) gives the best balance between small file size and loading speed; see our results here
https://pbs.twimg.com/media/FRPEdLGWYAEJe80?format=png&name=large
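(For anyone who wants a rough local check rather than the published benchmark, a simple timing loop over the files created in the example further below could look like this sketch; it is not the benchmark code linked above.)

# rough timing sketch, not the linked benchmark; file names come from the
# eye5chunk example later in this message
import time
import numpy as np
import jdata as jd

for loader, fname in [(np.load, 'eye5chunk.npy'),
                      (jd.load, 'eye5chunk_bjd_raw.jdb'),
                      (jd.load, 'eye5chunk_bjd_zlib.jdb')]:
    t0 = time.perf_counter()
    data = loader(fname)
    print(f'{fname}: {time.perf_counter() - t0:.4f} s')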
I want to add two quick points to echo the features you desired in npy:
1. it is not common to use mmap in reading JSON/binary JSON files, but it is certainly possible. I recently wrote a JSON-mmap spec <https://github.com/NeuroJSON/jsonmmap/blob/main/JSON-Mmap_Specification.md> and a MATLAB reference implementation <https://github.com/NeuroJSON/jsonmmap/tree/main/lib>
I think a fundamental problem here is that it looks like each element in the array is delimited. I.e., a `float64` value starts with b'D' followed by the 8 IEEE-754 bytes representing the number. When we're talking about memory-mappability, we are talking about having the on-disk representation be exactly what it looks like in memory, all of the IEEE-754 floats contiguous with each other, so we can use the `np.memmap` `ndarray` subclass to represent the on-disk data as a first-class array object. This spec lets us mmap the binary JSON file and manipulate its contents in-place efficiently, but that's not what is being asked for here.
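For concreteness, this is the kind of access NPY's contiguous layout already gives us (a sketch using the eye5chunk.npy file from the example below):

# memory-mapped access: the contiguous on-disk float64 block is exposed as an
# np.memmap-backed ndarray, so only the pages actually touched are read
import numpy as np

arr = np.load('eye5chunk.npy', mmap_mode='r')   # returns an np.memmap subclass
print(type(arr), arr.shape, arr.dtype)
print(arr[2, :3, :3])                           # reads only a small slice from disk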
2. UBJSON/BJData natively support append-able root-level records; JSON has been extensively used in data streaming with appendable nd-json or concatenated JSON (https://en.wikipedia.org/wiki/JSON_streaming)
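A minimal sketch of the concatenated/newline-delimited JSON pattern mentioned above (standard-library JSON only, not BJData-specific):

# appendable newline-delimited JSON: each record is appended as one line and
# the file can be re-read (or tailed) record by record
import json

with open('stream.ndjson', 'a') as f:        # append-only writes
    f.write(json.dumps({'frame': 1, 'mean': 0.12}) + '\n')
    f.write(json.dumps({'frame': 2, 'mean': 0.15}) + '\n')

with open('stream.ndjson') as f:
    records = [json.loads(line) for line in f]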
just a quick comparison of output file sizes for a 1000x1000 identity matrix
# python3 -m pip install jdata bjdata
import numpy as np
import jdata as jd

x = np.eye(1000);                 # create a large array
y = np.vsplit(x, 5);              # split into smaller chunks
np.save('eye5chunk.npy', y);      # save npy
jd.save(y, 'eye5chunk_bjd_raw.jdb');                            # save as uncompressed bjd
jd.save(y, 'eye5chunk_bjd_zlib.jdb', {'compression': 'zlib'});  # zlib-compressed bjd
jd.save(y, 'eye5chunk_bjd_lzma.jdb', {'compression': 'lzma'});  # lzma-compressed bjd
newy = jd.load('eye5chunk_bjd_zlib.jdb');  # loading/decoding
newx = np.concatenate(newy);               # regroup chunks
newx.dtype
here are the output file sizes in bytes:
8000128  eye5chunk.npy
5004297  eye5chunk_bjd_raw.jdb
Just a note: This difference is solely due to a special representation of `0` in 5 bytes rather than 8 (essentially, your encoder recognizes 0.0 as a special value and uses the `float32` encoding of it). If you had any other value making up the bulk of the file, this would be larger than the NPY due to the additional delimiter b'D'.
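A quick back-of-the-envelope check of those two numbers (the split of the leftover ~4 KB of BJData bytes is my assumption, not something I have verified byte-by-byte):

# rough size accounting for the two uncompressed files above
n = 1000 * 1000
print(n * 8 + 128)   # NPY: 1e6 float64 values + 128-byte header -> 8000128
print(n * 5)         # BJData: ~1e6 values at 1-byte marker + 4 bytes each -> 5000000
# the remaining ~4297 bytes are container structure/annotations (assumed)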
  10338  eye5chunk_bjd_zlib.jdb
   2206  eye5chunk_bjd_lzma.jdb
Qianqian
-- Robert Kern