Mailman 3 Exporting numpy arrays to binary JSON (BJData) for better portability - NumPy-Discussion

Aug. 25, 2022

      To avoid derailing the other thread 
<https://mail.python.org/archives/list/numpy-discussion@python.org/thread/A4C...> 
on extending .npy files, I am going to start a new thread on alternative 
array storage file formats using binary JSON - in case there is such a 
need and interest among numpy users

specifically, i want to first follow up with Bill's question below 
regarding loading time

On 8/25/22 11:02, Bill Ross wrote:
...
|Can you give load times for these?|
as I mentioned in the earlier reply to Robert, the most memory-efficient 
(i.e. fast loading, disk-mmap-able) but not necessarily disk-efficient 
(i.e. may result in the largest data file sizes) BJData construct to 
store an ND array using BJData's ND-array container.

I have to admit that both jdata and bjdata modules have not been 
extensively optimized for speed. with the current implementation, here 
are the loading time for a larger diagonal matrix (eye(10000))

a BJData file storing a single eye(10000) array using the ND-array 
container can be downloaded from here 
<http://neurojson.org/wiki/upload/eye1e4_bjd_raw_ndsyntax.jdb.zip>(file 
size: 1MB with zip, if decompressed, it is ~800MB, as the npy file) - 
this file was generated from a matlab encoder, but can be loaded using 
Python (see below Re Robert).

|800000128 eye1e4.npy||
||800000014 eye1e4_bjd_raw_ndsyntax.jdb||
||   813721 eye1e4_bjd_zlib.jdb||
||   113067 eye1e4_bjd_lzma.jdb|

the loading time (from an nvme drive, Ubuntu 18.04, python 3.6.9, numpy 
1.19.5) for each file is listed below:

|0.179s  eye1e4.npy (mmap_mode=None)||
||0.001s  eye1e4.npy (mmap_mode=r)||
||0.718s  eye1e4_bjd_raw_ndsyntax.jdb||
||1.474s  eye1e4_bjd_zlib.jdb||
||0.635s  eye1e4_bjd_lzma.jdb|

clearly, mmapped loading is the fastest option without a surprise; it is 
true that the raw bjdata file is about 5x slower than npy loading, but 
given the main chunk of the data are stored identically (as contiguous 
buffer), I suppose with some optimization of the decoder, the gap 
between the two can be substantially shortened. The longer loading time 
of zlib/lzma (and similarly saving times) reflects a trade-off between 
smaller file sizes and time for compression/decompression/disk-IO.

by no means I am saying the binary JSON format is readily available to 
deliver better speed with its current non-optimized implementation. I 
just want to bright the attention to this class of formats, and 
highlight that it's flexibility gives abundant mechanisms to create 
fast, disk-mapped IO, while allowing additional benefits such as 
compression, unlimited metadata for future extensions etc.
...
|> 8000128  eye5chunk.npy||
||> 5004297  eye5chunk_bjd_raw.jdb||
||>   10338  eye5chunk_bjd_zlib.jdb||
||>    2206  eye5chunk_bjd_lzma.jdb|
For my case, I'd be curious about the time to add one 1T-entries file 
to another.
as I mentioned in the previous reply, bjdata is appendable 
<https://github.com/NeuroJSON/bjdata/blob/master/images/BJData_Diagram.pdf>, 
so you can simply append another array (or a slice) to the end of the file.
...
Thanks,
Bill
also related, Re @Robert's question below
...
Are any of them supported by a Python BJData implementation? I didn't 
see any option to get that done in the `bjdata` package you 
recommended, for example.
https://github.com/NeuroJSON/pybj/blob/a46355a0b0df0bec1817b04368a5a57335864...
the bjdata module currently only support nd-array in the decoder 
<https://github.com/NeuroJSON/pybj/blob/a46355a0b0df0bec1817b04368a5a57335864...> 
(i.e. map such buffer to a numpy.ndarray) - should be relatively trivial 
to add it to the encoder though.

on the other side, the annotated format is currently supported. one can 
call jdata module (responsible for annotation-level encoding/decoding) 
as shown in my sample code, then call bjdata internally for data 
serialization.
...
Okay. Given your wording, it looked like you were claiming that the 
binary JSON was supported by the whole ecosystem. Rather, it seems 
like you can either get binary encoding OR the ecosystem support, but 
not both at the same time.
all in relative terms of course - json has ~100 listed parsers on it's 
website <https://www.json.org/json-en.html>, MessagePack - another 
flavor of binary JSON - listed <https://msgpack.org/index.html> ~50/60 
parsers, and UBJSON listed <https://ubjson.org/libraries/> ~20 parsers. 
I am not familiar with npy parsers, but googling it returns only a few.

also, most binary JSON implementations provided tools to convert to JSON 
and back, so, in that sense, whatever JSON has in its ecosystem can be 
"potentially" used for binary JSON files if one wants to. There are also 
recent publications comparing differences between various binary JSON 
formats in case anyone is interested

https://github.com/ubjson/universal-binary-json/issues/115

Exporting numpy arrays to binary JSON (BJData) for better portability

Qianqian Fang

Bill Ross

Neal Becker

Qianqian Fang

Francesc Alted

Qianqian Fang

Stephan Hoyer

Qianqian Fang

Francesc Alted

Qianqian Fang

Francesc Alted

tags

participants (5)