To avoid derailing the other thread on extending .npy files, I am starting a new thread on alternative array-storage file formats based on binary JSON, in case there is a need and interest among numpy users.

Specifically, I first want to follow up on Bill's question below regarding loading time.


On 8/25/22 11:02, Bill Ross wrote:

Can you give load times for these?


As I mentioned in my earlier reply to Robert, the most memory-efficient (i.e. fast-loading, disk-mmap-able) but not necessarily disk-efficient (i.e. potentially the largest file size) way to store an ND array in BJData is to use BJData's ND-array container.

I have to admit that neither the jdata nor the bjdata module has been extensively optimized for speed. With the current implementation, here are the loading times for a larger diagonal matrix (eye(10000)).

A BJData file storing a single eye(10000) array using the ND-array container can be downloaded from here (file size: 1 MB zipped; ~800 MB decompressed, the same as the npy file). This file was generated with the MATLAB encoder, but it can be loaded from Python (see below, Re Robert).

800000128 eye1e4.npy
800000014 eye1e4_bjd_raw_ndsyntax.jdb
   813721 eye1e4_bjd_zlib.jdb
   113067 eye1e4_bjd_lzma.jdb

The loading time (from an NVMe drive, Ubuntu 18.04, Python 3.6.9, numpy 1.19.5) for each file is listed below; a sketch of the timing script follows the list.

0.179s  eye1e4.npy (mmap_mode=None)
0.001s  eye1e4.npy (mmap_mode=r)
0.718s  eye1e4_bjd_raw_ndsyntax.jdb
1.474s  eye1e4_bjd_zlib.jdb
0.635s  eye1e4_bjd_lzma.jdb
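The timings above were measured with something along these lines. This is a reconstruction rather than the original script, and it assumes the jdata module's load() function handles both the raw ND-array file and the compressed annotated files:

```python
import time
import numpy as np
import jdata as jd     # annotation-aware wrapper; uses bjdata for .jdb files

def timed(label, fn):
    # run fn once and print the wall-clock time in the same format as above
    t0 = time.time()
    out = fn()
    print('%.3fs  %s' % (time.time() - t0, label))
    return out

timed('eye1e4.npy (mmap_mode=None)', lambda: np.load('eye1e4.npy'))
timed('eye1e4.npy (mmap_mode=r)',    lambda: np.load('eye1e4.npy', mmap_mode='r'))
timed('eye1e4_bjd_raw_ndsyntax.jdb', lambda: jd.load('eye1e4_bjd_raw_ndsyntax.jdb'))
timed('eye1e4_bjd_zlib.jdb',         lambda: jd.load('eye1e4_bjd_zlib.jdb'))
timed('eye1e4_bjd_lzma.jdb',         lambda: jd.load('eye1e4_bjd_lzma.jdb'))
```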


Clearly, and unsurprisingly, mmapped loading is the fastest option. It is true that the raw BJData file is about 4x slower to load than the npy file (0.718 s vs. 0.179 s), but given that the main chunk of the data is stored identically in both (as a contiguous buffer), I suspect that with some optimization of the decoder the gap between the two could be substantially narrowed. The longer loading times for zlib/lzma (and, similarly, the longer saving times) reflect a trade-off between smaller file sizes and the cost of compression/decompression/disk IO.

By no means am I saying that binary JSON formats, in their current non-optimized implementations, are ready to deliver better speed. I just want to bring attention to this class of formats and highlight that their flexibility provides abundant mechanisms for fast, disk-mapped IO, while allowing additional benefits such as compression, unlimited metadata for future extensions, etc.


> 8000128  eye5chunk.npy
> 5004297  eye5chunk_bjd_raw.jdb
>   10338  eye5chunk_bjd_zlib.jdb
>    2206  eye5chunk_bjd_lzma.jdb

For my case, I'd be curious about the time to add one 1T-entries file to another.


As I mentioned in the previous reply, BJData is appendable, so you can simply append another array (or a slice) to the end of the file; a minimal sketch is shown below.
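Here is a rough sketch of what appending could look like with the bjdata module, assuming its py-ubjson-style dump()/load() API and that multiple top-level records can be decoded sequentially with repeated load() calls:

```python
import numpy as np
import bjdata   # BJData encoder/decoder (NeuroJSON pybj)

# append a second record to an existing BJData file: BJData records are
# self-delimiting, so new objects can simply be written at the end of the file
extra = np.arange(10, dtype=np.float64)
with open('eye1e4_bjd_raw_ndsyntax.jdb', 'ab') as fp:
    bjdata.dump(extra.tolist(), fp)   # plain list, since the encoder has no ND-array support yet

# read the records back one by one from the same stream
with open('eye1e4_bjd_raw_ndsyntax.jdb', 'rb') as fp:
    first = bjdata.load(fp)     # the original eye(10000) array
    second = bjdata.load(fp)    # the appended record
```

In the 1T-entry scenario, appending one such file to another then amounts to a byte-level concatenation, so the cost should be dominated by disk IO rather than by re-encoding.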


 



Also related, Re @Robert's question below:

Are any of them supported by a Python BJData implementation? I didn't see any option to get that done in the `bjdata` package you recommended, for example.
https://github.com/NeuroJSON/pybj/blob/a46355a0b0df0bec1817b04368a5a573358645ef/bjdata/encoder.py#L200

The bjdata module currently only supports the ND-array container in the decoder (i.e. it maps such a buffer to a numpy.ndarray); it should be relatively trivial to add the same support to the encoder.
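For example, reading the raw ND-array benchmark file would presumably look like this; a sketch assuming the bjdata module's py-ubjson-style load() API and that the file's top-level record is the ND-array container itself:

```python
import bjdata

# decode a BJData file whose top-level record uses the ND-array container;
# the decoder maps the contiguous binary buffer to a numpy.ndarray
with open('eye1e4_bjd_raw_ndsyntax.jdb', 'rb') as fp:
    arr = bjdata.load(fp)

print(type(arr), arr.shape, arr.dtype)   # expecting a (10000, 10000) float64 ndarray
```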

The annotated format, on the other hand, is currently supported: one can call the jdata module (responsible for annotation-level encoding/decoding), as shown in my sample code, which in turn calls bjdata internally for the data serialization.
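Something along these lines, based on my earlier sample code; the option names ('compression', 'base64') follow the jdata documentation as I understand it, so treat this as a sketch rather than the exact invocation:

```python
import numpy as np
import jdata as jd    # annotation-level (JData) encoding/decoding
import bjdata         # binary JSON (BJData) serialization

x = np.eye(10000)

# step 1: annotate the ndarray as a plain dict of JData constructs
# (zlib-compress the payload, keep it as raw bytes rather than base64 text)
ann = jd.encode(x, {'compression': 'zlib', 'base64': False})

# step 2: serialize the annotated dict to a BJData file
with open('eye1e4_bjd_zlib.jdb', 'wb') as fp:
    bjdata.dump(ann, fp)

# loading reverses the two steps: BJData decoding, then annotation decoding
with open('eye1e4_bjd_zlib.jdb', 'rb') as fp:
    y = jd.decode(bjdata.load(fp))

assert isinstance(y, np.ndarray) and y.shape == (10000, 10000)
```

If I recall correctly, the jd.save()/jd.load() convenience wrappers combine these two steps when given a .jdb filename.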


Okay. Given your wording, it looked like you were claiming that the binary JSON was supported by the whole ecosystem. Rather, it seems like you can either get binary encoding OR the ecosystem support, but not both at the same time.

All in relative terms, of course: JSON has ~100 parsers listed on its website, MessagePack (another flavor of binary JSON) lists ~50-60 parsers, and UBJSON lists ~20. I am not familiar with npy parsers outside numpy, but googling returns only a few.

Also, most binary JSON implementations provide tools to convert to JSON and back, so, in that sense, whatever JSON has in its ecosystem can potentially be used with binary JSON files if one wants to (a small sketch follows the link below). There are also recent publications comparing the differences between various binary JSON formats, in case anyone is interested:

https://github.com/ubjson/universal-binary-json/issues/115
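
As a tiny illustration of that conversion path, assuming the bjdata module's dumpb()/loadb() byte-level helpers (mirroring py-ubjson):

```python
import json
import bjdata

# any JSON-representable object can round-trip between text JSON and BJData
meta = {'name': 'eye1e4', 'ndim': 2, 'shape': [10000, 10000], 'dtype': 'float64'}

blob = bjdata.dumpb(meta)            # binary JSON (BJData) bytes
back = bjdata.loadb(blob)            # ... decoded back to plain Python objects
text = json.dumps(back, indent=2)    # ... which the regular JSON ecosystem can consume

assert json.loads(text) == meta
```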