[Numpy-discussion] Re: Exporting numpy arrays to binary JSON (BJData) for better portability

Sept. 1, 2022

      On 8/30/22 06:29, Francesc Alted wrote:
...
Not exactly.  What we've done is to encode the header and the trailer
(i.e. where the metadata is) of the frame with msgpack.  The chunks
section
<https://github.com/Blosc/c-blosc2/blob/main/README_CFRAME_FORMAT.rst#chunks>
is where the actual data is; this section does not follow a msgpack
structure as such, but is rather a sequence of data chunks and an
index (for quickly locating the chunks).  You can easily access the
header or trailer sections by reading from the start or the end of the
frame.  This way you don't need to update the indexes of chunks in
msgpack, which can be expensive during data updates.
This indeed prevents data from being dumped with typical msgpack tools,
but our sense is that users should care mostly about metainfo, and let
the libraries deal with the actual data in the most efficient way.
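
To make the layout described above concrete, here is a minimal Python sketch of a frame whose header and trailer hold the metadata while the chunks section stays opaque. It is an illustration only: it uses length-prefixed JSON from the standard library as a stand-in for msgpack, and the byte layout, field names and helper functions are invented for this example rather than taken from the Blosc2 frame specification.

```python
import json
import struct
import zlib


def _pack_meta(meta):
    """Encode a metadata dict as JSON bytes (stand-in for msgpack)."""
    return json.dumps(meta).encode("utf-8")


def build_frame(header, chunks, trailer):
    """Toy layout: [u32 len][header][chunks][index][trailer][u32 len]."""
    head = _pack_meta(header)
    body = b""
    offsets = []
    for chunk in chunks:
        offsets.append(len(body))      # where this chunk starts in the body
        body += zlib.compress(chunk)
    # The index of chunk offsets lives outside the metadata sections,
    # so updating chunks never requires re-encoding header or trailer.
    index = struct.pack(f"<{len(offsets)}Q", *offsets)
    tail = _pack_meta(dict(trailer, nchunks=len(offsets)))
    return (struct.pack("<I", len(head)) + head + body + index
            + tail + struct.pack("<I", len(tail)))


def read_header(frame):
    """Read the metadata at the start without touching the chunks."""
    (n,) = struct.unpack_from("<I", frame, 0)
    return json.loads(frame[4:4 + n])


def read_trailer(frame):
    """Read the metadata at the end without touching the chunks."""
    (n,) = struct.unpack_from("<I", frame, len(frame) - 4)
    return json.loads(frame[len(frame) - 4 - n:len(frame) - 4])


frame = build_frame({"dtype": "float64", "shape": [2, 3]},
                    [b"a" * 100, b"b" * 100],
                    {"note": "demo"})
print(read_header(frame))
print(read_trailer(frame))
```

Both readers need only the first or last few bytes of the frame, which is the property the paragraph above relies on.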
Thanks for your detailed reply. I spent the past few days reading the
links/documentation, as well as experimenting with the blosc2
meta-compressors, and I was quite impressed by the performance of
blosc2. I was also happy to see a close alignment between the
motivations behind Caterva and those of NeuroJSON.

I have a few quick updates

1. I added blosc2 as a codec in my jdata module, as an alternative 
compressor to zlib/lzma/lz4

https://github.com/NeuroJSON/pyjdata/commit/ce25fa53ce73bf4cbe2cff9799b5a616...

2. as I mentioned, jdata/bjdata were not optimized for speed; they
contained many inefficiencies in handling numpy arrays (as I discovered).
After some profiling, I was able to remove most of those, and the run-time
is now nearly entirely spent in compression/decompression (see attached
profiler outputs for the `zlib` compressor benchmark)

3. the new jdata that supports blosc2, v0.5.0, has been tagged and 
uploaded (https://pypi.org/project/jdata)

4. I wrote a script and compared the run times of various codecs (using
BJData and JSON as containers); the code can be found here

https://github.com/NeuroJSON/pyjdata/blob/master/test/benchcodecs.py

the save/load times tested on a Ryzen 9 3950X/Ubuntu 18.04 box (at
various thread counts) are listed below (similar to what you posted before)

- Testing npy/npz
  'npy',           'save' 0.2914195 'load' 0.1963226 'size' 800000128
  'npz',           'save' 2.8617918 'load' 1.9550347 'size' 813846

- Testing text-based JSON files (.jdt) (nthread=8)...
  'zlib',          'save' 2.5132861 'load' 1.7221164 'size' 1084942
  'lzma',          'save' 9.5481696 'load' 0.3865211 'size' 150738
  'lz4',           'save' 0.3467197 'load' 0.5019965 'size' 4495297
  'blosc2blosclz', 'save' 0.0165646 'load' 0.1143934 'size' 1092747
  'blosc2lz4',     'save' 0.0175058 'load' 0.1015181 'size' 1090159
  'blosc2lz4hc',   'save' 0.2102167 'load' 0.1053235 'size' 4315421
  'blosc2zlib',    'save' 0.1002635 'load' 0.1188845 'size' 1270252
  'blosc2zstd',    'save' 0.0463817 'load' 0.1017909 'size' 253176

- Testing binary JSON (BJData) files (.jdb) (nthread=8)...
  'zlib',          'save' 2.4401443 'load' 1.6316463 'size' 813721
  'lzma',          'save' 9.3782029 'load' 0.3728334 'size' 113067
  'lz4',           'save' 0.3389360 'load' 0.5017435 'size' 3371487
  'blosc2blosclz', 'save' 0.0173912 'load' 0.1042985 'size' 819576
  'blosc2lz4',     'save' 0.0133688 'load' 0.1030941 'size' 817635
  'blosc2lz4hc',   'save' 0.1968047 'load' 0.0950071 'size' 3236580
  'blosc2zlib',    'save' 0.1023218 'load' 0.1083922 'size' 952705
  'blosc2zstd',    'save' 0.0468430 'load' 0.1019175 'size' 189897

- Testing binary JSON (BJData) files (.jdb) (nthread=1)...
  'blosc2blosclz', 'save' 0.0883078 'load' 0.2432985 'size' 819576
  'blosc2lz4',     'save' 0.0867996 'load' 0.2394990 'size' 817635
  'blosc2lz4hc',   'save' 2.4794559 'load' 0.2498981 'size' 3236580
  'blosc2zlib',    'save' 0.7477457 'load' 0.4873921 'size' 952705
  'blosc2zstd',    'save' 0.3435547 'load' 0.3754863 'size' 189897

- Testing binary JSON (BJData) files (.jdb) (nthread=32)...
  'blosc2blosclz', 'save' 0.0197186 'load' 0.1410989 'size' 819576
  'blosc2lz4',     'save' 0.0168068 'load' 0.1414074 'size' 817635
  'blosc2lz4hc',   'save' 0.0790011 'load' 0.0935394 'size' 3236580
  'blosc2zlib',    'save' 0.0608818 'load' 0.0985531 'size' 952705
  'blosc2zstd',    'save' 0.0370790 'load' 0.0945577 'size' 189897

a few observations:

1. single-threaded zlib/lzma are relatively slow, reflected by npz, zlib 
and lzma results

2. for a simple data structure like this one, using a JSON/text-based
wrapper vs a binary wrapper makes only a marginal difference in speed; the
only penalty is that text/JSON is ~33% larger than binary in size due to
base64

3. blosc2 overall delivered very impressive speed - even in single
thread, it can be faster than uncompressed npz or other standard
compression methods

4. several blosc2 compressors scaled well with more threads

5. it is a bit strange that blosc2lz4hc yielded a larger file size,
similar to that from standard lz4, while blosc2lz4 produced a size
comparable to zlib; I expected the reverse, because lz4hc is
supposed to give "higher-compression"
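
As a quick check of the ~33% figure in observation 2, the short sketch below compares the size overhead of base64 encoding with the .jdt/.jdb sizes reported for the zlib codec in the benchmark above. The random payload is only a stand-in for a compressed binary buffer.

```python
import base64
import os

# Stand-in for a compressed binary payload; base64 maps 3 bytes to 4 chars.
raw = os.urandom(300_000)
encoded = base64.b64encode(raw)
overhead = len(encoded) / len(raw) - 1

# Sizes reported in the benchmark above for the zlib codec.
jdt_size, jdb_size = 1084942, 813721
measured = jdt_size / jdb_size - 1

print(f"base64 overhead: {overhead:.1%}")
print(f".jdt vs .jdb (zlib): {measured:.1%}")
```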

One question I have is: how stable is your format spec? Do you expect
buffers compressed by your current blosc2 library to still be readable
by your future releases (or at least intend them to be)?

...
Not quite.  Blosc2 does not use the multi-threaded version of zstd; it
rather implements its own internal multi-threading engine and hence
all the codecs (and filters) benefit from it, so no need to rely on a
multi-threaded codec for speed.  Also, as filters execute prior to
codecs, they can reuse the same internal buffers, avoiding copies
(which is critical for achieving high I/O performance).
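
The idea of putting the threading around the codec rather than inside it can be sketched in a few lines of Python. This is not Blosc2's engine, only a toy analogue under stated assumptions: the buffer is split into fixed-size blocks, each block is compressed independently with a single-threaded codec (zlib here), and a thread pool runs the blocks concurrently. The helper names and the block size are hypothetical; the sketch relies on CPython's zlib releasing the GIL while it compresses.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor


def compress_blocks(data, block_size=1 << 16, nthreads=4):
    """Split data into fixed-size blocks and compress them concurrently."""
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    with ThreadPoolExecutor(max_workers=nthreads) as pool:
        # pool.map preserves input order, so blocks stay in sequence.
        return list(pool.map(zlib.compress, blocks))


def decompress_blocks(blocks, nthreads=4):
    """Decompress the blocks concurrently and join them back together."""
    with ThreadPoolExecutor(max_workers=nthreads) as pool:
        return b"".join(pool.map(zlib.decompress, blocks))


data = bytes(range(256)) * 4096          # 1 MiB of repetitive sample data
packed = compress_blocks(data)
print(len(packed), "blocks,", sum(len(b) for b in packed), "bytes compressed")
```

Because every block is self-contained, any single-threaded codec can be dropped in place of zlib without changing the threading logic.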
As said, we are not using packed ND array in msgpack, but rather,
using our own schema.  Blosc2 supports the concept of metalayers for
adding new meaning to the stored data
(see https://www.blosc.org/docs/Caterva-Blosc2-SciPy2019.pdf, slide
17).  One of these layers is Caterva, where we have added support
for MD arrays
<https://github.com/Blosc/caterva/blob/master/CATERVA_METALAYER.rst>.
Note that our implementation for supporting ND arrays uses two levels
of partitioning (chunks and blocks) to:
1. Allow finer granularity
<https://www.blosc.org/posts/caterva-slicing-perf/> in retrieving data.
2. Better adapt to the memory hierarchies (i.e. main memory and cache
levels in CPU) for efficiency
<https://www.blosc.org/posts/breaking-memory-walls/>.
OTOH, I have noticed that your patch for msgpack
<https://github.com/msgpack/msgpack/pull/267/files#diff-bc6661da34ecae62fbe724bb93fd69b91a7f81143f2683a81163231de7e3b545R334>
only suggests using uint32 as the type for array shape.  This would
prevent creating arrays where some dim is larger than 2^32.  Is that
intended?
see the last part of this post

https://github.com/msgpack/msgpack/issues/268#issuecomment-495050845

in BJData, the ND-array dimensional vector supports different integer 
types 
<https://github.com/NeuroJSON/bjdata/blob/Draft_2/Binary_JData_Specification....>
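
To illustrate the point about integer width for the shape vector, here is a small Python sketch using the standard `struct` module. It shows that a dimension of 2^32 cannot be packed as uint32, and how an encoder could pick the narrowest unsigned type that fits. The helper names are hypothetical and the output is not the BJData or msgpack wire format; it only demonstrates the width selection.

```python
import struct

# (struct format code, largest representable value), narrowest first
UNSIGNED_TYPES = [("B", 2**8 - 1), ("H", 2**16 - 1),
                  ("I", 2**32 - 1), ("Q", 2**64 - 1)]


def fits_uint32(shape):
    """Return True if every dimension can be stored as a uint32."""
    try:
        struct.pack(f"<{len(shape)}I", *shape)
        return True
    except struct.error:
        return False


def pack_shape(shape):
    """Pack a non-empty shape with the narrowest unsigned type that fits."""
    largest = max(shape)
    for code, limit in UNSIGNED_TYPES:
        if largest <= limit:
            return code, struct.pack(f"<{len(shape)}{code}", *shape)
    raise OverflowError("dimension does not fit in 64 bits")


print(fits_uint32((1000, 1000)))      # small dims fit in uint32
print(fits_uint32((2**32, 2)))        # a dim of 2^32 does not
print(pack_shape((2**32, 2))[0])      # falls back to a 64-bit type
```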
...
I see your point, and your intent is really appreciated.  It is just 
in the 10's GB and up domain that I see BJData a bit lacking in that 
text handling tools (strings, sed, not to mention editors, where you 
can run out of memory very soon) can become unnecessarily slow for 
retrieving the metainfo.  We really feel that such metainfo should go 
either at the beginning or at the end of the frame, where it can be 
found and processed way more efficiently.
Regardless of which serialization format is chosen, I think both projects
see the need to store hierarchical metadata alongside the data. I
agree with you that if reading/searching metadata is desired, the
header and trailer are the best places. For efficient search of metadata
while accommodating large amounts of binary data at scale,
CouchDB/MongoDB use "attachments" to hold large binary data. The
metadata tree and the attachment can be linked using a simple UUID or
JSON-reference string.
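
The linking scheme mentioned above can be sketched as follows. This is a hypothetical, in-memory illustration and not the CouchDB or MongoDB API: the `attachments` dict, the `_attachment` key and the helper functions are all invented for the example. The metadata tree stays small and searchable as JSON, while the binary payload lives elsewhere and is found through a UUID.

```python
import json
import uuid

attachments = {}  # hypothetical blob store: UUID string -> raw bytes


def attach(blob):
    """Store a binary blob and return a small JSON-friendly reference."""
    key = str(uuid.uuid4())
    attachments[key] = blob
    return {"_attachment": key, "nbytes": len(blob)}


def resolve(ref):
    """Follow a reference back to the stored binary blob."""
    return attachments[ref["_attachment"]]


doc = {
    "subject": "demo",
    "volume": {
        "dtype": "float64",
        "shape": [100, 100],
        "data": attach(bytes(80000)),   # 100*100 float64 values, all zero
    },
}

metadata_json = json.dumps(doc)         # no binary payload inside
print(len(metadata_json), "bytes of metadata,",
      len(resolve(doc["volume"]["data"])), "bytes of attached data")
```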
...
OTOH, I agree in that msgpack is not human readable directly, but the 
format is becoming so ubiquitous that you can find standard tools for 
introspecting metadata quite easily
It would be nice to store the header data in a map so it can be
self-explanatory (at just a small cost in size). I am even willing to go
as far as adding non-essential metadata that can help make the data file
as self-explanatory as possible, such as specs, schemas and parsers, just
because the format can and it costs almost nothing

https://github.com/rordenlab/dcm2niix/blob/v1.0.20220720/console/nii_dicom_b...
...
:
$ msgpack2json -di eye5_blosc2_blosclz.b2frame
[
...
]
And, as there are msgpack libraries for almost all of the currently 
used languages, I think that formats based on it are as open and 
transparent as we can get.
    again, I applaud the wonderful work from the blosc2 team and have
    no doubt it has many advantages to offer for sharing array data. On
    the other hand, I do want to advocate for considering the readability
    and portability of the data files. Essentially, the NeuroJSON specs
    <http://neurojson.org/#specs> (JData
    <https://github.com/NeuroJSON/jdata/blob/Draft_2/JData_specification.md>, BJData
    <https://github.com/NeuroJSON/bjdata/blob/Draft_2/Binary_JData_Specification....>,
    etc) are taking on the mission of building a "source-code language"
    for scientific data storage.
Thanks, I concur with your work too!  It is always nice to discuss
with people who have put a lot of thought into how to pack data
efficiently, and as simply as possible (but not any simpler!).
Actually, we might be adopting some aspects of JData
<https://github.com/fangq/jdata> to be able to store different objects
(arrays, tables, graphs, trees...) in the same frame in a future
possible extension of Blosc2.  Or, maybe using JData as the external
container for existing Blosc2 frames.  Very interesting discussion
indeed; many possibilities are open now!
will be absolutely happy to explore collaboration possibilities. will 
reach out offline.

Qianqian
...
Cheers,
Francesc

Qianqian Fang