[Numpy-discussion] Re: Exporting numpy arrays to binary JSON (BJData) for better portability

Aug. 25, 2022

      On 8/25/22 18:33, Neal Becker wrote:
...
...
the loading time (from an nvme drive, Ubuntu 18.04, python 3.6.9,
    numpy 1.19.5) for each file is listed below:
    0.179s  eye1e4.npy (mmap_mode=None)
    0.001s  eye1e4.npy (mmap_mode=r)
    0.718s  eye1e4_bjd_raw_ndsyntax.jdb
    1.474s  eye1e4_bjd_zlib.jdb
    0.635s  eye1e4_bjd_lzma.jdb
Unsurprisingly, mmapped loading is the fastest option. It is true
    that the raw BJData file is about 4x slower to load than the npy
    file (0.718s vs. 0.179s), but given that the main chunk of the data
    is stored identically (as a contiguous buffer), I suppose that with
    some optimization of the decoder, the gap between the two can be
    substantially narrowed. The longer loading times of zlib/lzma
    (and similarly the saving times) reflect a trade-off between smaller
    file sizes and time spent on compression/decompression/disk IO.
I think the load time for mmap may be deceptive: it isn't actually
    loading anything, just mapping the file to memory.  Maybe a better
    benchmark is to actually process the data, e.g., find the mean,
    which would require reading the values.
Yes, that is correct. I meant to mention that it wasn't an
apples-to-apples comparison.
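
To illustrate why the mmap timing alone is deceptive, here is a minimal sketch that times np.load separately from np.load followed by np.mean. It is not the benchmark used in this thread: the helper name time_load_and_mean, the file name eye_small.npy and the reduced size n = 1000 are assumptions chosen so the example runs quickly, and the absolute timings will differ from the numbers reported here.

```python
import os
import tempfile
import time

import numpy as np


def time_load_and_mean(path, mmap_mode=None):
    """Return (load_seconds, total_seconds, mean) for one .npy file."""
    t0 = time.perf_counter()
    # With mmap_mode='r' this only maps the file; no array data is read yet.
    arr = np.load(path, mmap_mode=mmap_mode)
    t_load = time.perf_counter() - t0
    # Computing the mean forces every value to be read.
    mean = float(np.mean(arr))
    t_total = time.perf_counter() - t0
    del arr  # release the memory map before the file is removed
    return t_load, t_total, mean


n = 1000  # hypothetical size, much smaller than the 1e4 x 1e4 array above
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "eye_small.npy")
    np.save(path, np.eye(n))
    results = {mode: time_load_and_mean(path, mode) for mode in (None, "r")}

for mode, (t_load, t_total, mean) in results.items():
    print(f"mmap_mode={mode!r}: load {t_load:.4f}s, "
          f"load+mean {t_total:.4f}s, mean {mean:.6f}")
```

The "load" column is the number that a load-only benchmark reports; the "load+mean" column includes the cost of actually touching the data.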

The times for fully loading the data and printing the mean, measured by
running the line below,

    t=time.time(); newy=jd.load('eye1e4_bjd_raw_ndsyntax.jdb'); print(np.mean(newy)); t1=time.time() - t; print(t1)

are summarized below (I also added an lz4-compressed BJData/.jdb file via
jd.save(..., {'compression':'lz4'})):

    0.236s  eye1e4.npy (mmap_mode=None)                                  - size: 800000128 bytes
    0.120s  eye1e4.npy (mmap_mode=r)
    0.764s  eye1e4_bjd_raw_ndsyntax.jdb (with C extension _bjdata in sys.path) - size: 800000014 bytes
    0.599s  eye1e4_bjd_raw_ndsyntax.jdb (without C extension _bjdata)
    1.533s  eye1e4_bjd_zlib.jdb (without C extension _bjdata)            - size: 813721 bytes
    0.697s  eye1e4_bjd_lzma.jdb (without C extension _bjdata)            - size: 113067 bytes
    0.918s  eye1e4_bjd_lz4.jdb (without C extension _bjdata)             - size: 3371487 bytes

The mmapped loading remains the fastest, but the run time is now more
realistic. I thought that lz4 compression would offer much faster
decompression, but for this particular workload that isn't the case.
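
The size/time trade-off between codecs can be explored in isolation with a small sketch like the one below. It uses only Python's standard zlib and lzma modules on the raw bytes of a small identity matrix; it does not go through jd.save/jd.load, it leaves out lz4 (which needs a third-party package), and the helper name measure and the size n = 500 are assumptions for illustration. The numbers it prints are not comparable to the timings listed above.

```python
import lzma
import struct
import time
import zlib

n = 500  # hypothetical size; the thread uses a 1e4 x 1e4 array

# Raw little-endian float64 bytes of an n x n identity matrix, built without NumPy.
one = struct.pack("<d", 1.0)
zero_row = bytes(8 * n)
raw = b"".join(zero_row[:8 * i] + one + zero_row[8 * (i + 1):] for i in range(n))


def measure(name, compress, decompress):
    """Return (name, stored_size_in_bytes, decompression_seconds)."""
    packed = compress(raw)
    t0 = time.perf_counter()
    restored = decompress(packed)
    elapsed = time.perf_counter() - t0
    assert restored == raw  # round trip must be lossless
    return name, len(packed), elapsed


rows = [
    measure("raw", bytes, bytes),  # baseline: no compression at all
    measure("zlib", zlib.compress, zlib.decompress),
    measure("lzma", lzma.compress, lzma.decompress),
]

for name, size, seconds in rows:
    print(f"{name:5s} size: {size:8d} bytes  decompress: {seconds:.4f}s")
```

A highly sparse array such as an identity matrix compresses extremely well, so results for denser, noisier data may look quite different.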

It is also interesting to see that the bjdata C extension
<https://github.com/NeuroJSON/pybj/tree/master/src> did not help when
parsing a single large array compared to the native Python parser,
suggesting room for further optimization.

Qianqian


Qianqian Fang