To avoid derailing the other thread on extending .npy files, I am going
to start a new thread on alternative array-storage file formats using
binary JSON, in case there is such a need and interest among numpy users.
specifically, I want to first follow up on Bill's question below
regarding loading time.
On 8/25/22 11:02, Bill Ross wrote:
> Can you give load times for these?
as I mentioned in the earlier reply to Robert, the most
memory-efficient (i.e. fast-loading, disk-mmap-able) but not
necessarily disk-efficient (i.e. it may result in the largest data file
sizes) way to store an ND array in BJData is to use BJData's ND-array
container.
I have to admit that neither the jdata nor the bjdata module has been
extensively optimized for speed. with the current implementation,
here are the loading times for a large diagonal matrix (eye(10000)).
a BJData file storing a single eye(10000) array using the ND-array
container can be downloaded from here (file size: 1 MB zipped; if
decompressed, it is ~800 MB, the same as the npy file) - this file was
generated from a MATLAB encoder, but can be loaded using Python (see
below, re Robert).
800000128 eye1e4.npy
800000014 eye1e4_bjd_raw_ndsyntax.jdb
813721 eye1e4_bjd_zlib.jdb
113067 eye1e4_bjd_lzma.jdb
the loading time (from an NVMe drive, Ubuntu 18.04, Python 3.6.9,
numpy 1.19.5) for each file is listed below:
0.179s eye1e4.npy (mmap_mode=None)
0.001s eye1e4.npy (mmap_mode=r)
0.718s eye1e4_bjd_raw_ndsyntax.jdb
1.474s eye1e4_bjd_zlib.jdb
0.635s eye1e4_bjd_lzma.jdb
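for anyone who wants to reproduce the npy side of the comparison, here is a minimal stdlib+numpy timing sketch (it uses a smaller eye(1000) matrix so it runs quickly; it does not regenerate the BJData files or the exact numbers above):

```python
import os
import tempfile
import time

import numpy as np

# smaller stand-in for the eye(10000) matrix used in the measurements above
a = np.eye(1000)

fd, path = tempfile.mkstemp(suffix=".npy")
os.close(fd)
np.save(path, a)

t0 = time.perf_counter()
full = np.load(path)                 # reads the whole buffer into memory
t_full = time.perf_counter() - t0

t0 = time.perf_counter()
lazy = np.load(path, mmap_mode="r")  # maps the file; pages load on demand
t_mmap = time.perf_counter() - t0

print(f"full load: {t_full:.4f}s  mmap load: {t_mmap:.4f}s")
```

the mmap open only sets up the mapping, which is why it appears near-instant in the table above; the cost of touching the data is deferred to first access.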
clearly, mmapped loading is the fastest option, without surprise;
it is true that the raw bjdata file loads about 5x slower than the npy
file, but given that the main chunk of the data is stored identically
(as a contiguous buffer), I suppose that with some optimization of the
decoder, the gap between the two can be substantially narrowed. The
longer loading times (and, similarly, saving times) for zlib/lzma
reflect a trade-off between smaller file sizes and the time spent on
compression/decompression/disk-IO.
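the size side of that trade-off is easy to see even without the BJData container, since the compression is applied to the same contiguous buffer; a quick stdlib-only sketch on a smaller identity matrix:

```python
import lzma
import zlib

import numpy as np

a = np.eye(1000)          # mostly zeros, so it compresses extremely well
raw = a.tobytes()         # the same contiguous buffer an ND-array record stores

z = zlib.compress(raw)
x = lzma.compress(raw)

print(len(raw), len(z), len(x))  # raw is 8 MB; both compressed forms are tiny

# decompression restores the identical buffer
restored = np.frombuffer(zlib.decompress(z), dtype=a.dtype).reshape(a.shape)
```

the extreme ratios in the file listing above come from eye() being almost all zeros; typical dense data will compress far less, but the time-vs-size trade-off is the same.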
by no means am I saying that the binary JSON format is ready to
deliver better speed with its current non-optimized implementation. I
just want to bring attention to this class of formats, and highlight
that their flexibility gives abundant mechanisms to create fast,
disk-mapped IO, while allowing additional benefits such as compression,
unlimited metadata for future extensions, etc.
> 8000128 eye5chunk.npy
> 5004297 eye5chunk_bjd_raw.jdb
> 10338 eye5chunk_bjd_zlib.jdb
> 2206 eye5chunk_bjd_lzma.jdb
> For my case, I'd be curious about the time to add one
> 1T-entries file to another.
as I mentioned in the previous reply, bjdata is appendable, so you can
simply append another array (or a slice) to the end of the file.
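the append idea can be illustrated with plain contiguous buffers (a hand-rolled sketch of the concept, not the bjdata API; in an actual BJData file each appended record also carries its own type/size markers):

```python
import os
import tempfile

import numpy as np

a = np.arange(6, dtype=np.float64).reshape(2, 3)
b = np.arange(6, 12, dtype=np.float64).reshape(2, 3)

fd, path = tempfile.mkstemp(suffix=".bin")
os.close(fd)

with open(path, "wb") as f:
    a.tofile(f)

# appending a second record does not require rewriting the existing data
with open(path, "ab") as f:
    b.tofile(f)

both = np.fromfile(path, dtype=np.float64).reshape(4, 3)
os.remove(path)
```

because nothing before the append point is touched, the cost of adding to a huge file is proportional to the new data only, not to the existing 1T entries.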
> Thanks,
> Bill
also related, re @Robert's question below: the bjdata module currently
only supports the ND-array container in the decoder (i.e. mapping such
a buffer to a numpy.ndarray) - it should be relatively trivial to add
it to the encoder as well.
on the other side, the annotated format is currently supported: one can
call the jdata module (responsible for annotation-level
encoding/decoding) as shown in my sample code, which then calls bjdata
internally for data serialization.
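for readers unfamiliar with the annotated form, here is a hand-rolled sketch of what a JData-style ND-array record looks like in plain JSON (the `_ArrayType_`/`_ArraySize_`/`_ArrayData_` names follow the JData specification; the jdata module produces and parses such records for you, so this is only an illustration of the layout):

```python
import json

import numpy as np

a = np.eye(3)

# annotated ND-array record: metadata travels with the data as ordinary JSON
rec = {
    "_ArrayType_": str(a.dtype),        # e.g. "float64"
    "_ArraySize_": list(a.shape),
    "_ArrayData_": a.ravel().tolist(),  # row-major flattened payload
}

text = json.dumps(rec)

# decoding reverses the annotation back into a numpy.ndarray
back = json.loads(text)
restored = np.asarray(back["_ArrayData_"],
                      dtype=back["_ArrayType_"]).reshape(back["_ArraySize_"])
```

the same record serialized with bjdata instead of the json module is what gives the binary files listed above; the annotation layer is identical in both.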
> Okay. Given your wording, it looked like you were claiming
> that the binary JSON was supported by the whole ecosystem.
> Rather, it seems like you can either get binary encoding OR
> the ecosystem support, but not both at the same time.
all in relative terms, of course - JSON has ~100 parsers listed on its
website, MessagePack - another flavor of binary JSON - lists ~50-60
parsers, and UBJSON lists ~20 parsers. I am not familiar with npy
parsers, but googling returns only a few.
also, most binary JSON implementations provide tools to convert to
JSON and back, so, in that sense, whatever JSON has in its ecosystem
can "potentially" be used for binary JSON files if one wants to. There
are also recent publications comparing the differences between various
binary JSON formats, in case anyone is interested.