On Thu, Aug 25, 2022 at 3:47 PM Qianqian Fang <fangqq@gmail.com> wrote:
On 8/25/22 12:25, Robert Kern wrote:
I don't quite know what this means. My installed version of `jq`, for example, doesn't seem to know what to do with these files.

❯ jq --version

❯ jq . eye5chunk_bjd_raw.jdb
parse error: Invalid numeric literal at line 1, column 38

the .jdb files are binary JSON files (specifically BJData) that jq does not currently support; to save as text-based JSON, you change the suffix to .json or .jdt - the base64 encoding of the binary payload results in a ~33% size increase compared to the binary form
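As a quick illustration of where that ~33% figure comes from (a minimal sketch, not tied to any BJData library): base64 encodes every 3 raw bytes as 4 ASCII characters.

```python
import base64
import struct

# 24 bytes of raw little-endian float64 data (3 values x 8 bytes)
raw = struct.pack('<3d', 1.0, 2.0, 3.0)

# base64 expands 3 raw bytes into 4 text characters
encoded = base64.b64encode(raw)

print(len(raw), len(encoded))      # 24 32
print(len(encoded) / len(raw))     # 1.333..., i.e. the ~33% increase
```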

Okay. Given your wording, it looked like you were claiming that the binary JSON was supported by the whole ecosystem. Rather, it seems like you can either get binary encoding OR the ecosystem support, but not both at the same time.
I think a fundamental problem here is that it looks like each element in the array is delimited. I.e. a `float64` value starts with b'D' followed by the 8 IEEE-754 bytes representing the number. When we talk about memory-mappability, we mean that the on-disk representation is exactly what it looks like in memory - all of the IEEE-754 floats contiguous with each other - so we can use the `np.memmap` `ndarray` subclass to represent the on-disk data as a first-class array object. This spec lets us mmap the binary JSON file and manipulate its contents in-place efficiently, but that's not what is being asked for here.
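To make the distinction concrete, here is a small sketch (my own illustration, assuming the per-element b'D' marker layout described above): the delimited form interleaves markers with the IEEE-754 payloads, while only a contiguous block of raw doubles can back an `np.memmap` array directly.

```python
import os
import struct
import tempfile

import numpy as np

values = [1.0, 2.0, 3.0]

# Delimited encoding as described above: each float64 is prefixed with b'D',
# so the IEEE-754 payloads are interrupted by markers and cannot directly
# back an ndarray.
delimited = b''.join(b'D' + struct.pack('<d', v) for v in values)
print(len(delimited))  # 27 bytes: 3 x (1 marker + 8 payload)

# Contiguous encoding: the raw little-endian doubles back to back.
path = os.path.join(tempfile.mkdtemp(), 'data.bin')
with open(path, 'wb') as f:
    f.write(np.array(values, dtype='<f8').tobytes())

# Only the contiguous form can be exposed in place as a first-class array.
arr = np.memmap(path, dtype='<f8', mode='r')
print(arr)  # [1. 2. 3.]
```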

there are several BJData-compliant forms to store the same binary array losslessly. The most memory-efficient and disk-mmappable (but not necessarily disk-efficient) form is the ND-array container syntax that the BJData spec extends over UBJSON.
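A rough sketch of why that ND-array form is mmap-friendly (illustrative only - the header bytes below are a placeholder, not the exact header the BJData spec defines): the container puts the type and dimensions in a header, followed by one contiguous block of raw values, so `np.memmap` can expose the payload in place by skipping the header with a byte offset.

```python
import os
import tempfile

import numpy as np

a = np.arange(12, dtype='<f8').reshape(3, 4)

path = os.path.join(tempfile.mkdtemp(), 'ndarray.bin')
header = b'[$D#'  # placeholder; the real BJData header also encodes the dims
with open(path, 'wb') as f:
    f.write(header)       # type/dimension header (schematic)
    f.write(a.tobytes())  # raw contiguous little-endian float64 payload

# The contiguous payload can be mapped in place, skipping the header bytes.
m = np.memmap(path, dtype='<f8', mode='r', offset=len(header), shape=(3, 4))
print(np.array_equal(m, a))  # True
```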

Are any of them supported by a Python BJData implementation? I didn't see any option to get that done in the `bjdata` package you recommended, for example.


Robert Kern