On 8/25/22 12:25, Robert Kern wrote:
No one is really proposing another format, just a minor tweak to the existing NPY format.

Agreed. I was just following up on the previous comment about alternative formats (such as HDF5) and the pros and cons of NPY.

I don't quite know what this means. My installed version of `jq`, for example, doesn't seem to know what to do with these files.

❯ jq --version

❯ jq . eye5chunk_bjd_raw.jdb
parse error: Invalid numeric literal at line 1, column 38

The .jdb files are binary JSON files (specifically BJData), which jq does not currently support. To save as text-based JSON instead, change the suffix to .json or .jdt; the output is about 33% larger than the binary form because the compressed buffer is base64-encoded:

jd.save(y, 'eye5chunk_bjd_zlib.jdt', {'compression':'zlib'})

13694 Aug 25 12:54 eye5chunk_bjd_zlib.jdt
10338 Aug 25 15:41 eye5chunk_bjd_zlib.jdb
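
The roughly 33% growth is a property of base64 itself, which emits 4 ASCII characters for every 3 input bytes. A minimal sketch using only the Python standard library; the random payload merely stands in for a compressed buffer and is not the data from the files above:

```python
import base64
import math
import os


def base64_size(nbytes: int) -> int:
    """Length in bytes of the padded base64 encoding of `nbytes` of input."""
    return 4 * math.ceil(nbytes / 3)


payload = os.urandom(3000)            # stand-in for a compressed binary buffer
encoded = base64.b64encode(payload)   # what a text-based JSON file would carry
overhead = len(encoded) / len(payload) - 1

print(len(payload), len(encoded), f"{overhead:.1%}")
```

The sizes of real files will differ slightly from this ratio, since they also contain the metadata fields and JSON punctuation.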

jq . eye5chunk_bjd_zlib.jdt

    "_ArrayType_": "double",
    "_ArraySize_": [
    "_ArrayZipType_": "zlib",
    "_ArrayZipSize_": [
    "_ArrayZipData_": "..."


I think a fundamental problem here is that it looks like each element in the array is delimited. I.e. a `float64` value starts with b'D' then the 8 IEEE-754 bytes representing the number. When we're talking about memory-mappability, we are talking about having the on-disk representation being exactly what it looks like in-memory, all of the IEEE-754 floats contiguous with each other, so we can use the `np.memmap` `ndarray` subclass to represent the on-disk data as a first-class array object. This spec lets us mmap the binary JSON file and manipulate its contents in-place efficiently, but that's not what is being asked for here.

There are several BJData-compliant forms to store the same binary array losslessly. The most memory-efficient and disk-mmap-able (but not necessarily disk-efficient) form uses the ND-array container syntax that the BJData spec adds on top of UBJSON.

For example, a 100x200x300 3D float64 ($D) array can be stored as below (numbers are stored in binary form; the white space is only for readability and is not part of the file):

[$D #[$u#U3 100 200 300 value0 value1 ...

where the "value_i"s are a contiguous (row-major) binary stream of the float64 buffer, without the per-element marker ('D'), because that marker is absorbed into the optimized array header (the type that follows the "$" marker after "["). The data chunk is mmap-able. If you want a pre-determined initial offset, you can require the dimension vector (#[$u#U3 100 200 300 above, stored as uint16 via $u) to always use one integer type that is large enough, for example uint32 (m); the starting offset of the binary stream is then entirely predictable.
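
To make the mmap claim concrete, here is a minimal sketch that writes such a header by hand and maps the payload with `np.memmap`. It does not use a BJData library: the byte layout is built directly from the description above, with the dimension vector forced to uint32 (m), and it assumes little-endian storage. The function name `write_ndarray_sketch` and the file name are hypothetical, and the output has not been checked against a BJData parser.

```python
import os
import struct
import tempfile

import numpy as np


def write_ndarray_sketch(path, arr):
    """Write `arr` as float64 behind a hand-built ND-array header.

    Layout (assumed, little-endian): [ $ D # [ $ m # U <ndim:uint8> <dims:uint32...>
    followed by the raw row-major float64 buffer. Returns the payload offset.
    """
    arr = np.ascontiguousarray(arr, dtype="<f8")
    header = (
        b"[$D#[$m#U"
        + struct.pack("<B", arr.ndim)                  # number of dimensions
        + struct.pack(f"<{arr.ndim}I", *arr.shape)     # each dimension as uint32
    )
    with open(path, "wb") as f:
        f.write(header)
        f.write(arr.tobytes())                         # contiguous, no 'D' markers
    return len(header)


shape = (4, 5, 6)
data = np.arange(np.prod(shape), dtype="<f8").reshape(shape)
path = os.path.join(tempfile.mkdtemp(), "nd_sketch.jdb")

offset = write_ndarray_sketch(path, data)

# With fixed-width dimensions the offset depends only on ndim: 9 + 1 + 4 * ndim.
mapped = np.memmap(path, dtype="<f8", mode="r", offset=offset, shape=shape)
print(offset, mapped[1, 2, 3], data[1, 2, 3])
```

Because the dimension type is fixed, a reader can compute the offset from the number of dimensions alone, without parsing variable-width integers first.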

Multiple ND arrays can be appended directly at the root level; for example,

[$D #[$u#U3 100 200 300 value0 value1 ...
[$D #[$u#U3 100 200 300 value0 value1 ...
[$D #[$u#U3 100 200 300 value0 value1 ...
[$D #[$u#U3 100 200 300 value0 value1 ...

can store four 100x200x300 chunks of a 400x200x300 array.
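
A sketch of how a reader could reach one chunk of such a file without parsing the others, assuming every record uses the same hand-built header (uint32 dimensions, little-endian) and the same chunk shape. The shapes are shrunk so the example runs quickly, and the file name and the `chunk_offset` helper are hypothetical:

```python
import os
import struct
import tempfile

import numpy as np

chunk_shape = (2, 3, 4)      # small stand-in for 100x200x300
n_chunks = 4

# Same assumed header for every record: [ $ D # [ $ m # U <ndim> <dims as uint32>
header = (
    b"[$D#[$m#U"
    + struct.pack("<B", len(chunk_shape))
    + struct.pack("<3I", *chunk_shape)
)
chunk_nbytes = int(np.prod(chunk_shape)) * 8
record_nbytes = len(header) + chunk_nbytes

# The full array, split along the first axis into n_chunks pieces.
full = np.arange(n_chunks * int(np.prod(chunk_shape)), dtype="<f8").reshape(
    (n_chunks * chunk_shape[0],) + chunk_shape[1:]
)

path = os.path.join(tempfile.mkdtemp(), "chunks_sketch.jdb")
with open(path, "wb") as f:
    for k in range(n_chunks):
        rows = slice(k * chunk_shape[0], (k + 1) * chunk_shape[0])
        f.write(header)
        f.write(full[rows].tobytes())


def chunk_offset(k):
    """Byte offset of the payload of chunk k (0-based)."""
    return k * record_nbytes + len(header)


# Map only the third chunk; the other records are never read.
third = np.memmap(path, dtype="<f8", mode="r",
                  offset=chunk_offset(2), shape=chunk_shape)
print(chunk_offset(2), np.array_equal(third, full[4:6]))
```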

Alternatively, one can also use an annotated format (in JSON form: {"_ArrayType_":"double","_ArraySize_":[100,200,300],"_ArrayData_":[value1,value2,...]}) to store everything in a 1D contiguous buffer:

{U11 _ArrayType_ S U6 double U11 _ArraySize_ [$u#U3 100 200 300 U11 _ArrayData_ [$D #m 6000000 value1 value2 ...}

The contiguous buffer in _ArrayData_ section is also disk-mmap-able; you can also make additional requirements for the array metadata to ensure a predictable initial offset, if desired.
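
For the JSON form of the annotated format, a round trip can be sketched with plain `json` and NumPy. This is only an illustration of the three fields shown above, not the implementation used by any JData parser; the helper names `to_annotated` and `from_annotated` are hypothetical, and only the "double" type is handled.

```python
import json

import numpy as np


def to_annotated(arr):
    """Flatten a float64 array into the annotated dict shown above."""
    arr = np.asarray(arr, dtype=np.float64)
    return {
        "_ArrayType_": "double",
        "_ArraySize_": list(arr.shape),
        "_ArrayData_": arr.ravel(order="C").tolist(),   # row-major 1D buffer
    }


def from_annotated(obj):
    """Rebuild the ND array from the annotated dict (sketch: 'double' only)."""
    if obj["_ArrayType_"] != "double":
        raise ValueError("this sketch only handles 'double'")
    flat = np.array(obj["_ArrayData_"], dtype=np.float64)
    return flat.reshape(obj["_ArraySize_"])


original = np.linspace(0.0, 1.0, 24).reshape(2, 3, 4)
text = json.dumps(to_annotated(original))
restored = from_annotated(json.loads(text))

print(restored.shape, restored.dtype, np.array_equal(restored, original))
```

Shape and dtype travel in the metadata fields, so the data field itself can stay a flat buffer.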

Similarly, these annotated chunks can be appended in either JSON or binary JSON form, and the parsers can automatically handle both forms and convert them into the desired binary ND array with the expected type and dimensions.

here are the output file sizes in bytes:

8000128  eye5chunk.npy
5004297  eye5chunk_bjd_raw.jdb

Just a note: This difference is solely due to a special representation of `0` in 5 bytes rather than 8 (essentially, your encoder recognizes 0.0 as a special value and uses the `float32` encoding of it). If you had any other value making up the bulk of the file, this would be larger than the NPY due to the additional delimiter b'D'.

The two BJData forms that I mentioned above (ND-array syntax or annotated array) preserve the original precision and shape in round trips. BJData follows the recommendation of the UBJSON spec and automatically reduces data size only when there is no precision loss (for example, for integers or zeros), and this reduction is optional.
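
To put numbers on the size note quoted above: taking the per-element costs stated there as given (5 bytes for a zero, otherwise 1 marker byte plus 8 payload bytes), the delimited form beats a raw 8-byte-per-element buffer only when more than a quarter of the elements are zero. A small sketch of that arithmetic, with hypothetical element counts and ignoring any container overhead:

```python
def delimited_nbytes(n_total, n_zero):
    """Size of a per-element delimited encoding under the quoted costs."""
    return 5 * n_zero + 9 * (n_total - n_zero)


def raw_nbytes(n_total):
    """Size of a contiguous float64 buffer (no markers)."""
    return 8 * n_total


n = 1_000_000
for n_zero in (0, 250_000, 900_000):
    print(n_zero, delimited_nbytes(n, n_zero), raw_nbytes(n))
```

The break-even point follows from 5z + 9(n - z) < 8n, which simplifies to z > n/4.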

  10338  eye5chunk_bjd_zlib.jdb
   2206  eye5chunk_bjd_lzma.jdb


Robert Kern
