No one is really proposing another format, just a minor tweak to the existing NPY format.
Agreed. I was just following up on the previous comment about
alternative formats (such as HDF5) and the pros/cons of NPY.
I don't quite know what this means. My installed version of `jq`, for example, doesn't seem to know what to do with these files.
❯ jq --version
jq-1.6
❯ jq . eye5chunk_bjd_raw.jdb
parse error: Invalid numeric literal at line 1, column 38
The .jdb files are binary JSON files (specifically BJData), which
jq does not currently support. To save as text-based JSON, you
change the suffix to .json or .jdt; this results in a ~33% size
increase compared to the binary file due to base64 encoding.
jd.save(y, 'eye5chunk_bjd_zlib.jdt', {'compression':'zlib'});
13694 Aug 25 12:54 eye5chunk_bjd_zlib.jdt
10338 Aug 25 15:41 eye5chunk_bjd_zlib.jdb
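As a rough check of the ~33% figure, here is a minimal sketch using only the Python standard library. The helper name `base64_len` and the test payload are made up for illustration; only the two file sizes come from the listing above.

```python
import base64
import math


def base64_len(n_bytes: int) -> int:
    """Length of the padded base64 encoding of n_bytes raw bytes."""
    return 4 * math.ceil(n_bytes / 3)


# Arbitrary binary payload (hypothetical, just to exercise the encoder).
payload = bytes(range(256)) * 40
encoded = base64.b64encode(payload)

# base64 maps every 3 input bytes to 4 output characters.
ratio = len(encoded) / len(payload)

# Ratio of the two file sizes quoted in the listing above (.jdt / .jdb).
observed = 13694 / 10338

print(f"base64 ratio: {ratio:.4f}, observed .jdt/.jdb ratio: {observed:.4f}")
```

Both ratios come out close to 4/3, which is consistent with base64 being the main source of the size difference.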
jq . eye5chunk_bjd_zlib.jdt
[
  {
    "_ArrayType_": "double",
    "_ArraySize_": [
      200,
      1000
    ],
    "_ArrayZipType_": "zlib",
    "_ArrayZipSize_": [
      1,
      200000
    ],
    "_ArrayZipData_": "..."
  },
  ...
]
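For readers without a JData parser, a record of this shape can in principle be decoded with the standard library alone. The sketch below assumes that `_ArrayZipData_` is the base64 text of a zlib stream holding little-endian, row-major doubles; the base64 part is stated above, while the byte order and layout are assumptions made here. The helper names `encode_annotated` and `decode_annotated` are hypothetical and are not part of the `jdata` module.

```python
import base64
import json
import math
import struct
import zlib


def encode_annotated(values, shape):
    """Build a record shaped like the jq output above (assumed layout)."""
    raw = struct.pack(f"<{len(values)}d", *values)
    return {
        "_ArrayType_": "double",
        "_ArraySize_": list(shape),
        "_ArrayZipType_": "zlib",
        "_ArrayZipSize_": [1, len(values)],
        "_ArrayZipData_": base64.b64encode(zlib.compress(raw)).decode("ascii"),
    }


def decode_annotated(record):
    """Decode a 2D zlib-compressed double record into nested lists."""
    if record["_ArrayType_"] != "double" or record["_ArrayZipType_"] != "zlib":
        raise ValueError("this sketch only handles zlib-compressed doubles")
    raw = zlib.decompress(base64.b64decode(record["_ArrayZipData_"]))
    count = math.prod(record["_ArrayZipSize_"])
    flat = struct.unpack(f"<{count}d", raw)
    rows, cols = record["_ArraySize_"]
    return [list(flat[r * cols:(r + 1) * cols]) for r in range(rows)]


# Round trip through JSON text, as a .jdt/.json file would store it.
text = json.dumps([encode_annotated([0.0, 1.0, 2.0, 3.0, 4.0, 5.0], (2, 3))])
decoded = decode_annotated(json.loads(text)[0])
```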
I think a fundamental problem here is that it looks like each element in the array is delimited. I.e. a `float64` value starts with b'D' then the 8 IEEE-754 bytes representing the number. When we're talking about memory-mappability, we are talking about having the on-disk representation being exactly what it looks like in-memory, all of the IEEE-754 floats contiguous with each other, so we can use the `np.memmap` `ndarray` subclass to represent the on-disk data as a first-class array object. This spec lets us mmap the binary JSON file and manipulate its contents in-place efficiently, but that's not what is being asked for here.
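To make the distinction concrete, here is a small standard-library sketch of the two layouts described above. The sample values are made up; the only point is that the contiguous buffer can be viewed as doubles without copying, while the per-element b'D' markers force a copy.

```python
import struct

values = [0.0, 1.5, -2.25, 3.0]

# Delimited layout: every value carries its own b'D' marker (9 bytes each).
delimited = b"".join(b"D" + struct.pack("=d", v) for v in values)

# Contiguous layout: raw IEEE-754 doubles back to back (8 bytes each).
contiguous = struct.pack(f"={len(values)}d", *values)

# The contiguous buffer can be reinterpreted in place, with no copy.
view = memoryview(contiguous).cast("d")

# The delimited buffer has a 9-byte stride with a marker in between, so the
# values have to be picked out one at a time and copied.
copied = [struct.unpack_from("=d", delimited, 9 * i + 1)[0]
          for i in range(len(values))]
```

The same reasoning applies to a file on disk: only the contiguous layout can be handed to `np.memmap` as a regular array.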
There are several BJData-compliant forms to store the same binary array losslessly. The most memory-efficient and disk-mmap-able (but not necessarily disk-efficient) form is to use the ND-array container syntax that the BJData spec added on top of UBJSON.
For example, a 100x200x300 3D float64 ($D) array can be stored as below (numbers are stored in binary form; white space should be removed):
[$D #[$u#U3 100 200 300 value0 value1 ...
where the "value_i"s form a contiguous (row-major) binary stream of
the float64 buffer without the per-element marker ('D'), because
the marker is absorbed into the optimized header of the array (the
type marker following "$" after "["). The data chunk is mmap-able.
If you want a predetermined initial offset, you can force the
dimension vector (#[$u#U3 100 200 300) to use an integer type
large enough for any dimension, for example uint32 (m) instead of
uint16 (u); the starting offset of the binary stream will then be
entirely predictable.
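As an illustration of the fixed-offset idea, the sketch below writes a header following the layout as written above, with uint32 (m) dimensions, and maps the data with `np.memmap`. It assumes little-endian byte order and has not been validated against a BJData parser; the file name and the small 2x3x4 shape are made up for the example.

```python
import os
import struct
import tempfile

import numpy as np

shape = (2, 3, 4)
data = np.arange(24, dtype="<f8").reshape(shape)

# "[$D#[$m#U" + count (3) + three uint32 dimensions, as sketched above.
# With a fixed dimension type the header length depends only on ndim.
header = b"[$D#[$m#U" + bytes([len(shape)]) + struct.pack("<3I", *shape)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "chunk.bjd")  # hypothetical file name
    with open(path, "wb") as f:
        f.write(header)
        f.write(data.tobytes())  # row-major float64 payload, no markers

    # Map the payload directly; the offset is known without parsing data.
    mapped = np.memmap(path, dtype="<f8", mode="r",
                       offset=len(header), shape=shape)
    roundtrip_ok = bool(np.array_equal(mapped, data))
    del mapped  # release the mapping before the directory is removed
```

For three dimensions the header is 22 bytes, so the payload does not start on an 8-byte boundary; NumPy can still read the mapped array.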
Multiple ND arrays can be directly appended at the root level.
For example,
[$D #[$u#U3 100 200 300 value0 value1 ...
[$D #[$u#U3 100 200 300 value0 value1 ...
[$D #[$u#U3 100 200 300 value0 value1 ...
[$D #[$u#U3 100 200 300 value0 value1 ...
can store four 100x200x300 chunks of a 400x200x300 array.
Alternatively, one can also use an annotated format (in JSON
form: {"_ArrayType_":"double","_ArraySize_":[100,200,300],"_ArrayData_":[value1,value2,...]})
to store everything in a 1D contiguous buffer:
{
U11 _ArrayType_ S U6 double
U11 _ArraySize_ [$u#U3 100 200 300
U11 _ArrayData_ [$D #m 6000000 value1 value2 ...
}
The contiguous buffer in the _ArrayData_ section is also
disk-mmap-able; you can also impose additional requirements on the
array metadata to ensure a predictable initial offset, if desired.
Similarly, these annotated chunks can be appended in either JSON
or binary JSON form, and the parsers can automatically handle
both forms and convert them into the desired binary ND array with
the expected type and dimensions.
Here are the output file sizes in bytes:
8000128 eye5chunk.npy
5004297 eye5chunk_bjd_raw.jdb
Just a note: This difference is solely due to a special representation of `0` in 5 bytes rather than 8 (essentially, your encoder recognizes 0.0 as a special value and uses the `float32` encoding of it). If you had any other value making up the bulk of the file, this would be larger than the NPY due to the additional delimiter b'D'.
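The arithmetic behind this note can be checked quickly. The sketch below assumes a 128-byte NPY header, which is what the 8000128 figure suggests; the element count is inferred from the file size rather than stated in the thread.

```python
npy_size = 8000128      # eye5chunk.npy, from the listing above
bjd_raw_size = 5004297  # eye5chunk_bjd_raw.jdb, from the listing above

# Assumed 128-byte NPY header; the rest is 8 bytes per float64 element.
n_elements = (npy_size - 128) // 8

# One marker byte plus the value, for every element.
all_as_float64 = n_elements * (1 + 8)  # b'D' + 8 bytes
all_as_float32 = n_elements * (1 + 4)  # 1-byte marker + 4 bytes

print(n_elements, all_as_float64, all_as_float32)
```

With every element stored as a delimited float64, the file would be about 9 MB, larger than the NPY file. With 5 bytes per element it would be about 5 MB, which is close to the observed 5004297 bytes.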
The two BJData forms that I mentioned above (ND-array syntax or
annotated array) preserve the original precision and shape in
round trips. BJData follows the recommendations of the UBJSON spec
and automatically reduces data size only when there is no
precision loss (such as for integers or zeros), but this is
optional.
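One way an encoder might decide whether a float64 value can be narrowed without losing precision is a float32 round-trip test. This is only a sketch of the idea, not a description of how any particular BJData encoder is implemented; `fits_float32` is a hypothetical helper.

```python
import struct


def fits_float32(x: float) -> bool:
    """True if x survives a float64 -> float32 -> float64 round trip."""
    try:
        return struct.unpack("<f", struct.pack("<f", x))[0] == x
    except OverflowError:
        # Magnitude too large for float32.
        return False


checks = {x: fits_float32(x) for x in (0.0, 1.0, 0.5, 0.1, 1e300)}
```

Values such as 0.0, 1.0, and 0.5 pass the test, while 0.1 and 1e300 do not, so only the former could be stored in the shorter form losslessly.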
--
10338 eye5chunk_bjd_zlib.jdb
2206 eye5chunk_bjd_lzma.jdb
Qianqian
Robert Kern
_______________________________________________
NumPy-Discussion mailing list -- numpy-discussion@python.org