JSON format for multi-dimensional data
philippe@loco-labs.io wrote:
Hi community,

This memo proposes a compact and reversible (lossless round-trip) JSON interface for multi-dimensional data, in particular for NumPy (see issue #12481). Links to the documents are at the end of the memo.

The JSON-NTV (Named and Typed Value) format is a JSON format which integrates a notion of type. It has also been implemented for tabular data (see the NTV-pandas package, available in the pandas ecosystem, and the PDEP12 specification). Using this format has the following advantages:

- it takes into account data types not known to NumPy,
- it is reversible (lossless round-trip),
- it interoperates with other tools for tabular or multi-dimensional data (e.g. pandas, Xarray),
- the JSON format is easy to share,
- binary encoding is possible (e.g. the CBOR format),
- it can integrate data of different natures.

The associated Jupyter Notebook presents some key points of this proposal (first draft). Summary:

- introduction
- benefits
- multi-dimensional data
- multi-dimensional types
- JSON format
- using the NTV format
- equivalence of the tabular and multidimensional formats
- Astropy specific points
- units and quantities
- coordinates
- tables
- other structures

This subject seems important to me (in particular for interoperability issues) and I would like your feedback before working on the implementation. Especially:

- do you think this "semantic" format is interesting to use?
- do you have any particular expectations, or subjects that I need to study beforehand?
- do you have any examples or test cases to offer me?

And of course, any type of remark and comment is welcome. Thanks in advance!

Links:

- Jupyter notebook: https://nbviewer.org/github/loco-philippe/Environmental-Sensing/blob/main/py...
- JSON-NTV format: https://www.ietf.org/archive/id/draft-thomy-json-ntv-02.html
- JSON-NTV overview: https://nbviewer.org/github/loco-philippe/NTV/blob/main/example/example_ntv....
- NTV tabular format: https://www.ietf.org/archive/id/draft-thomy-ntv-tab-00.html#name-tabular-str...
- NTV-pandas package: https://github.com/loco-philippe/ntv-pandas/blob/main/README.md
- NTV-pandas examples: https://nbviewer.org/github/loco-philippe/ntv-pandas/blob/main/example/examp...
- pandas specification (PDEP12): https://pandas.pydata.org/pdeps/0012-compact-and-reversible-JSON-interface.h...
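For readers new to the format, here is a minimal sketch of what an NTV-style payload could look like, based on the single-member {"name:type": value} convention described in the JSON-NTV draft. The field names and type tags below are illustrative assumptions, not examples taken from the specification.

```python
import json

# Hypothetical NTV-style payload: each key carries "name:type",
# and the values are plain JSON, so any tool can read the file.
payload = {
    "temperature:float": [[10.5, 11.2], [9.8, 12.1]],
    "date:datetime": ["2024-02-01T00:00:00", "2024-02-02T00:00:00"],
    "location:point": [[2.35, 48.85], [5.37, 43.30]],
}

text = json.dumps(payload)          # shareable, framework-neutral JSON
assert json.loads(text) == payload  # reversible at the JSON level
```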
Matti Picus wrote:
On 20/2/24 01:24, philippe@loco-labs.io wrote:
There is an open issue [1] about such a format; is this the same or different?

We discussed this at the latest triage meeting. While interoperability is one of NumPy's goals, and something we care deeply about, we were not sure how this initiative will play out. Perhaps, like the Pandas package, it should live outside NumPy for a while until some wider consensus could emerge.

We did have a few questions about the standard:

- How does it handle sharing data? NumPy can handle very large ndarrays, and a read-only container with a shared memory location, like in DLPack [0], seems more natural than a format that precludes sharing data.
- Is there a size limitation, either on the data or on the number of dimensions? Could this format represent, for instance, data with more than 100 dimensions, which could not be mapped back to NumPy?

Matti

[0] https://dmlc.github.io/dlpack/latest/
[1] https://github.com/numpy/numpy/issues/12481
philippe@loco-labs.io wrote:
Thank you Matti for this response. I completed issue 12481 because, in my opinion, the format proposal responds to this issue. However, if you think a specific issue is preferable, I can create one.

To fully understand the proposed standard: it represents multidimensional data of any type. The only constraint is that each data item can be represented in JSON. This is of course the case for all pandas types, but a value can also be, for example, a year, a polygon, a URI, or a type defined in Darwin Core or in Schema.org... This means that each library or framework must transform this JSON data into an internal value (e.g. a polygon can be translated into a shapely object); a sketch of such a conversion follows below. The defined types are described in the NTV Internet-Draft [2].
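As an illustration of that last point, here is a minimal sketch of the kind of decoding hook a framework could provide, mapping a type tag to a constructor for an internal object. The tags and converters below are hypothetical, and tuples stand in for shapely objects to keep the example dependency-free.

```python
import datetime

def decode_point(value):
    # With shapely available, this could be shapely.geometry.Point(*value).
    return tuple(value)

# Hypothetical mapping from type tags to internal constructors.
DECODERS = {
    "datetime": datetime.datetime.fromisoformat,
    "point": decode_point,
    "year": int,
}

def to_internal(type_tag, json_value):
    """Convert a JSON value carrying a type tag into an internal value."""
    decoder = DECODERS.get(type_tag)
    return decoder(json_value) if decoder else json_value

print(to_internal("datetime", "2024-02-20T01:24:00"))  # datetime object
print(to_internal("point", [2.35, 48.85]))             # (2.35, 48.85)
```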
Concerning the first question: the purpose of this standard is complementary to what DLPack proposes (DLPack offers standard access mechanisms to data in memory, which avoids duplication between frameworks):

- the format is a neutral, reversible exchange format built on JSON (and therefore with duplication) which can be used independently of any framework;
- the data types are numerous, with a broader scope than those offered by DLPack (numeric types only).
Regarding the second question: no, there is no limitation on data size or dimensions linked to the format (JSON does not impose limits on array sizes).
> Perhaps, like the Pandas package, it should live outside NumPy for a while until some wider consensus could emerge.
Regarding this remark, this is indeed a possible option, but it depends on the answer to this question: does NumPy want a neutral JSON exchange format for exchanging data with other frameworks (tabular, multidimensional or other)? This is why I am interested in a better understanding of the needs (see the end of the initial email).

[2] https://www.ietf.org/archive/id/draft-thomy-json-ntv-02.html#appendix-A
Ralf Gommers wrote:
On Sun, Feb 25, 2024 at 12:34 AM <philippe@loco-labs.io> wrote:
I'd say it's unlikely. There are a lot of data storage formats; NumPy has support for almost none of them, and for the few that we do have support for (e.g. CSV) the reason for having that inside of NumPy is mostly historical. There are packages like Zarr, h5py, PyTables, scipy.io that implement support for reading and writing NumPy arrays in a large number of I/O formats. Typically there is no reason for such code to live inside NumPy. I'd expect the same to be true for JSON. That isn't to say that a new JSON-based storage format wouldn't be of interest to NumPy users - they may very well need it. We do have docs that mention popular I/O formats, and if yours gets popular we may want to add it to those docs: https://numpy.org/devdocs/user/how-to-io.html#write-or-read-large-arrays (that could use more detail too). Cheers, Ralf
philippe@loco-labs.io wrote:
Thanks Ralf, this answers my question about the absence of a NumPy I/O format. There are three other points related to this format proposal:

- integration of a semantic level above the number/character formats, as is done for datetime (e.g. units, point/polygon, URI, email, IP, encoding...),
- a neutral (platform-independent) format for multidimensional data, including multi-variables, axes, indexes and metadata,
- finally, the conversion of tabular data into multi-dimensional data (of dimension greater than 2) via a neutral format.

Do these points interest NumPy, or would they rather concern applications built on a NumPy base?
fangqq@northeastern.edu wrote:
Aside from the previously mentioned ticket https://github.com/numpy/numpy/issues/12481, I also made a similar proposal, posted in 2021:

- https://github.com/numpy/numpy/issues/20461
- https://mail.python.org/archives/list/numpy-discussion@python.org/message/EV...

Lightweight JSON annotations for various data structures (trees, tables, graphs), and especially ND-arrays, are defined in the JData spec:

- https://github.com/NeuroJSON/jdata/blob/master/JData_specification.md#data-a...

JSON/binary JSON annotation encoders/decoders have been implemented for Python (https://pypi.org/project/jdata/), MATLAB/Octave (https://github.com/fangq/jsonlab), JavaScript/NodeJS (https://www.npmjs.com/package/jda), as well as C++ (JSON for Modern C++, https://json.nlohmann.me/features/binary_formats/bjdata/).

I have been using this annotation extensively, in both JSON and binary JSON, in my neuroimaging data portal, https://neurojson.io/. For example, for 3D data:

- https://neurojson.org/db/fieldtrip(atlas)/FieldTrip--Brainnetome--Brainnetom...
- https://neurojson.org/db/fieldtrip(atlas)/FieldTrip--Brainnetome--Brainnetom...

and for mesh data:

- https://neurojson.org/db/brainmeshlibrary/BrainWeb--Subject04--gm--surf
- https://neurojson.org/db/brainmeshlibrary/BrainWeb--Subject04--gm--surf#prev...

The ND-array annotation supports binary data with lossless compression. In a renewed thread posted in 2022, I also tested the blosc2 (https://www.blosc.org/) compression codecs and got excellent read/write speeds:

- https://mail.python.org/archives/list/numpy-discussion@python.org/thread/JIT...
- https://mail.python.org/archives/list/numpy-discussion@python.org/message/TU...

The blosc2 compression codecs are supported in my Python and MATLAB/C parsers.

Qianqian
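To make the annotated ND-array concrete, here is a hand-written sketch of a compressed array record following the JData "_ArrayType_"/"_ArraySize_"/"_ArrayZipData_" keys from the spec linked above. The jdata package would normally generate such records; the exact field values below (using zlib rather than blosc2) are one reading of the spec, not verified package output.

```python
import base64
import json
import zlib

import numpy as np

a = np.arange(6, dtype=np.float64).reshape(2, 3)

# JData-style annotated ND-array with zlib-compressed, base64-encoded data.
record = {
    "_ArrayType_": "double",
    "_ArraySize_": list(a.shape),
    "_ArrayZipType_": "zlib",
    "_ArrayZipSize_": [1, a.size],  # shape of the flattened buffer
    "_ArrayZipData_": base64.b64encode(zlib.compress(a.tobytes())).decode(),
}
text = json.dumps(record)

# Lossless round trip back to a NumPy array.
r = json.loads(text)
raw = zlib.decompress(base64.b64decode(r["_ArrayZipData_"]))
b = np.frombuffer(raw, dtype=np.float64).reshape(r["_ArraySize_"])
assert np.array_equal(a, b)
```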
philippe@loco-labs.io wrote:
Bravo for this very comprehensive work, which covers all technical/scientific data structures well! I think we share the same goal of improving interoperability through the use of neutral formats. However, I see some differences:

- I focus my efforts more particularly on raising the semantic level, with a generalization and an extension of the notion of type;
- I also try not to call into question what already works well, so that the impacts are minimal. For example, we can have a mixed JSON structure integrating one part of the data in NTV format and another part outside it. Likewise, for tabular data, we can go from a "format" type to a "semantic" type without significant impact for a tool like pandas.

More particularly, concerning multidimensional data, it seems to me that we should not limit ourselves to the ndarray structure, but should also integrate associated structures such as those defined in Xarray.
Dom Grigonis wrote:
I sure like the idea of this. I agree that this should be external to NumPy, at least until it becomes a standard in the sense that JSON itself is. And the n-dimensional array should ideally be extended to indexed structures with named dimensions and attributes, to accommodate a superset of xarray & scipp.

Regards,
DG
philippe@loco-labs.io wrote:
Thank you Dom for this encouraging comment! I agree with these remarks. I will indeed integrate the extensions scipp makes to Xarray.

Note: I am also looking for feedback regarding the analysis of tabular structures (e.g. to identify the hidden multidimensional structure): https://github.com/loco-philippe/tab-analysis/blob/main/docs/tabular_analysi.... Do you think this might be of interest to scipp or Xarray?
Dom Grigonis wrote:
Could be of interest to scipp. They have already put in work to link to https://www.nexusformat.org. Maybe they would be interested in JData as well.

Also, what could be interesting is a "format-independent standard": a general standard for data structures, adopted by different data formats, which are in turn implemented in different languages. Something like the SansIO concept in the world of protocols.

Benefits:

a) If a format obeys the standard, I could be certain that a directed graph has node, edge, and graph attributes.
b) Selling this to different libraries would be easier. NumPy would only need to implement 2 methods (to/from), which can then be used by different formats and libraries (see the sketch below).
c) It would be possible to skip the "format" step when converting between different libraries, e.g. xarray and scipp.
d) Finally, if I decide to make my own data class, say a graph, I only need 2 methods to be able to convert it to any other library.

I would happily be part of such a project.

Regards,
DG
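A rough sketch of this "two methods" idea: each container class implements only a pair of converters against a shared neutral structure. The method names and the neutral dict layout below are hypothetical, not taken from any of the specifications discussed in this thread.

```python
from dataclasses import dataclass

@dataclass
class MyGraph:
    """A user-defined data class, as in point (d) above."""
    nodes: list
    edges: list

    def to_neutral(self) -> dict:
        # Serialize into the (hypothetical) neutral structure.
        return {"kind": "graph", "nodes": self.nodes, "edges": self.edges}

    @classmethod
    def from_neutral(cls, d: dict) -> "MyGraph":
        # Rebuild the internal object from the neutral structure.
        return cls(nodes=d["nodes"], edges=d["edges"])

g = MyGraph(nodes=[0, 1, 2], edges=[(0, 1), (1, 2)])
assert MyGraph.from_neutral(g.to_neutral()) == g
```

Any library exposing the same pair of methods could then exchange data with any other, with no format-specific code in between.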
philippe@loco-labs.io wrote:
Hello, I created a first version of a neutral format for multi-dimensional data (https://nbviewer.org/github/loco-philippe/ntv-numpy/blob/main/example/exampl...) and made available a first version of a package (https://github.com/loco-philippe/ntv-numpy/blob/main/README.md) with:

- a reversible (lossless round-trip) Xarray interface,
- a reversible scipp interface,
- a reversible astropy.NDData interface,
- a reversible JSON interface (the general idea is sketched below).

The notebook above shows that, thanks to this neutral format, we can share any dataset with any tool. In a second version I will integrate the existing structure for tabular data (https://github.com/loco-philippe/ntv-pandas/blob/main/README.md) and the associated reversible interface.

If you have examples of other tools to integrate, or validation datasets, I'm interested! Have a nice day
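To make the "lossless round-trip" notion concrete, here is a minimal sketch written with plain json and NumPy. This is not the ntv-numpy API, just the underlying idea: dtype and shape must travel with the values for the reconstruction to be exact.

```python
import json

import numpy as np

a = np.linspace(0, 1, 6, dtype=np.float32).reshape(2, 3)

# Carry dtype and shape alongside the flattened values.
text = json.dumps({
    "dtype": a.dtype.str,
    "shape": a.shape,
    "data": a.ravel().tolist(),
})

# Reconstruct the array exactly, including its dtype.
d = json.loads(text)
b = np.array(d["data"], dtype=d["dtype"]).reshape(d["shape"])
assert np.array_equal(a, b) and a.dtype == b.dtype
```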
participants (5)

- Dom Grigonis
- fangqq@northeastern.edu
- Matti Picus
- philippe@loco-labs.io
- Ralf Gommers