
On Mon, 2020-03-23 at 18:23 +0100, Francesc Alted wrote: <snip>
If we were designing a new programming language around array computing principles, I do think that would be the approach I would want to take or at least consider. But I simply lack the vision of how marrying that idea with Python, which is a scalar language, would work out well...
I have had a glance at what you are after, and it seems challenging indeed. IMO, trying to cope with everybody's needs regarding data types is extremely costly (or, more simply, not even possible). I think a better approach would be to decouple the storage part of a container from its data type system. Into the storage layer should go the things needed for data retrieval, like the itemsize, the shape, or even sub-shapes for chunked datasets. Then, in the data type layer, one should be able to add meaning to the raw data: is it an integer? a speed? a temperature? a compound type?
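To make the split concrete, here is a minimal sketch of what a decoupled storage layer and type layer could look like; all names are hypothetical and not an existing NumPy or Blosc API:

```python
import numpy as np


class RawStorage:
    """Storage layer: only knows how to hold and retrieve bytes."""

    def __init__(self, itemsize, shape, chunkshape=None):
        self.itemsize = itemsize
        self.shape = shape
        self.chunkshape = chunkshape        # sub-shape for chunked datasets
        self.buffer = bytearray(itemsize * int(np.prod(shape)))


class TypeLayer:
    """Type layer: attaches meaning (integer, temperature, ...) to raw bytes."""

    def __init__(self, kind, numpy_dtype, unit=None):
        self.kind = kind                    # e.g. "integer", "temperature"
        self.numpy_dtype = np.dtype(numpy_dtype)
        self.unit = unit                    # e.g. "K" for temperatures

    def view(self, storage):
        # Interpret the raw buffer according to this type description.
        return np.frombuffer(storage.buffer, dtype=self.numpy_dtype).reshape(storage.shape)


# The same storage can be given different meanings by different type layers.
store = RawStorage(itemsize=8, shape=(3, 4))
temps = TypeLayer("temperature", "<f8", unit="K").view(store)
```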
I am struggling a bit to fully understand the lessons to learn. There seems to be some overlap between storage and DTypes? That is mainly `itemsize` and, more tricky, `is/has_object`, which is about how the data is stored but depends on which data is stored? In my current view these are part of the `dtype` instance; e.g. the class `np.dtype[np.string]` (a DTypeMeta instance) would have instances such as `np.dtype[np.string](length=5, byteorder="=")` (which is identical to `np.dtype("U5")`). Or is it that `np.ndarray` would actually use an `np.naivearray` internally, which is told the itemsize at construction time? In principle, the DType class could also be much more basic, and NumPy could subclass it (or something similar) to tag on the things it needs to use the DTypes efficiently (outside the computational engine/UFuncs, which cover a lot, but unfortunately I do not think everything).

- Sebastian
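To picture the class/instance relationship described above, here is a rough plain-Python model; the `np.dtype[np.string]` notation is only a proposal, so the sketch just mimics it with ordinary classes and purely illustrative names:

```python
import numpy as np


class DTypeMeta(type):
    """Metaclass: every DType class (e.g. the string DType) would be an instance of this."""


class StringDType(metaclass=DTypeMeta):
    """Plays the role of ``np.dtype[np.string]``: a class whose instances are dtypes."""

    def __init__(self, length, byteorder="="):
        # Storage-related details (itemsize, byte order) live on the *instance*.
        self.length = length
        self.byteorder = byteorder

    def to_descr(self):
        # The existing descriptor this instance corresponds to.
        return np.dtype(f"{self.byteorder}U{self.length}")


descr = StringDType(length=5).to_descr()
assert descr == np.dtype("U5")
```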
Indeed, the data storage layer should be able to provide a way to store the data type representation so that a container can be serialized and deserialized correctly. But the important thing here is that this decoupling between storage and types allows for different data type systems, so that anyone can come up with a specific type system depending on her needs. One can even envision a basic data type system (e.g. a version of what is now supported in NumPy) that can be extended with other layers as needed, so that every community can interchange data at least at a basic level.
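A toy round-trip along these lines, assuming only that the storage layer keeps the data type representation next to the raw bytes (the helper names are made up for illustration):

```python
import json
import numpy as np


def serialize(arr):
    # Storage layer: raw bytes plus enough metadata to rebuild the container,
    # including a string form of whatever type system produced the data.
    header = {"dtype": arr.dtype.str, "shape": arr.shape}
    return json.dumps(header).encode() + b"\n" + arr.tobytes()


def deserialize(blob):
    header_line, raw = blob.split(b"\n", 1)
    header = json.loads(header_line)
    return np.frombuffer(raw, dtype=header["dtype"]).reshape(header["shape"])


a = np.arange(6, dtype="<i4").reshape(2, 3)
assert np.array_equal(deserialize(serialize(a)), a)
```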
As an example, this is the basic spirit behind the under-construction Caterva array container (https://github.com/Blosc/Caterva). Blosc2 (https://github.com/Blosc/C-Blosc2) will be providing the low-level storage layer, with no information about dimensionality. Caterva will be building the multidimensional layer, but with no information about the types at all. On top of this scaffold, third-party layers will be free to build their own data types, specific to every domain (the concept is illustrated in slide 18 of this presentation: https://blosc.org/docs/Caterva-HDF5-Workshop.pdf). There is nothing to prevent adding more layers, or even a two-layer (and preferably no more than two-level) data type system: one for simple data types (e.g. the NumPy ones) and one meant to be more domain-specific.
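A very small illustration of that two-layer idea: a basic layer that only knows NumPy-like types, and a domain layer that adds meaning on top. This is not Caterva's or Blosc2's API, just hypothetical classes:

```python
import numpy as np


class BasicType:
    """Layer 1: plain storage-level types, roughly what NumPy supports today."""

    def __init__(self, descr):
        self.descr = np.dtype(descr)


class DomainType(BasicType):
    """Layer 2: a domain-specific type that reuses the basic layer underneath."""

    def __init__(self, descr, quantity, unit):
        super().__init__(descr)
        self.quantity = quantity    # e.g. "wind_speed"
        self.unit = unit            # e.g. "m/s"


# Any community can interchange data at the basic level (layer 1) even if it
# does not understand the domain layer (layer 2).
wind = DomainType("<f4", quantity="wind_speed", unit="m/s")
print(wind.descr, wind.quantity, wind.unit)
```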
Right now, I think the limitation that keeps the NumPy community thinking in terms of blending storage and types in the same layer is that NumPy provides a computational engine too, and for doing computations one indeed needs both storage (including dimensionality info) and type information. With the multi-layer approach, there would be a computational layer sitting on top of the storage and type layers, and hence specific to leveraging them. Sure, that is a big departure from what we are used to, but as long as one can keep the architecture of the different layers simple, one could see interesting results in not that long a time.
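As a rough sketch of such a computational layer, here is a function that asks the type layer how to interpret the bytes coming from the storage layer and then computes on them; all names are invented for illustration:

```python
import numpy as np


def mean(storage_bytes, shape, type_descr):
    """Computational layer: combines storage info (bytes, shape) with type info."""
    data = np.frombuffer(storage_bytes, dtype=type_descr).reshape(shape)
    return data.mean()


raw = np.arange(12, dtype="<f8").tobytes()   # pretend this came from the storage layer
print(mean(raw, shape=(3, 4), type_descr="<f8"))
```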
Just my 2 cents, Francesc
<snip>