
On Mon, 2020-03-23 at 18:23 +0100, Francesc Alted wrote: <snip>
If we were designing a new programming language around array computing principles, I do think that would be the approach I would want to take or at least consider. But I simply lack the vision of how marrying that idea with Python, which is a scalar language, would work out well...
I have had a glance at what you are after, and it seems challenging indeed. IMO, trying to cope with everybody's needs regarding data types is extremely costly (or, more simply, not even possible). I think a better approach would be to decouple the storage part of a container from its data type system. Into the storage layer should go the things needed for data retrieval, like the itemsize, the shape, or even sub-shapes for chunked datasets. Then, in the data type layer, one should be able to add meaning to the raw data: is it an integer? a speed? a temperature? a compound type?
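To make the split concrete, here is a minimal sketch of what a decoupled storage layer and type layer could look like; all names are hypothetical and not an existing NumPy or Blosc API:

```python
import numpy as np


class RawStorage:
    """Storage layer: only knows how to hold and retrieve bytes."""

    def __init__(self, itemsize, shape, chunkshape=None):
        self.itemsize = itemsize
        self.shape = shape
        self.chunkshape = chunkshape        # sub-shape for chunked datasets
        self.buffer = bytearray(itemsize * int(np.prod(shape)))


class TypeLayer:
    """Type layer: attaches meaning (integer, temperature, ...) to raw bytes."""

    def __init__(self, kind, numpy_dtype, unit=None):
        self.kind = kind                    # e.g. "integer", "temperature"
        self.numpy_dtype = np.dtype(numpy_dtype)
        self.unit = unit                    # e.g. "K" for temperatures

    def view(self, storage):
        # Interpret the raw buffer according to this type description.
        return np.frombuffer(storage.buffer, dtype=self.numpy_dtype).reshape(storage.shape)


# The same storage can be given different meanings by different type layers.
store = RawStorage(itemsize=8, shape=(3, 4))
temps = TypeLayer("temperature", "<f8", unit="K").view(store)
```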
I am struggling a bit to fully understand the lessons to learn. There seems to be some overlap between storage and DTypes? That is mainly `itemsize` and, more tricky, `is/has_object`, which is about how the data is stored but depends on which data is stored? In my current view these are part of the `dtype` instance; e.g. the class `np.dtype[np.string]` (a DTypeMeta instance) would have instances such as `np.dtype[np.string](length=5, byteorder="=")` (which is identical to `np.dtype("U5")`). Or is it that `np.ndarray` would actually use an `np.naivearray` internally, which is told the itemsize at construction time? In principle, the DType class could also be much more basic, and NumPy could subclass it (or something similar) to tag on the things it needs to use the DTypes efficiently (outside the computational engine/UFuncs, which cover a lot, but unfortunately I do not think everything).

- Sebastian
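To picture the class/instance relationship described above, here is a rough plain-Python model; the `np.dtype[np.string]` notation is only a proposal, so the sketch just mimics it with ordinary classes and purely illustrative names:

```python
import numpy as np


class DTypeMeta(type):
    """Metaclass: every DType class (e.g. the string DType) would be an instance of this."""


class StringDType(metaclass=DTypeMeta):
    """Plays the role of ``np.dtype[np.string]``: a class whose instances are dtypes."""

    def __init__(self, length, byteorder="="):
        # Storage-related details (itemsize, byte order) live on the *instance*.
        self.length = length
        self.byteorder = byteorder

    def to_descr(self):
        # The existing descriptor this instance corresponds to.
        return np.dtype(f"{self.byteorder}U{self.length}")


descr = StringDType(length=5).to_descr()
assert descr == np.dtype("U5")
```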
Indeed, the data storage layer should be able to provide a way to store the data type representation so that a container can be serialized and deserialized correctly. But the important thing here is that this decoupling between storage and types allows for different data type systems, so that anyone can come up with a specific type system depending on her needs. One can even envision a basic data type system (e.g. a version of what is now supported in NumPy) that can be extended with other layers as needed, so that every community can interchange data at least at a basic level.
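A toy round-trip along these lines, assuming only that the storage layer keeps the data type representation next to the raw bytes (the helper names are made up for illustration):

```python
import json
import numpy as np


def serialize(arr):
    # Storage layer: raw bytes plus enough metadata to rebuild the container,
    # including a string form of whatever type system produced the data.
    header = {"dtype": arr.dtype.str, "shape": arr.shape}
    return json.dumps(header).encode() + b"\n" + arr.tobytes()


def deserialize(blob):
    header_line, raw = blob.split(b"\n", 1)
    header = json.loads(header_line)
    return np.frombuffer(raw, dtype=header["dtype"]).reshape(header["shape"])


a = np.arange(6, dtype="<i4").reshape(2, 3)
assert np.array_equal(deserialize(serialize(a)), a)
```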
As an example, this is the basic spirit behind the under-construction Caterva array container (https://github.com/Blosc/Caterva). Blosc2 (https://github.com/Blosc/C-Blosc2) will be providing the low-level storage layer, with no information about dimensionality. Caterva will be building the multidimensional layer, but with no information about the types at all. On top of this scaffold, third-party layers will be free to build their own data types, specific to every domain (the concept is illustrated in slide 18 of this presentation: https://blosc.org/docs/Caterva-HDF5-Workshop.pdf). There is nothing to prevent adding more layers, or even a two-layer (and preferably no more than two-level) data type system: one for simple data types (e.g. the NumPy ones) and one meant to be more domain-specific.
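A very small illustration of that two-layer idea: a basic layer that only knows NumPy-like types, and a domain layer that adds meaning on top. This is not Caterva's or Blosc2's API, just hypothetical classes:

```python
import numpy as np


class BasicType:
    """Layer 1: plain storage-level types, roughly what NumPy supports today."""

    def __init__(self, descr):
        self.descr = np.dtype(descr)


class DomainType(BasicType):
    """Layer 2: a domain-specific type that reuses the basic layer underneath."""

    def __init__(self, descr, quantity, unit):
        super().__init__(descr)
        self.quantity = quantity    # e.g. "wind_speed"
        self.unit = unit            # e.g. "m/s"


# Any community can interchange data at the basic level (layer 1) even if it
# does not understand the domain layer (layer 2).
wind = DomainType("<f4", quantity="wind_speed", unit="m/s")
print(wind.descr, wind.quantity, wind.unit)
```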
Right now, I think the limitation that keeps the NumPy community thinking in terms of blending storage and types in the same layer is that NumPy provides a computational engine too, and for doing computations one indeed needs both storage (including dimensionality info) and type information. With the multi-layer approach, there would be a computational layer sitting on top of the storage and type layers, and hence specific to leveraging them. Sure, that is a big departure from what we are used to, but as long as one can keep the architecture of the different layers simple, one could see interesting results in not that long a time.
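As a rough sketch of such a computational layer, here is a function that asks the type layer how to interpret the bytes coming from the storage layer and then computes on them; all names are invented for illustration:

```python
import numpy as np


def mean(storage_bytes, shape, type_descr):
    """Computational layer: combines storage info (bytes, shape) with type info."""
    data = np.frombuffer(storage_bytes, dtype=type_descr).reshape(shape)
    return data.mean()


raw = np.arange(12, dtype="<f8").tobytes()   # pretend this came from the storage layer
print(mean(raw, shape=(3, 4), type_descr="<f8"))
```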
Just my 2 cents, Francesc
<snip>