[Numpy-discussion] Proposal: NEP 41 -- First step towards a new Datatype System
faltet at gmail.com
Tue Mar 24 05:48:25 EDT 2020
On Mon, Mar 23, 2020 at 9:49 PM Sebastian Berg <sebastian at sipsolutions.net> wrote:
> On Mon, 2020-03-23 at 18:23 +0100, Francesc Alted wrote:
> > > If we were designing a new programming language around array
> > > computing
> > > principles, I do think that would be the approach I would want to
> > > take/consider. But I simply lack the vision of how marrying the
> > > idea
> > > with the scalar language Python would work out well...
> > >
> > I have had a glance at what you are after, and it seems challenging
> > indeed. IMO, trying to cope with everybody's needs regarding data
> > types is extremely costly (or, more simply, not even possible). I
> > think a better approach would be to decouple the storage part of a
> > container from its data type system. The storage layer should hold
> > what is needed for data retrieval, like the itemsize, the shape, or
> > even other sub-shapes for chunked datasets. Then, in the data type
> > layer, one should be able to add meaning to the raw data: is it an
> > integer? a speed? a temperature? a compound type?
> I am struggling a bit to fully understand the lessons to learn.
> There seems to be some overlap between storage and DTypes? That is
> mainly `itemsize` and, more tricky, `is/has_object`, which concern how
> the data is stored but depend on which data is stored?
> In my current view these are part of the `dtype` instance: e.g. the
> class `np.dtype[np.string]` (a DTypeMeta instance) will have
> instances such as `np.dtype[np.string](length=5, byteorder="=")`
> (which is identical to `np.dtype("U5")`).
> Or is it that `np.ndarray` would actually use an `np.naivearray`
> internally, which is told the itemsize at construction time?
> In principle, the DType class could also be much more basic, and NumPy
> could subclass it (or something similar) to tag on the things it needs
> to efficiently use the DTypes (outside the computational engine/UFuncs,
> which cover a lot, but unfortunately I do not think everything).
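The class/instance split described above can be sketched with a toy metaclass. This is purely illustrative (none of `DTypeMeta`, `StringDType`, or their parameters here are NumPy API): the DType *class* carries the abstract kind, while each *instance* carries storage parameters such as the itemsize and byte order, just as `np.dtype("U5")` does.

```python
# Toy illustration (not NumPy API) of the proposed class/instance split:
# the DType class is itself an instance of a metaclass (cf. DTypeMeta),
# while dtype instances hold concrete storage parameters.
class DTypeMeta(type):
    """Stands in for the proposed DTypeMeta."""

class StringDType(metaclass=DTypeMeta):
    def __init__(self, length, byteorder="="):
        self.length = length            # storage parameter, as in "U5"
        self.byteorder = byteorder
        self.itemsize = 4 * length      # UCS4 chars, like NumPy's unicode dtype

u5 = StringDType(length=5)
assert isinstance(type(u5), DTypeMeta)  # the class is a DTypeMeta instance
assert u5.itemsize == 20                # matches np.dtype("U5").itemsize
```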
What I am trying to say is that NumPy should be rather agnostic about
providing data types beyond the relatively simple set it already
supports. I am suggesting that focusing on a way to allow the storage
(not only in-memory, but also persisted arrays via .npy/.npz files) of
user-defined data types (or any other kind of metadata), and letting
3rd-party libraries use this machinery to serialize/deserialize them,
might be a better use of resources.
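A small taste of this already exists today: NumPy dtypes accept an opaque `metadata` mapping, which a third-party library could use to tag raw storage with domain-specific meaning (the "unit" key below is just an example tag, not a NumPy convention). Persisting and interpreting such tags would be up to the library.

```python
import numpy as np

# NumPy dtypes can carry an opaque metadata mapping; a library can use it
# to layer meaning (here, a hypothetical physical unit) on raw storage.
speed = np.dtype(np.float64, metadata={"unit": "m/s"})
arr = np.zeros(3, dtype=speed)

# The tag travels with the dtype in memory.
print(arr.dtype.metadata["unit"])  # -> m/s
```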
I am envisioning making life easier for libraries like e.g. xarray,
which already extends NumPy in a number of ways, and which can use
computational kernels other than NumPy itself (dask, probably numba
too) to implement functionality not present in NumPy. Allowing an easy
way to serialize library-defined data types would open the door to
using NumPy itself as a storage layer for persistence too, bringing an
important complement to the NetCDF or zarr formats (remember that every
format comes with its own pros and cons).
But xarray is just an example; why not think of other kinds of
libraries that would provide their own types, leveraging NumPy for
storage and e.g. numba for building a library of efficient functions
specific to the new types? If done properly, these datasets could still
be shared efficiently with other libraries, as long as the basic data
type system existing in NumPy is used to access them.
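The storage/type decoupling being proposed can be mocked up with today's tools. In this sketch (the descriptor schema is hypothetical, not part of any library), NumPy's .npy format handles the storage layer, while a library-owned JSON descriptor, serialized independently, adds the domain meaning:

```python
import io
import json
import numpy as np

# Storage layer: raw bytes + shape + itemsize, handled by the .npy format.
raw = np.arange(6, dtype=np.int64)
buf = io.BytesIO()
np.save(buf, raw)

# Type layer: meaning added by a (hypothetical) library-owned descriptor,
# serialized separately from the storage.
type_descriptor = {"kind": "temperature", "unit": "K", "base": str(raw.dtype)}
meta = json.dumps(type_descriptor)

# A consumer restores storage and type information independently.
buf.seek(0)
restored = np.load(buf)
desc = json.loads(meta)
assert str(restored.dtype) == desc["base"]
assert (restored == raw).all()
```

The point of the sketch is that the two layers round-trip independently: another library could read the same .npy payload with no knowledge of the descriptor, still sharing data at the basic level.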
> - Sebastian
> > Indeed, the data storage layer should provide a way to store the
> > data type representation so that a container can be serialized and
> > deserialized correctly. But the important thing here is that this
> > decoupling between storage and types allows for different data type
> > systems, so that anyone can come up with a specific type system for
> > their needs. One can even envision a basic data type system (e.g. a
> > version of what is now supported in NumPy) that can be extended with
> > other layers, depending on the needs, so that every community can at
> > least interchange data at a basic level.
> > As an example, this is the basic spirit behind the under-construction
> > Caterva array container (https://github.com/Blosc/Caterva). Blosc2 (
> > https://github.com/Blosc/C-Blosc2) will provide the low-level storage
> > layer, with no information about dimensionality. Caterva will build
> > the multidimensional layer, but with no information about the types
> > at all. On top of this scaffold, third-party layers will be free to
> > build their own dtypes, specific to every domain (the concept is
> > illustrated in slide 18 of this presentation:
> > https://blosc.org/docs/Caterva-HDF5-Workshop.pdf). There is nothing
> > to prevent adding more layers, or even a two-layer (and preferably
> > no more than two-layer) data type system: one for simple data types
> > (e.g. the NumPy ones) and one meant to be more domain-specific.
> > Right now, I think the limitation keeping the NumPy community
> > thinking in terms of blending storage and types in the same layer is
> > that NumPy provides a computational engine too, and for doing
> > computations one indeed needs to provide both storage (including
> > dimensionality info) and type information. With the multi-layer
> > approach, there would be a computational layer laid out on top of the
> > storage and type layers, and hence specific for leveraging them.
> > Sure, that is a big departure from what we are used to, but as long
> > as one can keep the architecture of the different layers simple, one
> > could see interesting results in not that long a time.
> > Just my 2 cents,
> > Francesc
> NumPy-Discussion mailing list
> NumPy-Discussion at python.org