[Numpy-discussion] Proposal: NEP 41 -- First step towards a new Datatype System
Sebastian Berg
sebastian at sipsolutions.net
Mon Mar 23 16:47:39 EDT 2020
On Mon, 2020-03-23 at 18:23 +0100, Francesc Alted wrote:
<snip>
> > If we were designing a new programming language around array computing
> > principles, I do think that would be the approach I would want to
> > take/consider. But I simply lack the vision of how marrying the idea
> > with the scalar language Python would work out well...
> >
>
> I have had a glance at what you are after, and it seems challenging
> indeed. IMO, trying to cope with everybody's need regarding data types is
> extremely costly (or more simply, not even possible). I think that a
> better approach would be to decouple the storage part of a container from
> its data type system. In the storage there should go things needed to cope
> with the data retrieval, like the itemsize, the shape, or even other
> sub-shapes for chunked datasets. Then, in the data type layer, one should
> be able to add meaning to the raw data: is that an integer? speed?
> temperature? a compound type?
>
I am struggling a bit to fully understand the lessons to learn here.
There seems to be some overlap between storage and DTypes? That is mainly
`itemsize` and, more trickily, `is/has_object`, which is about how the
data is stored but depends on which data is stored.
In my current view these are part of the `dtype` instance, e.g. the
class `np.dtype[np.string]` (a DTypeMeta instance), will have
instances: `np.dtype[np.string](length=5, byteorder="=")` (which is
identical to `np.dtype("U5")`).
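
To make the class/instance split concrete, here is a minimal pure-Python
sketch. It is only an illustration under stated assumptions: the names
`DTypeMeta` and `String` below are hypothetical stand-ins, not NumPy's
actual implementation, and the 4 bytes per character mirrors how the
existing "U" dtype stores UCS4 code points.

```python
# Hypothetical sketch: none of these classes are NumPy's real implementation.
class DTypeMeta(type):
    """Metaclass: every DType *class* is an instance of DTypeMeta."""


class String(metaclass=DTypeMeta):
    """A DType class; its instances carry the storage-related parameters."""

    CHAR_SIZE = 4  # assumption: UCS4 code points, as for NumPy's "U" dtype

    def __init__(self, length, byteorder="="):
        self.length = length
        self.byteorder = byteorder

    @property
    def itemsize(self):
        # The itemsize belongs to the instance, not to the class.
        return self.length * self.CHAR_SIZE

    def __repr__(self):
        return f"String(length={self.length}, byteorder={self.byteorder!r})"


u5 = String(length=5, byteorder="=")
print(type(String).__name__)  # the class itself is a DTypeMeta instance
print(u5, u5.itemsize)
```

In this sketch the class says "this is a string" while the instance
answers the storage question "how many bytes per item", which is the
overlap mentioned above.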
Or is it that `np.ndarray` would actually use an `np.naivearray`
internally, which is told the itemsize at construction time?
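
A rough sketch of that alternative, again in pure Python and only as an
illustration: `NaiveArray`, `Float64` and `TypedArray` are hypothetical
names (there is no `np.naivearray` in NumPy), and the row-major layout and
little-endian doubles are assumptions made for the example.

```python
import struct


class NaiveArray:
    """Storage only: raw bytes, an itemsize and a shape; no meaning attached."""

    def __init__(self, shape, itemsize):
        self.shape = tuple(shape)
        self.itemsize = itemsize
        count = 1
        for n in self.shape:
            count *= n
        self.buffer = bytearray(count * itemsize)

    def _offset(self, index):
        # Byte offset of one item, assuming row-major (C order) layout.
        if len(index) != len(self.shape):
            raise IndexError("wrong number of indices")
        flat = 0
        for i, n in zip(index, self.shape):
            if not 0 <= i < n:
                raise IndexError(index)
            flat = flat * n + i
        return flat * self.itemsize

    def get_raw(self, index):
        start = self._offset(index)
        return bytes(self.buffer[start:start + self.itemsize])

    def set_raw(self, index, raw):
        if len(raw) != self.itemsize:
            raise ValueError("raw item has the wrong size")
        start = self._offset(index)
        self.buffer[start:start + self.itemsize] = raw


class Float64:
    """Type layer: gives meaning to 8 raw bytes (little-endian IEEE double)."""

    itemsize = 8

    @staticmethod
    def pack(value):
        return struct.pack("<d", value)

    @staticmethod
    def unpack(raw):
        return struct.unpack("<d", raw)[0]


class TypedArray:
    """Tells the storage its itemsize at construction, then interprets items."""

    def __init__(self, shape, dtype):
        self.dtype = dtype
        self.storage = NaiveArray(shape, dtype.itemsize)

    def __getitem__(self, index):
        return self.dtype.unpack(self.storage.get_raw(index))

    def __setitem__(self, index, value):
        self.storage.set_raw(index, self.dtype.pack(value))


arr = TypedArray((2, 3), Float64)
arr[1, 2] = 36.6
print(arr[1, 2], len(arr.storage.buffer))
```

Note that even here the storage has to be handed the itemsize by the type
layer, and something like `is/has_object` would need a similar channel.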
In principle, the DType class could also be much more basic, and NumPy
could subclass it (or something similar) to tag on the things it needs
to efficiently use the DTypes (outside the computational engine/UFuncs,
which cover a lot, but unfortunately I do not think everything).
- Sebastian
> Indeed the data storage layer should be able to provide a way to store the
> data type representation so that a container can be serialized and
> deserialized correctly. But the important thing here is that this
> decoupling between storage and types allows for different data type
> systems, so that anyone can come with a specific type system depending on
> her needs. One can envision here even a basic data type system (e.g. a
> version of what's now supported in NumPy) that can be extended with other
> layers, depending on the needs, so that every community can interchange
> data at a basic level at least.
>
> As an example, this is the basic spirit behind the under-construction
> Caterva array container (https://github.com/Blosc/Caterva). Blosc2
> (https://github.com/Blosc/C-Blosc2) will be providing the low-level storage
> layer, with no information about dimensionality. Caterva will be building
> the multidimensional layer, but with no information about the types at
> all. On top of this scaffold, third-party layers will be free to build
> their own data dtypes, specific for every domain (the concept is imaged in
> slide 18 of this presentation:
> https://blosc.org/docs/Caterva-HDF5-Workshop.pdf). There is nothing to
> prevent to add more layers, or even a two-layer (and preferably no more
> than two-level) data type system: one for simple data types (e.g. NumPy
> ones) and one meant to be more domain-specific.
>
> Right now, I think that the limitation that is keeping the NumPy community
> thinking in terms of blending storage and types in the same layer is that
> NumPy is providing a computational engine too, and for doing computations,
> one need to provide both storage (including dimensionality info) and type
> information indeed. By using the multi-layer approach, there should be a
> computational layer that is laid out on top of the storage and the type
> layers, and hence, specific for leveraging them. Sure, that is a big
> departure from what we are used to, but as long as one can keep the
> architecture of the different layers simple, one could see interesting
> results in not that long time.
>
> Just my 2 cents,
> Francesc
>
>
<snip>