<div dir="ltr"><div dir="ltr"><div dir="ltr"><div class="gmail_default" style="font-family:arial,helvetica,sans-serif;font-size:small"><br></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Mon, Mar 23, 2020 at 9:49 PM Sebastian Berg &lt;<a href="mailto:sebastian@sipsolutions.net">sebastian@sipsolutions.net</a>&gt; wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-style:solid;border-left-color:rgb(204,204,204);padding-left:1ex">On Mon, 2020-03-23 at 18:23 +0100, Francesc Alted wrote:<br>

&lt;snip&gt;<br>

> > If we were designing a new programming language around array<br>

> > computing<br>

> > principles, I do think that would be the approach I would want to<br>

> > take/consider. But I simply lack the vision of how marrying the<br>

> > idea<br>

> > with the scalar language Python would work out well...<br>

> > <br>

> <br>

> I have had a glance at what you are after, and it seems challenging<br>

> indeed.  IMO, trying to cope with everybody's needs regarding data<br>

> types is<br>

> extremely costly (or more simply, not even possible).  I think that a<br>

> better approach would be to decouple the storage part of a container<br>

> from<br>

> its data type system.  In the storage there should go things needed<br>

> to cope<br>

> with the data retrieval, like the itemsize, the shape, or even other<br>

> sub-shapes for chunked datasets.  Then, in the data type layer, one<br>

> should<br>

> be able to add meaning to the raw data: is that an integer?  speed?<br>

> temperature?  a compound type?<br>

> <br>

<br>

I am struggling a bit to fully understand the lessons to learn.<br>

<br>

There seems to be some overlap between storage and DTypes? That is mainly<br>

`itemsize` and, more trickily, `is/has_object`, which is about how the data<br>

is stored but depends on which data is stored?<br>

In my current view these are part of the `dtype` instance, e.g. the<br>

class `np.dtype[np.string]` (a DTypeMeta instance), will have<br>

instances: `np.dtype[np.string](length=5, byteorder="=")` (which is<br>

identical to `np.dtype("U5")`). </blockquote><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-style:solid;border-left-color:rgb(204,204,204);padding-left:1ex">

<br>

Or is it that `np.ndarray` would actually use an `np.naivearray`<br>

internally, which is told the itemsize at construction time?<br>

In principle, the DType class could also be much more basic, and NumPy<br>

could subclass it (or something similar) to tag on the things it needs<br>

to efficiently use the DTypes (outside the computational engine/UFuncs,<br>

which cover a lot, but unfortunately I do not think everything).<br></blockquote><div><br></div><div><div class="gmail_default" style="font-family:arial,helvetica,sans-serif;font-size:small">What I am trying to say is that NumPy should be rather agnostic about providing data types beyond the relatively simple set that it already supports.  I am suggesting that focusing on a way to store user-defined data types (or any other kind of metadata), not only in memory but also in persisted arrays via .npy/.npz files, and letting third-party libraries use this machinery to serialize/deserialize them, might be a better use of resources.</div><div class="gmail_default" style="font-family:arial,helvetica,sans-serif;font-size:small"><br></div><div class="gmail_default" style="font-family:arial,helvetica,sans-serif;font-size:small">I am envisioning making life easier for libraries such as xarray, which already extends NumPy in a number of ways, and which can use computational kernels other than NumPy itself (dask, probably numba too) to implement functionality not present in NumPy.  Allowing an easy way to serialize library-defined data types would open the door to using NumPy itself as a storage layer for persistence too, providing an important complement to the NetCDF or zarr formats (remember that every format comes with its own pros and cons).</div><div class="gmail_default" style="font-family:arial,helvetica,sans-serif;font-size:small"><br></div><div class="gmail_default" style="font-family:arial,helvetica,sans-serif;font-size:small">But xarray is just an example; why not think of other kinds of libraries that would provide their own types, leveraging NumPy for storage and e.g. numba for building a library of efficient functions specific to the new types?  
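<div class="gmail_default" style="font-family:arial,helvetica,sans-serif;font-size:small"><br></div><div class="gmail_default" style="font-family:arial,helvetica,sans-serif;font-size:small">As a rough sketch of that idea (not an existing NumPy feature): a third-party library could already keep the raw values in a plain NumPy dtype and put its own type description next to them in the same .npz file.  The helper names `save_typed` and `load_typed`, the `type_info` key and its JSON layout are made up here purely for illustration.</div>
<pre>
```python
import io
import json

import numpy as np


def save_typed(file, values, type_info):
    """Store raw values with a plain NumPy dtype plus a JSON type description."""
    raw = np.asarray(values)
    # The JSON text is kept as a 0-d unicode array, so no pickling is needed.
    np.savez(file, data=raw, type_info=np.array(json.dumps(type_info)))


def load_typed(file):
    """Return the raw array and the type description stored next to it."""
    with np.load(file) as npz:
        raw = npz["data"]
        type_info = json.loads(npz["type_info"].item())
    return raw, type_info


# Hypothetical domain type: temperatures in kelvin stored as float64.
buffer = io.BytesIO()
save_typed(buffer, [273.15, 293.15], {"kind": "temperature", "unit": "K"})
buffer.seek(0)

raw, type_info = load_typed(buffer)
print(raw.dtype, type_info)
```
</pre>
<div class="gmail_default" style="font-family:arial,helvetica,sans-serif;font-size:small"><br></div>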
If done properly, these datasets can still be shared efficiently with other libraries, as long as NumPy's existing basic data type system is used to access them.</div><div class="gmail_default" style="font-family:arial,helvetica,sans-serif;font-size:small"><br></div><div class="gmail_default" style="font-family:arial,helvetica,sans-serif;font-size:small">Cheers,</div><div class="gmail_default" style="font-family:arial,helvetica,sans-serif;font-size:small">Francesc</div><div class="gmail_default" style="font-family:arial,helvetica,sans-serif;font-size:small"><br></div><div class="gmail_default" style="font-family:arial,helvetica,sans-serif;font-size:small"><br></div><br></div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-style:solid;border-left-color:rgb(204,204,204);padding-left:1ex">

<br>

- Sebastian<br>

<br>

<br>

> Indeed the data storage layer should be able to provide a way to<br>

> store the<br>

> data type representation so that a container can be serialized and<br>

> deserialized correctly.  But the important thing here is that this<br>

> decoupling between storage and types allows for different data type<br>

> systems, so that anyone can come up with a specific type system<br>

> depending on<br>

> her needs.  One can envision here even a basic data type system (e.g.<br>

> a<br>

> version of what's now supported in NumPy) that can be extended with<br>

> other<br>

> layers, depending on the needs, so that every community can<br>

> interchange<br>

> data at a basic level at least.<br>

> <br>

> As an example, this is the basic spirit behind the under-construction<br>

> Caterva array container (<a href="https://github.com/Blosc/Caterva" rel="noreferrer" target="_blank">https://github.com/Blosc/Caterva</a>).  Blosc2 (<br>

> <a href="https://github.com/Blosc/C-Blosc2" rel="noreferrer" target="_blank">https://github.com/Blosc/C-Blosc2</a>) will be providing the low-level<br>

> storage<br>

> layer, with no information about dimensionality.  Caterva will be<br>

> building<br>

> the multidimensional layer, but with no information about the types<br>

> at<br>

> all.  On top of this scaffold, third-party layers will be free to<br>

> build<br>

> their own data types, specific for every domain (the concept is<br>

> illustrated in<br>

> slide 18 of this presentation:<br>

> <a href="https://blosc.org/docs/Caterva-HDF5-Workshop.pdf" rel="noreferrer" target="_blank">https://blosc.org/docs/Caterva-HDF5-Workshop.pdf</a>).  There is nothing<br>

> to<br>

> prevent adding more layers, or even a two-layer (and preferably no<br>

> more<br>

> than two-level) data type system: one for simple data types (e.g.<br>

> NumPy<br>

> ones) and one meant to be more domain-specific.<br>

> <br>

> Right now, I think that the limitation that is keeping the NumPy<br>

> community<br>

> thinking in terms of blending storage and types in the same layer is<br>

> that<br>

> NumPy is providing a computational engine too, and for doing<br>

> computations,<br>

> one needs to provide both storage (including dimensionality info) and<br>

> type<br>

> information indeed.  By using the multi-layer approach, there should<br>

> be a<br>

> computational layer that is laid out on top of the storage and the<br>

> type<br>

> layers, and hence, specific for leveraging them.  Sure, that is a big<br>

> departure from what we are used to, but as long as one can keep the<br>

> architecture of the different layers simple, one could see<br>

> interesting<br>

> results in a fairly short time.<br>

> <br>

> Just my 2 cents,<br>

> Francesc<br>

> <br>

> <br>

&lt;snip&gt;<br>


</blockquote></div><br clear="all"><div><br></div>-- <br><div dir="ltr" class="gmail_signature">Francesc Alted</div></div></div>