[Numpy-discussion] New DTypes: Are scalars a central concept in NumPy or not?

Sebastian Berg sebastian at sipsolutions.net
Sun Feb 23 16:56:55 EST 2020


On Sat, 2020-02-22 at 13:28 -0800, Nathaniel Smith wrote:
> Off the cuff, my intuition is that dtypes will want to be able to
> define how scalar indexing works, and let it return objects other
> than
> arrays. So e.g.:
> 
> - some dtypes might just return a zero-d array
> - some dtypes might want to return some arbitrary domain-appropriate
> type, like a datetime dtype might want to return datetime.datetime
> objects (like how dtype(object) works now)
> - some dtypes might want to go to all the trouble to define immutable
> duck-array "scalar" types (like how dtype(float) and friends work
> now)

Right, my assumption is that whatever we suggest is going to be what
most will choose, so we have the chance to move in a certain direction
and set a standard. This is to make code which may or may not deal with
0-D arrays more reliable (more below).

> 
> But I don't think we need to give that last case any special
> privileges in the dtype system. For example, I don't think we need to
> mandate that everyone who defines their own dtype MUST also implement
> a custom duck-array type to act as the scalars, or build a whole
> complex system to auto-generate such types given an arbitrary
> user-defined dtype.

(Note that "auto-generating" would be nothing more than a read-only
0-D array which does not implement indexing.)


There are also categoricals, for which the scalar type may just be
"object" in practice (you could define it more precisely, but that
seems unlikely to be useful). And for simple numerical types, if we go
the `.item()` path, it is arguably fine if the type is just a Python
type.

Maybe the crux of the problem is actually that, in general,
`np.asarray(arr1d[0])` does not round-trip for the current object
dtype, and only partially for a categorical as above.
That is fine in itself, but right now it is hard to tell when you will
have a scalar and when a 0-D array.
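A small illustration of the non-round-trip with today's object dtype
(current behavior, nothing new assumed):

```python
import numpy as np

# Build an object array whose elements are Python lists.
arr1d = np.empty(2, dtype=object)
arr1d[0] = [1, 2]
arr1d[1] = [3, 4]

elem = arr1d[0]    # indexing hands back the stored Python object
print(type(elem))  # <class 'list'>

# Re-wrapping does not round-trip: np.asarray treats the list as a
# sequence and builds a (2,)-shaped numeric array, not a 0-D object
# array holding the original list.
back = np.asarray(elem)
print(back.shape, back.dtype == object)  # (2,) False
```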

Maybe it is better to talk about a potentially new `np.pyobject[type]`
datatype (i.e. an object datatype with all elements having the same
python type).
Currently, writing generic code with the object dtype is tricky,
because we unpredictably return the object itself instead of an array.
What would be the preference for such a specific dtype?

   * arr1d[0] -> scalar or array?
   * np.add(scalar, scalar) -> scalar or array?
   * np.add.reduce(arr) -> scalar or array?
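For comparison, the current object dtype answers two of these as
follows (existing behavior, shown only for context):

```python
import numpy as np

arr = np.array([1, 2, 3], dtype=object)

# Indexing an object array currently returns the stored Python object:
print(type(arr[0]))        # <class 'int'>

# Reductions over object arrays also hand back the Python object,
# not a 0-D array or a NumPy scalar:
total = np.add.reduce(arr)
print(type(total), total)  # <class 'int'> 6
```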

I think we can decide the `np.add` case fairly independently. The main
question is indexing: would we want to force a `.item()` call or not?
Forcing `.item()` is in many ways simpler; I am unsure whether it
would often be inconvenient.
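As a sketch of what the forced-`.item()` style would look like for a
simple numeric dtype (using only current NumPy behavior):

```python
import numpy as np

arr = np.arange(3.0)

# Today, indexing gives an immutable np.float64 scalar.
x = arr[0]
print(type(x))              # <class 'numpy.float64'>

# A forced .item() call instead yields a plain Python float. For a
# simple numeric type this round-trips cleanly, since float maps
# back onto a well-defined NumPy dtype:
y = arr[0].item()
print(type(y))              # <class 'float'>
print(np.asarray(y).dtype)  # float64
```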

And maybe the answer is just that for datatypes that do not round-trip
easily, `.item()` is probably preferable, while for datatypes that do
round-trip, scalars are fine.

- Sebastian



> 
> On Fri, Feb 21, 2020 at 5:37 PM Sebastian Berg
> <sebastian at sipsolutions.net> wrote:
> > Hi all,
> > 
> > When we create new datatypes, we have the option to make new
> > choices
> > for the new datatypes [0] (not the existing ones).
> > 
> > The question is: Should every NumPy datatype have a scalar
> > associated
> > and should operations like indexing return a scalar or a 0-D array?
> > 
> > This is in my opinion a complex, almost philosophical, question,
> > and we
> > do not have to settle anything for a long time. But, if we do not
> > decide a direction before we have many new datatypes the decision
> > will
> > make itself...
> > So happy about any ideas, even if it's just a gut feeling :).
> > 
> > There are various points. I would like to mostly ignore the
> > technical
> > ones, but I am listing them anyway here:
> > 
> >   * Scalars are faster (although that can likely be optimized)
> > 
> >   * Scalars have a lower memory footprint
> > 
> >   * The current implementation incurs a technical debt in NumPy.
> >     (I do not think that is a general issue, though. We could
> >     automatically create scalars for each new datatype probably.)
> > 
> > Advantages of having no scalars:
> > 
> >   * No need to keep track of scalars to preserve them in ufuncs;
> >     and libraries using `np.asarray` would not need an
> >     `np.asarray_or_scalar` (or to decide that they always return
> >     arrays, although ufuncs may not)
> > 
> >   * Seems simpler in many ways: you always know the output will
> >     be an array if it has to do with NumPy.
> > 
> > Advantages of having scalars:
> > 
> >   * Scalars are immutable and we are used to them from Python.
> >     A 0-D array cannot be used as a dictionary key consistently
> >     [1].
> > 
> >     I.e. without scalars as first-class citizens, `dict[arr1d[0]]`
> >     cannot work; `dict[arr1d[0].item()]` may (if `.item()` is
> >     defined), and e.g. `dict[arr1d[0].frozen()]` could make a copy
> >     to work. [2]
> > 
> >   * Object arrays as we have them now make sense: `arr1d[0]` can
> >     reasonably return a Python object. I.e. arrays feel more like
> >     containers if you can take elements out easily.
> > 
> > Could go both ways:
> > 
> >   * Without scalars, `scalar = arr1d[0]; scalar += 1` modifies
> >     the original array. With scalars, `arr1d[0, ...]` clarifies
> >     the meaning. (In principle it is good to never use `arr2d[0]`
> >     to get a 1-D slice, probably more so if scalars exist.)
> > 
> > Note: array-scalars (the current NumPy scalars) are not useful in
> > my
> > opinion [3]. A scalar should not be indexed or have a shape. I do
> > not
> > believe in scalars pretending to be arrays.
> > 
> > I personally tend towards liking scalars.  If Python was a language
> > where the array (array-programming) concept was ingrained into the
> > language itself, I would lean the other way. But users are used to
> > scalars, and they "put" scalars into arrays. Array objects are in
> > some
> > ways strange in Python, and I feel not having scalars detaches them
> > further.
> > 
> > Having scalars, however, also means we should preserve them. I
> > feel that in principle this is fairly straightforward. E.g. for
> > ufuncs:
> > 
> >    * np.add(scalar, scalar) -> scalar
> >    * np.add.reduce(arr, axis=None) -> scalar
> >    * np.add.reduce(arr, axis=1) -> array (even if arr is 1d)
> >    * np.add.reduce(scalar, axis=()) -> array
> > 
> > Of course, libraries that do `np.asarray` would/could basically
> > choose not to preserve scalars: their signature is defined as
> > taking strictly array input.
> > 
> > Cheers,
> > 
> > Sebastian
> > 
> > 
> > [0] At best this can be a vision to decide which way they may
> > evolve.
> > 
> > [1] E.g. PyTorch uses `hash(tensor) == id(tensor)`, which is
> > arguably strange. Quantity defines hash correctly, but does not
> > fully ensure immutability for 0-D Quantities. Ensuring
> > immutability in a world where "views" are a central concept
> > requires a read-only copy.
> > 
> > [2] Arguably `.item()` would always return a scalar, but it
> > would be a second-class citizen. (Although if it returns a
> > scalar, at least we already have a scalar implementation.)
> > 
> > [3] They are necessary due to technical debt for NumPy datatypes
> > though.
> > _______________________________________________
> > NumPy-Discussion mailing list
> > NumPy-Discussion at python.org
> > https://mail.python.org/mailman/listinfo/numpy-discussion
> 
> 