[Numpy-discussion] New DTypes: Are scalars a central concept in NumPy or not?
Chris Barker
chris.barker at noaa.gov
Mon Mar 23 14:45:51 EDT 2020
I've always found the duality of zero-d arrays and scalars confusing, and
I'm sure I'm not alone.
Having both is just plain weird.
But, backward compatibility aside, could we have ONLY Scalars?
When we index into an array, the dimensionality is reduced by one, so
indexing into a 1D array has to get us something: but the zero-d array is a
really weird object -- do we really need it?
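
To make the duality concrete, here is a minimal sketch of what indexing returns today, as far as I understand current NumPy behavior. The variable names are only illustrative.

```python
import numpy as np

arr1d = np.array([1.0, 2.0, 3.0])

# Integer indexing removes the last dimension and hands back a NumPy scalar.
elem = arr1d[0]

# Adding an Ellipsis asks for an array result, so we get a 0-d array instead.
zero_d = arr1d[0, ...]

print(type(elem).__name__, type(zero_d).__name__)
print(zero_d.shape, zero_d.ndim)
```

Both objects hold the same value, but only the second one is an ndarray.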
There is certainly a need for more numpy-like scalars: more than the built-in
data types, and some handy attributes and methods, like .dtype,
.itemsize, etc. But could we make an enhanced scalar that had everything we
actually need from a zero-d array?
The key point would be mutability -- but do we really need mutable scalars?
I can't think of any time I've needed that, when I couldn't have used a 1-d
array of length 1.
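
A small sketch of the mutability difference, assuming current NumPy behavior: a 0-d array and a length-1 array can both be assigned in place, while a NumPy scalar rejects item assignment.

```python
import numpy as np

zero_d = np.array(1.0)
zero_d[...] = 5.0          # in-place assignment works on a 0-d array

one_elem = np.array([1.0])
one_elem[0] = 5.0          # and on a 1-d array of length 1

scalar = np.float64(1.0)
try:
    scalar[...] = 5.0      # scalars do not support item assignment
    scalar_is_mutable = True
except TypeError:
    scalar_is_mutable = False

print(float(zero_d), one_elem[0], scalar_is_mutable)
```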
Is there a use case for zero-d arrays that could not be met with an
enhanced scalar?
-CHB
On Mon, Feb 24, 2020 at 12:30 PM Allan Haldane <allanhaldane at gmail.com>
wrote:
> I have some thoughts on scalars from playing with ndarray ducktypes
> (__array_function__), eg a MaskedArray ndarray-ducktype, for which I
> wanted an associated "MaskedScalar" type.
>
> In summary, the way scalars currently work makes ducktyping
> (duck-scalars) difficult:
>
> * numpy scalar types are not subclassable, so my duck-scalars aren't
> subclasses of numpy scalars and aren't in the type hierarchy
> * even if scalars were subclassable, I would have to subclass each
> scalar datatype individually to make masked versions
> * lots of code checks `isinstance(var, np.float64)` which breaks
> for my duck-scalars
> * it was difficult to distinguish between a duck-scalar and a duck-0d
> array. The method I used in the end seems hacky.
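
The isinstance problem in the list above can be sketched as follows. `DuckScalar`, `check_by_type` and `check_by_dtype` are hypothetical names made up for this illustration; they are not part of NumPy or of the MaskedArray ducktype being discussed.

```python
import numpy as np


class DuckScalar:
    """Hypothetical duck-scalar: a value plus a dtype, outside NumPy's type hierarchy."""

    def __init__(self, value, dtype):
        self.dtype = np.dtype(dtype)
        self.value = self.dtype.type(value)


def check_by_type(x):
    # The common pattern in existing code: only real NumPy scalars pass.
    return isinstance(x, np.float64)


def check_by_dtype(x):
    # An alternative that looks at the .dtype attribute instead of the class.
    dt = getattr(x, "dtype", None)
    return dt is not None and dt == np.dtype(np.float64)


real = np.float64(1.0)
duck = DuckScalar(1.0, np.float64)
print(check_by_type(real), check_by_type(duck))
print(check_by_dtype(real), check_by_dtype(duck))
```

The type-based check rejects the duck-scalar, while the dtype-based check accepts both.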
>
> This has led to some daydreams about how scalars should work, and also
> led me last to read through your NEPs 40/41 with specific focus on what
> you said about scalars, and was about to post there until I saw this
> discussion. I agree with what you said in the NEPs about not making
> scalars be dtype instances.
>
> Here is what ducktypes led me to:
>
> If we are able to do something like define a `np.numpy_scalar` type
> covering all numpy scalars, which has a `.dtype` attribute like you
> describe in the NEPs, then that would seem to solve the ducktype
> problems above. Ducktype implementors would need to make a "duck-scalar"
> type in parallel to their "duck-ndarray" type, but I found that to be
> pretty easy using an abstract class in my MaskedArray ducktype, since
> the MaskedArray and MaskedScalar share a lot of behavior.
>
> A numpy_scalar type would also help solve some object-array problems if
> the object scalars are wrapped in the np_scalar type. A long time ago I
> started to try to fix up various funny/strange behaviors of object
> datatypes, but there are lots of special cases, and the main problem was
> that the returned objects (eg from indexing) were not numpy types and
> did not support numpy attributes or indexing. Wrapping the returned
> object in `np.numpy_scalar` might add an extra slight annoyance to
> people who want to unwrap the object, but I think it would make object
> arrays less buggy and make code using object arrays easier to reason
> about and debug.
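
A short illustration of the object-array point, assuming current NumPy behavior: indexing a float array gives back a NumPy scalar with a `.dtype`, while indexing an object array gives back the stored Python object unchanged. `Fraction` is only a convenient stand-in for an arbitrary Python object.

```python
import numpy as np
from fractions import Fraction

# Fill the object array element by element so NumPy stores the objects as-is.
obj_arr = np.empty(2, dtype=object)
obj_arr[0] = Fraction(1, 3)
obj_arr[1] = Fraction(2, 3)

num_arr = np.array([1.0, 2.0])

print(type(num_arr[0]).__name__, hasattr(num_arr[0], "dtype"))
print(type(obj_arr[0]).__name__, hasattr(obj_arr[0], "dtype"))
```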
>
> Finally, a few random votes/comments based on the other emails on the list:
>
> I think scalars have a place in numpy (rather than just reusing 0d
> arrays), since there is a clear use in having hashable, immutable
> scalars. Structured scalars should probably be immutable.
>
> I agree with your suggestion that scalars should not be indexable. Thus,
> my duck-scalars (and proposed numpy_scalar) would not be indexable.
> However, I think they should encode their datatype through a .dtype
> attribute like ndarrays, rather than by inheritance.
>
> Also, something to think about is that currently numpy scalars satisfy
> the property `isinstance(np.float64(1), float)`, i.e they are within the
> python numerical type hierarchy. 0d arrays do not have this property. My
> proposal above would break this. I'm not sure what to think about
> whether this is a good property to maintain or not.
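
The property mentioned here can be checked directly. A minimal sketch, assuming current NumPy behavior; the `np.float32` line is an extra data point added for comparison and is not from the message above.

```python
import numpy as np

is_scalar_a_float = isinstance(np.float64(1), float)    # float64 subclasses Python float
is_0d_a_float = isinstance(np.array(1.0), float)        # a 0-d array does not
is_float32_a_float = isinstance(np.float32(1), float)   # neither do other float widths

print(is_scalar_a_float, is_0d_a_float, is_float32_a_float)
```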
>
> Cheers,
> Allan
>
>
>
> On 2/21/20 8:37 PM, Sebastian Berg wrote:
> > Hi all,
> >
> > When we create new datatypes, we have the option to make new choices
> > for the new datatypes [0] (not the existing ones).
> >
> > The question is: Should every NumPy datatype have a scalar associated
> > and should operations like indexing return a scalar or a 0-D array?
> >
> > This is in my opinion a complex, almost philosophical, question, and we
> > do not have to settle anything for a long time. But, if we do not
> > decide a direction before we have many new datatypes the decision will
> > make itself...
> > So happy about any ideas, even if it's just a gut feeling :).
> >
> > There are various points. I would like to mostly ignore the technical
> > ones, but I am listing them anyway here:
> >
> > * Scalars are faster (although that can likely be optimized)
> >
> > * Scalars have a lower memory footprint
> >
> > * The current implementation incurs a technical debt in NumPy.
> > (I do not think that is a general issue, though. We could
> > automatically create scalars for each new datatype probably.)
> >
> > Advantages of having no scalars:
> >
> > * No need to keep track of scalars to preserve them in ufuncs, or
> > libraries using `np.asarray`, do they need `np.asarray_or_scalar`?
> > (or decide they return always arrays, although ufuncs may not)
> >
> > * Seems simpler in many ways, you always know the output will be an
> > array if it has to do with NumPy.
> >
> > Advantages of having scalars:
> >
> > * Scalars are immutable and we are used to them from Python.
> > A 0-D array cannot be used as a dictionary key consistently [1].
> >
> > I.e. without scalars as first class citizen `dict[arr1d[0]]`
> > cannot work, `dict[arr1d[0].item()]` may (if `.item()` is defined),
> > and e.g. `dict[arr1d[0].frozen()]` could make a copy to work. [2]
> >
> > * Object arrays as we have them now make sense, `arr1d[0]` can
> > reasonably return a Python object. I.e. arrays feel more like
> > a container if you can take elements out easily.
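
The dictionary-key point in the list above can be sketched like this, assuming current NumPy behavior: a NumPy scalar hashes like the matching Python number, while a 0-d array is unhashable. The `lookup` dictionary is just an example.

```python
import numpy as np

arr1d = np.array([10, 20, 30])
lookup = {10: "ten"}

key_scalar = arr1d[0]        # NumPy scalar: hashable, compares equal to 10
key_0d = arr1d[0, ...]       # 0-d array: not hashable

found = lookup[key_scalar]

try:
    hash(key_0d)
    zero_d_hashable = True
except TypeError:
    zero_d_hashable = False

# Converting explicitly with .item() gives a plain Python int that works as a key.
found_via_item = lookup[key_0d.item()]

print(found, zero_d_hashable, found_via_item)
```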
> >
> > Could go both ways:
> >
> > * Scalar math `scalar = arr1d[0]; scalar += 1` modifies the array
> > without scalars. With scalars `arr1d[0, ...]` clarifies the
> > meaning. (In principle it is good to never use `arr2d[0]` to
> > get a 1D slice, probably more-so if scalars exist.)
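
A minimal sketch of the scalar-math case with scalars as they exist today: `+=` on the scalar rebinds the name and leaves the array alone, while `+=` on the 0-d view from `arr1d[0, ...]` writes through to the array.

```python
import numpy as np

arr1d = np.array([1, 2, 3])

scalar = arr1d[0]
scalar += 1                  # creates a new scalar; arr1d is untouched
after_scalar = arr1d.tolist()

view = arr1d[0, ...]         # 0-d array that shares memory with arr1d
view += 1                    # in-place add modifies arr1d[0]
after_view = arr1d.tolist()

print(scalar, after_scalar, after_view)
```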
> >
> > Note: array-scalars (the current NumPy scalars) are not useful in my
> > opinion [3]. A scalar should not be indexed or have a shape. I do not
> > believe in scalars pretending to be arrays.
> >
> > I personally tend towards liking scalars. If Python was a language
> > where the array (array-programming) concept was ingrained into the
> > language itself, I would lean the other way. But users are used to
> > scalars, and they "put" scalars into arrays. Array objects are in some
> > ways strange in Python, and I feel not having scalars detaches them
> > further.
> >
> > Having scalars, however, also means we should preserve them. I feel in
> > principle that is actually fairly straightforward. E.g. for ufuncs:
> >
> > * np.add(scalar, scalar) -> scalar
> > * np.add.reduce(arr, axis=None) -> scalar
> > * np.add.reduce(arr, axis=1) -> array (even if arr is 1d)
> > * np.add.reduce(scalar, axis=()) -> array
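
For comparison with the list above, here is a sketch of what NumPy returns today, as far as I understand current behavior. The list describes a possible design, so the results are not expected to match it in every case; in particular, 0-d results are currently converted to scalars.

```python
import numpy as np

arr2d = np.arange(6.0).reshape(2, 3)
arr1d = np.arange(3.0)
s = np.float64(1.5)

r_scalar_add = np.add(s, s)                      # scalar in, scalar out
r_full_reduce = np.add.reduce(arr2d, axis=None)  # reduce over everything -> scalar
r_axis_reduce = np.add.reduce(arr2d, axis=1)     # reduce one axis of 2-D -> 1-D array
r_1d_reduce = np.add.reduce(arr1d, axis=0)       # 0-d result is returned as a scalar
r_0d_add = np.add(np.array(1.5), np.array(1.5))  # 0-d array inputs also give a scalar

for name, value in [
    ("add(scalar, scalar)", r_scalar_add),
    ("reduce(2d, axis=None)", r_full_reduce),
    ("reduce(2d, axis=1)", r_axis_reduce),
    ("reduce(1d, axis=0)", r_1d_reduce),
    ("add(0d, 0d)", r_0d_add),
]:
    print(name, type(value).__name__)
```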
> >
> > Of course libraries that do `np.asarray` would/could basically choose to
> > not preserve scalars: Their signature is defined as taking strictly
> > array input.
> >
> > Cheers,
> >
> > Sebastian
> >
> >
> > [0] At best this can be a vision to decide which way they may evolve.
> >
> > [1] E.g. PyTorch uses `hash(tensor) == id(tensor)` which is arguably
> > strange. E.g. Quantity defines hash correctly, but does not fully
> > ensure immutability for 0-D Quantities. Ensuring immutability in a
> > world where "views" are a central concept requires a read-only copy.
> >
> > [2] Arguably `.item()` would always return a scalar, but it would be a
> > second class citizen. (Although if it returns a scalar, at least we
> > already have a scalar implementation.)
> >
> > [3] They are necessary due to technical debt for NumPy datatypes
> > though.
> >
> >
> > _______________________________________________
> > NumPy-Discussion mailing list
> > NumPy-Discussion at python.org
> > https://mail.python.org/mailman/listinfo/numpy-discussion
> >
>
--
Christopher Barker, Ph.D.
Oceanographer
Emergency Response Division
NOAA/NOS/OR&R (206) 526-6959 voice
7600 Sand Point Way NE (206) 526-6329 fax
Seattle, WA 98115 (206) 526-6317 main reception
Chris.Barker at noaa.gov