New DTypes: Are scalars a central concept in NumPy or not?
Hi all,

When we create new datatypes, we have the option to make new choices for the new datatypes [0] (not the existing ones). The question is: should every NumPy datatype have an associated scalar, and should operations like indexing return a scalar or a 0-D array?

This is, in my opinion, a complex, almost philosophical question, and we do not have to settle anything for a long time. But if we do not decide a direction before we have many new datatypes, the decision will make itself... So I am happy about any ideas, even if it's just a gut feeling :).

There are various points. I would like to mostly ignore the technical ones, but I am listing them anyway here:

* Scalars are faster (although that can likely be optimized)
* Scalars have a lower memory footprint
* The current implementation incurs a technical debt in NumPy. (I do not think that is a general issue, though. We could probably create scalars automatically for each new datatype.)

Advantages of having no scalars:

* No need to keep track of scalars to preserve them in ufuncs, or in libraries using `np.asarray` -- do they need an `np.asarray_or_scalar`? (Or they decide to always return arrays, although ufuncs may not.)
* Seems simpler in many ways: you always know the output will be an array if it has to do with NumPy.

Advantages of having scalars:

* Scalars are immutable and we are used to them from Python. A 0-D array cannot be used as a dictionary key consistently [1]. I.e. without scalars as first-class citizens, `dict[arr1d[0]]` cannot work, `dict[arr1d[0].item()]` may (if `.item()` is defined), and e.g. `dict[arr1d[0].frozen()]` could make a copy to work. [2]
* Object arrays as we have them now make sense: `arr1d[0]` can reasonably return a Python object. I.e. arrays feel more like containers if you can take elements out easily.

Could go both ways:

* Scalar math: `scalar = arr1d[0]; scalar += 1` modifies the array without scalars. With scalars, `arr1d[0, ...]` clarifies the meaning. (In principle it is good to never use `arr2d[0]` to get a 1-D slice, probably more so if scalars exist.)

Note: array-scalars (the current NumPy scalars) are not useful in my opinion [3]. A scalar should not be indexed or have a shape. I do not believe in scalars pretending to be arrays.

I personally tend towards liking scalars. If Python were a language where the array (array-programming) concept was ingrained into the language itself, I would lean the other way. But users are used to scalars, and they "put" scalars into arrays. Array objects are in some ways strange in Python, and I feel not having scalars detaches them further.

Having scalars, however, also means we should preserve them. I feel in principle that is actually fairly straightforward. E.g. for ufuncs:

* np.add(scalar, scalar) -> scalar
* np.add.reduce(arr, axis=None) -> scalar
* np.add.reduce(arr, axis=1) -> array (even if arr is 1-D)
* np.add.reduce(scalar, axis=()) -> array

Of course, libraries that do `np.asarray` would/could basically choose not to preserve scalars: their signature is defined as taking strictly array input.

Cheers,

Sebastian

[0] At best this can be a vision to decide which way they may evolve.

[1] E.g. PyTorch uses `hash(tensor) == id(tensor)`, which is arguably strange. E.g. Quantity defines hash correctly, but does not fully ensure immutability for 0-D Quantities. Ensuring immutability in a world where "views" are a central concept requires a read-only copy.

[2] Arguably `.item()` would always return a scalar, but it would be a second-class citizen. (Although if it returns a scalar, at least we already have a scalar implementation.)

[3] They are necessary due to technical debt for NumPy datatypes, though.
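To make the dictionary-key and mutability points above concrete, here is a small sketch of how current NumPy behaves for the existing float64 datatype (how new datatypes should behave is exactly the open question):

```python
import numpy as np

arr1d = np.array([1.5, 2.5])

s = arr1d[0]                 # today this is a np.float64 scalar
print(type(s))               # <class 'numpy.float64'>
d = {s: "ok"}                # scalars are hashable, so they work as dict keys

z = arr1d[0, ...]            # adding an ellipsis gives a 0-D array *view* instead
z += 1                       # in-place math on the view modifies arr1d
print(arr1d[0])              # 2.5
# hash(z)                    # would raise TypeError: ndarrays are unhashable
```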
I personally have always found it weird and annoying to deal with 0-D arrays, so +1 for scalars!*

Juan

*: admittedly, I have almost no grasp of the underlying NumPy implementation complexities, but I will happily take Sebastian's word that scalars can be consistent with the library.

On Fri, 21 Feb 2020, at 7:37 PM, Sebastian Berg wrote:
Hi Sebastian, Just to clarify the difference:
x = np.float64(42)
y = np.array(42, dtype=float)
Here `x` is a scalar and `y` is a 0D array, correct? If that's the case, not having the former would be very confusing for users (at least, that would be very confusing to me, FWIW). If anything, I think it'd be cleaner to not have the latter, and only have either scalars or 1D arrays (i.e., N-D arrays with N>=1), but it is probably way too late to even think about it anyway. Cheers, Evgeni On Sat, Feb 22, 2020 at 4:37 AM Sebastian Berg <sebastian@sipsolutions.net> wrote:
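For reference, the difference between the two objects under current NumPy (nothing here is new behavior, just the status quo being discussed):

```python
import numpy as np

x = np.float64(42)             # scalar
y = np.array(42, dtype=float)  # 0-D array

print(type(x), x.ndim)            # <class 'numpy.float64'> 0
print(type(y), y.ndim)            # <class 'numpy.ndarray'> 0
print(isinstance(x, np.ndarray))  # False
print(y.flags.writeable)          # True -- the 0-D array is mutable; the scalar x is immutable
```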
On Sat, Feb 22, 2020 at 9:34 AM <josef.pktd@gmail.com> wrote:
Also, there is the question of which scalar: `.item()` versus `[()]`. This was used in the old times in scipy.stats, and I just saw https://github.com/scipy/scipy/pull/11165#issuecomment-589952838

Aside: AFAIR, I use 0-dim arrays also to ensure that I have a NumPy dtype and not, e.g., some equivalent Python type.

Josef
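For reference, the difference between the two extraction styles mentioned here, under current NumPy:

```python
import numpy as np

a = np.array(3.0)       # 0-D array
print(type(a.item()))   # <class 'float'>         -> plain Python scalar, dtype is lost
print(type(a[()]))      # <class 'numpy.float64'> -> NumPy scalar, dtype is kept
```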
On Sat, Feb 22, 2020 at 9:41 AM <josef.pktd@gmail.com> wrote:
0-dim as mutable pseudo-scalar:

    a = np.asarray(5)
    a, id(a)      # (array(5), 844574884528)
    a[()] = 1
    a, id(a)      # (array(1), 844574884528)

Maybe I never used that. In a recent similar case, I could use just a 1-d list or array to work around Python's mutability behavior.
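A minimal sketch of the workaround mentioned at the end -- a length-1 array used as a mutable cell so that in-place changes are visible to the caller (the names here are made up for illustration):

```python
import numpy as np

counter = np.zeros(1, dtype=int)   # 1-element array used as a mutable pseudo-scalar

def bump(c):
    c[0] += 1                      # mutates the caller's array, unlike rebinding a plain int

bump(counter)
print(counter[0])                  # 1
```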
Off the cuff, my intuition is that dtypes will want to be able to define how scalar indexing works, and let it return objects other than arrays. So e.g.:

- some dtypes might just return a zero-d array
- some dtypes might want to return some arbitrary domain-appropriate type, like a datetime dtype might want to return datetime.datetime objects (like how dtype(object) works now)
- some dtypes might want to go to all the trouble to define immutable duck-array "scalar" types (like how dtype(float) and friends work now)

But I don't think we need to give that last case any special privileges in the dtype system. For example, I don't think we need to mandate that everyone who defines their own dtype MUST also implement a custom duck-array type to act as the scalars, or build a whole complex system to auto-generate such types given an arbitrary user-defined dtype.

-n

On Fri, Feb 21, 2020 at 5:37 PM Sebastian Berg <sebastian@sipsolutions.net> wrote:
-- Nathaniel J. Smith -- https://vorpus.org
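For reference, the second bullet in the message above is how the existing object dtype already behaves: scalar indexing hands back the stored Python object itself rather than any NumPy wrapper.

```python
import datetime
import numpy as np

arr = np.array([datetime.datetime(2020, 2, 21)], dtype=object)
elem = arr[0]
print(type(elem))                    # <class 'datetime.datetime'>
print(isinstance(elem, np.generic))  # False -- no NumPy scalar wrapper involved
```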
On Sat, 2020-02-22 at 13:28 -0800, Nathaniel Smith wrote:
Right, my assumption is that whatever we suggest is going to be what most will choose, so we have the chance to move in a certain direction and set a standard. This is to make code which may or may not deal with 0-D arrays more reliable (more below).
(Note that "autogenerating" would be nothing more than a write-only 0-D array, which does not implement indexing.) There are also categoricals, for which the type may just be "object" in practice (you could define it closer, but it seems unlikely to be useful). And for simple numerical types, if we go the `.item()` path, it is arguably fine if the type is just a python type. Maybe the crux of the problem is actuall that in general `np.asarray(arr1d[0])` does not roundtrip for the current object dtype, and only partially for a categorical above. As such that is fine, but right now it is hard to tell when you will have a scalar and when a 0D array. Maybe it is better to talk about a potentially new `np.pyobject[type]` datatype (i.e. an object datatype with all elements having the same python type). Currently writing generic code with the object dtype is tricky, because we randomly return the object instead of arrays. What would be the preference for such a specific dtype? * arr1d[0] -> scalar or array? * np.add(scalar, scalar) -> scalar or array * np.add.reduce(arr) -> scalar or array? I think the `np.add` case we can decide fairly independently. The main thing is the indexing. Would we want to force a `.item()` call or not? Forcing `.item()` is in many ways simpler, I am unsure whether it would be inconvenient often. And, maybe the answer is just that for datatypes that do not round-trip easily, `.item()` is probably preferable, and for datatypes that do round-trip scalars are fine. - Sebastian
Hi, Sebastian,

On 22.02.20, 02:37, "NumPy-Discussion on behalf of Sebastian Berg" <numpy-discussion-bounces+hameerabbasi=yahoo.com@python.org on behalf of sebastian@sipsolutions.net> wrote:

> [...]
> Could go both ways:
> * Scalar math: `scalar = arr1d[0]; scalar += 1` modifies the array without scalars. With scalars, `arr1d[0, ...]` clarifies the meaning. (In principle it is good to never use `arr2d[0]` to get a 1-D slice, probably more so if scalars exist.)

From a usability perspective, one could argue that if the dimension of the array one is indexing into is known and the user isn't advanced, then the behavior expected is one of scalars and not 0-D arrays. If, however, the input dimension is unknown, then the behavior switch at 0-D and the need for an extra ellipsis to ensure array-ness makes things confusing to regular users. I am fine with the current behavior of indexing, as anything else would likely be a large backwards-compat break.

> [...]
> Having scalars, however, also means we should preserve them. I feel in principle that is actually fairly straightforward. E.g. for ufuncs:
> * np.add(scalar, scalar) -> scalar
> * np.add.reduce(arr, axis=None) -> scalar
> * np.add.reduce(arr, axis=1) -> array (even if arr is 1-D)
> * np.add.reduce(scalar, axis=()) -> array

I love this idea.

> [...]
I have some thoughts on scalars from playing with ndarray ducktypes (__array_function__), e.g. a MaskedArray ndarray-ducktype, for which I wanted an associated "MaskedScalar" type.

In summary, the way scalars currently work makes ducktyping (duck-scalars) difficult:

* numpy scalar types are not subclassable, so my duck-scalars aren't subclasses of numpy scalars and aren't in the type hierarchy
* even if scalars were subclassable, I would have to subclass each scalar datatype individually to make masked versions
* lots of code checks `isinstance(var, np.float64)`, which breaks for my duck-scalars
* it was difficult to distinguish between a duck-scalar and a duck-0d array. The method I used in the end seems hacky.

This has led to some daydreams about how scalars should work, and also led me at last to read through your NEPs 40/41 with specific focus on what you said about scalars, and I was about to post there until I saw this discussion. I agree with what you said in the NEPs about not making scalars be dtype instances.

Here is what ducktypes led me to: If we are able to do something like define a `np.numpy_scalar` type covering all numpy scalars, which has a `.dtype` attribute like you describe in the NEPs, then that would seem to solve the ducktype problems above. Ducktype implementors would need to make a "duck-scalar" type in parallel to their "duck-ndarray" type, but I found that to be pretty easy using an abstract class in my MaskedArray ducktype, since the MaskedArray and MaskedScalar share a lot of behavior.

A numpy_scalar type would also help solve some object-array problems if the object scalars are wrapped in the numpy_scalar type. A long time ago I started to try to fix up various funny/strange behaviors of object datatypes, but there are lots of special cases, and the main problem was that the returned objects (e.g. from indexing) were not numpy types and did not support numpy attributes or indexing. Wrapping the returned object in `np.numpy_scalar` might add a slight extra annoyance to people who want to unwrap the object, but I think it would make object arrays less buggy and make code using object arrays easier to reason about and debug.

Finally, a few random votes/comments based on the other emails on the list: I think scalars have a place in numpy (rather than just reusing 0d arrays), since there is a clear use in having hashable, immutable scalars. Structured scalars should probably be immutable. I agree with your suggestion that scalars should not be indexable. Thus, my duck-scalars (and the proposed numpy_scalar) would not be indexable. However, I think they should encode their datatype through a .dtype attribute like ndarrays, rather than by inheritance.

Also, something to think about is that currently numpy scalars satisfy the property `isinstance(np.float64(1), float)`, i.e. they are within the Python numerical type hierarchy. 0d arrays do not have this property. My proposal above would break this. I'm not sure what to think about whether this is a good property to maintain or not.

Cheers,
Allan

On 2/21/20 8:37 PM, Sebastian Berg wrote:
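The last property mentioned above is easy to check under current NumPy:

```python
import numpy as np

print(isinstance(np.float64(1), float))   # True  -- the scalar sits in Python's numeric hierarchy
print(isinstance(np.array(1.0), float))   # False -- the 0-D array does not
```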
I've always found the duality of zero-d arrays and scalars confusing, and I'm sure I'm not alone. Having both is just plain weird.

But, backward compatibility aside, could we have ONLY scalars? When we index into an array, the dimensionality is reduced by one, so indexing into a 1D array has to get us something; but the zero-d array is a really weird object -- do we really need it?

There is certainly a need for more numpy-like scalars: more than the built-in data types, and some handy attributes and methods, like .dtype, .itemsize, etc. But could we make an enhanced scalar that had everything we actually need from a zero-d array?

The key point would be mutability -- but do we really need mutable scalars? I can't think of any time I've needed that, when I couldn't have used a 1-d array of length 1.

Is there a use case for zero-d arrays that could not be met with an enhanced scalar?

-CHB

On Mon, Feb 24, 2020 at 12:30 PM Allan Haldane <allanhaldane@gmail.com> wrote:
-- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker@noaa.gov
On Mon, 2020-03-23 at 11:45 -0700, Chris Barker wrote:
I guess so, it is a tricky situation, and I do not really have an answer.
Well, it is hard to write functions that work in N dimensions (where N can be 0) if the 0-D array does not exist. You can get away with scalars in most cases, because they pretend to be arrays in most cases (aside from mutability). But I am pretty sure we have a bunch of cases that need `res = np.asarray(res)` simply because `res` should be N-D but could have been silently converted to a scalar. E.g. see https://github.com/numpy/numpy/issues/13105 for an issue about this (although it does not actually list any specific problems).

- Sebastian
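A sketch of the kind of defensive code meant here (the `normalize` helper is made up for illustration): operations on a 0-D array currently decay to a scalar, so generic N-D code has to re-wrap its result.

```python
import numpy as np

print(type(np.array(2.0) / 1.0))   # <class 'numpy.float64'> -- the 0-D array "decayed"

def normalize(x):
    x = np.asarray(x)
    res = x / np.max(np.abs(x))    # for 0-D input this is already a scalar...
    return np.asarray(res)         # ...so re-wrap to reliably return an array

print(type(normalize(np.array(2.0))))   # <class 'numpy.ndarray'>
print(type(normalize([1.0, -4.0])))     # <class 'numpy.ndarray'>
```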
sorry to have fallen off the numpy grid for a bit, but: On Mon, Mar 23, 2020 at 1:37 PM Sebastian Berg <sebastian@sipsolutions.net> wrote:
On Mon, 2020-03-23 at 11:45 -0700, Chris Barker wrote:
But, backward compatibility aside, could we have ONLY Scalars?
I'm not sure this is unsolvable (again, backwards compatibility aside) -- after all, one of the key issues is that it's undetermined what the rank should be of `array(a_scalar)`: 0-d is the only unambiguous answer, but then it's not really an array in the usual sense anyway. So in theory, we could disallow that conversion without specifying a rank.

At the end of the day, there has to be some endpoint on how far you can reduce the rank of an array and have it work -- why not have 1 be the lower limit?

-CHB
-- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker@noaa.gov
On Wed, 2020-04-08 at 12:37 -0700, Chris Barker wrote:
So as a (silly) example, the following does not generalize to 0-d, even though it should:

def weird_normalize_by_trace_inplace(stacked_matrices):
    """Divides matrices by their trace but retains sign
    (works in-place, and thus e.g. not for integer arrays).

    Parameters
    ----------
    stacked_matrices : (..., N, N) ndarray
    """
    assert stacked_matrices.shape[-1] == stacked_matrices.shape[-2]
    trace = np.trace(stacked_matrices, axis1=-2, axis2=-1)
    trace[trace < 0] *= -1
    stacked_matrices /= trace[..., np.newaxis, np.newaxis]

Sure, that function does not make sense and you could rewrite it, but the fact is that in that function you want to conditionally modify `trace` in-place, but `trace` can be 0-d and the "conditional" modification breaks down.

- Sebastian
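A minimal demonstration of where it breaks under current NumPy: for a single matrix, `np.trace` returns a scalar rather than a 0-D array, and the scalar cannot be modified in place.

```python
import numpy as np

single = -2.0 * np.eye(3)
trace = np.trace(single)                      # np.float64, not a 0-D array
print(type(trace))                            # <class 'numpy.float64'>
# trace[trace < 0] *= -1                      # fails: scalars are immutable, no item assignment

stack = single[np.newaxis]                    # shape (1, 3, 3)
trace = np.trace(stack, axis1=-2, axis2=-1)   # shape (1,), a real array
trace[trace < 0] *= -1                        # works fine
print(trace)                                  # [6.]
```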
On Wed, Apr 8, 2020 at 1:17 PM Sebastian Berg <sebastian@sipsolutions.net> wrote:
I guess that's what I'm getting at -- there is always an endpoint to reducing the rank. A function that's designed to work on a "stack" of something doesn't have to work on a single something, when it can, instead, work on a "stack" of height one.

Isn't the trace of a matrix always a scalar? And thus the trace(s) of a stack of matrices would always be 1-D? So that function should do something like:

    stacked_matrices.shape = (-1, M, M)

yes? And then it would always work.

Again, backwards compatibility, but there is a reason the np.atleast_*() functions exist -- you often need to make sure your inputs have the dimensionality expected.

-CHB

-- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker@noaa.gov
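A hedged sketch of that suggestion (the function name is made up, and it returns a new array instead of working in place like Sebastian's original): forcing the input to be a stack of height at least one keeps the trace 1-D and sidesteps the 0-D problem entirely.

```python
import numpy as np

def weird_normalize_by_trace(stacked_matrices):
    """Divide matrices by the absolute value of their trace (returns a new array)."""
    stacked_matrices = np.asarray(stacked_matrices)
    stack = stacked_matrices.reshape(-1, *stacked_matrices.shape[-2:])  # stack of height >= 1
    trace = np.trace(stack, axis1=-2, axis2=-1)    # always 1-D now
    trace[trace < 0] *= -1                         # safe: never 0-D
    out = stack / trace[:, np.newaxis, np.newaxis]
    return out.reshape(stacked_matrices.shape)

print(weird_normalize_by_trace(-2.0 * np.eye(2)))  # a single matrix now works too
```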
participants (8)

- Allan Haldane
- Chris Barker
- Evgeni Burovski
- Hameer Abbasi
- josef.pktd@gmail.com
- Juan Nunez-Iglesias
- Nathaniel Smith
- Sebastian Berg