`np.array()`, array-likes, nested sequences and subclasses

Hi all,

tl;dr: `np.array()` is somewhat ill-defined, which also creates issues for Quantities. In a recent PR I am cementing, and slightly broadening, its definition. So we have to decide how we wish to handle code such as the following in the long run:

    np.array([array-like, array-like])

---

Traditionally, we have two meanings of "array-like" as understood by `np.array()` (in the text I use array-like for the second point here):

1. Nested sequences of scalars.
2. A single array-like object, meaning a buffer-interface, an array subclass, a pandas dataframe (`__array__()`), etc.

However, the boundaries between these are fuzzy, and have become fuzzier over the years. The reason is that a NumPy array (and many array-likes) are also nested sequences of scalars.

I defined the current behaviour slightly more clearly in my PR, but in doing so also subtly broadened it [0]:

1. Any array-like embedded in the nested sequences is converted to a NumPy array. [1] (An array-like is never interpreted as a sequence.)
2. Any array-like's elements will be elements of the output. We never enter array-likes recursively (including object arrays).
3. The `subok=True` parameter is implicitly ignored, unless the input is a single ndarray subclass.

Now to the issues at hand:

* We should make sure those definitions are good. They mainly cement current behaviour, but if we want to roll back on features, we should do it now.
* There are some issues around Quantity and masked arrays, because their "scalars" are (sometimes) 0-D arrays, and they currently rely on NumPy considering them to be scalars. This has its own set of long-term issues [2].

For now, I can simply roll the changes to 0-D array behaviour back. But in the mid-to-long run, we have to make a decision, or perpetually live with array subclasses being subtly broken:

1. Define Quantity and Masked arrays as wrong. They must use a special DType, which consistently tells NumPy that the elements cannot simply be copied by converting the Quantity to an array. The upside is that it generalizes to N-D.
2. Independently, but partially addressing the Quantity issue, we have to decide what `np.array()` should actually do. A sequence containing array-likes is in most cases better written using `np.stack()`, but due to the fuzzy boundaries, code like `np.array([dataframe, dataframe])` is probably common. We could try to deprecate it, though.

The downside of deprecation, it seems to me, is that we would have to reject viewing array-likes as sequences. To me, doing that has its own set of issues, if only because `np.array([arraylike])` seems perfectly reasonable, but may be very slow.

- Sebastian

[0] It is hard to list exactly how it is broadened, because the current behaviour has very subtle aspects, such as actually iterating a `memoryview()`, which always does the same thing, but only works for 1-D memoryviews, and fails for both 0-D and N-D.

[1] There are some subtleties which are not important here, such as that I anticipate the possibility of having array-likes which are considered scalars with respect to a given dtype, such as `np.array([poly], dtype=Polynomial)` where a poly object itself is an array-like.

[2] Basically:

    np.array([0d_array], dtype=user_dtype)

works, by ending up calling:

    res[0] = float(0d_array)  # quantity.__float__ is used!

which works nicely for the typical float/int dtype, but is tricky to get right for general dtypes (e.g. longdouble/clongdouble). This is a small issue now, but it could become a problem when more user-dtypes are defined.
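For readers who want to poke at the three rules listed in the message, here is a minimal sketch. It assumes a NumPy version that already includes the behaviour described in the PR; the `MyArray` subclass and all variable names are hypothetical and exist only for illustration. It is not taken from the PR itself.

```python
import numpy as np


class MyArray(np.ndarray):
    """Hypothetical ndarray subclass, used only for illustration."""


x = np.arange(3)
y = np.arange(3, 6)

# Rule 1: array-likes embedded in a sequence are converted to arrays, so the
# result matches the more explicit np.stack() spelling.
nested = np.array([x, y])
stacked = np.stack([x, y])
print(nested.shape, np.array_equal(nested, stacked))

# Rule 2: array-likes are not entered recursively. The lists stored in this
# object array stay elements of the output instead of adding a dimension.
obj = np.empty(2, dtype=object)
obj[0] = [1, 2]
obj[1] = [3, 4]
wrapped = np.array([obj])
print(wrapped.shape, wrapped.dtype)

# Rule 3: subok=True only has an effect when the input is a single ndarray
# subclass; inside a list the subclass is dropped.
sub = x.view(MyArray)
single = np.array(sub, subok=True)
in_list = np.array([sub], subok=True)
print(type(single).__name__, type(in_list).__name__)
```

The `np.stack()` line is there because the message suggests it as the clearer spelling for a sequence of array-likes; for equally shaped inputs both calls give the same result.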
participants (1)
-
Sebastian Berg