[Numpy-discussion] Advanced indexing: "fancy" vs. orthogonal

Nathaniel Smith njs at pobox.com
Fri Apr 3 19:54:25 EDT 2015


On Apr 1, 2015 2:17 AM, "R Hattersley" <rhattersley at gmail.com> wrote:
>
> There are two different interpretations in common use of how to handle multi-valued (array/sequence) indexes. The numpy style is to consider all multi-valued indices together, which allows arbitrary points to be extracted. The orthogonal style (e.g. as provided by netcdf4-python) is to consider each multi-valued index independently.
>
> For example:
>
> >>> type(v)
> <type 'netCDF4.Variable'>
> >>> v.shape
> (240, 37, 49)
> >>> v[(0, 1), (0, 2, 3)].shape
> (2, 3, 49)
> >>> np.array(v)[(0, 1), (0, 2, 3)].shape
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> IndexError: shape mismatch: indexing arrays could not be broadcast together with shapes (2,) (3,)
>
>
> In a netcdf4-python GitHub issue the authors of various orthogonal indexing packages have been discussing how to distinguish the two behaviours and have currently settled on a boolean __orthogonal_indexing__ attribute.
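
(For concreteness: the orthogonal result in that example can be
reproduced on a plain ndarray by passing the index sequences through
np.ix_, which turns them into broadcastable open-mesh arrays:

>>> np.array(v)[np.ix_((0, 1), (0, 2, 3))].shape
(2, 3, 49)

whereas indexing with the bare tuples triggers numpy's broadcasting
rules and hence the shape mismatch above.)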

I guess my feeling is that this attribute is a fine solution to the
wrong problem. If I understand the situation correctly: users are
writing two copies of their indexing code to handle two different
array-duck-types (those that do broadcasting indexing and those that
do Cartesian product indexing), and then have trouble knowing which
set of code to use for a given object. The problem that
__orthogonal_indexing__ solves is that it makes it easier to decide which
code to use. It works well for this, great.
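
To make the dispatch concrete, what the attribute enables is roughly
the following (just a sketch; the wrapper name and the two-index
signature are purely illustrative):

import numpy as np

def outer_select(arr, rows, cols):
    # Objects advertising orthogonal indexing can take the index
    # sequences directly; for broadcasting ("fancy") indexers like
    # plain ndarrays, np.ix_ converts the sequences into open-mesh
    # arrays so the same outer-product selection is performed.
    if getattr(arr, "__orthogonal_indexing__", False):
        return arr[rows, cols]
    return arr[np.ix_(rows, cols)]

That works, but every consumer of both kinds of arrays ends up
carrying some version of this branch.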

But, the real problem here is that we have two different array duck
types that force everyone to write their code twice. This is a
terrible state of affairs! (And exactly analogous to the problems
caused by np.ndarray disagreeing with np.matrix & scipy.sparse about
the proper definition of *, which PEP 465 may eventually
alleviate.) IMO we should be solving this indexing problem directly,
not applying bandaids to its symptoms, and the way to do that is to
come up with some common duck type that everyone can agree on.
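
To see why a single code path is so hard to write today, note that
the same index expression can succeed under both conventions and
quietly mean different things. A small worked example on a plain
ndarray (the orthogonal result is emulated here with np.ix_):

>>> a = np.arange(24).reshape(4, 6)
>>> a[[0, 1], [2, 3]]           # broadcasting: points (0, 2) and (1, 3)
array([2, 9])
>>> a[np.ix_([0, 1], [2, 3])]   # what an orthogonal indexer returns
array([[2, 3],
       [8, 9]])

Same expression, different shape and contents, and no error to flag
the mismatch.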

Unfortunately, AFAICT this means our only options here are to have
some kind of backcompat break in numpy, some kind of backcompat break
in pandas, or to do nothing and continue indefinitely with the status
quo where the same indexing operation might silently return different
results depending on the types passed in. All of these options have
real costs for users, and it isn't at all clear to me what the
relative costs will be when we dig into the details of our various
options. So I'd be very happy to see worked out proposals for any or
all of these approaches. It strikes me as really premature to be
issuing proclamations about what changes might be considered. There is
really no danger in *considering* a proposal; the worst case is that
we end up rejecting it anyway, but based on better information.

-n


