Indexing NEP draft


Hey, small update on this. The NEP draft has not changed much, but you can now try the full power of the proposed new indexing types [1]:

* arr.oindex[...]  # orthogonal/outer indexing
* arr.vindex[...]  # vectorized (like fancy, but different ;))
* arr.lindex[...]  # legacy/fancy indexing

with my pull request at https://github.com/numpy/numpy/pull/6075

You can try it locally by cloning the numpy github repository and then running from the source dir:

git fetch upstream pull/6075/head:pr-6075 && git checkout pr-6075; python runtests.py --ipython
# Inside ipython:
import warnings; warnings.simplefilter("always")

The examples from the NEP should all run fine; you can find the NEP draft at: https://github.com/numpy/numpy/pull/6256/files?short_path=01e4dd9#diff-01e4d...

I would be most happy about any comments or suggestions!

- Sebastian

[1] Modulo possible bugs; there is no test suite yet...

On Wed, 2015-11-11 at 11:02 +0100, Sebastian Berg wrote:
Hi all,
while discussing with Nathaniel and others at SciPy, we thought that maybe we could push for orthogonal-type indexing in numpy. Now, with the new version out and some other discussions done, I thought it is time to pick it up :).
The basic ideas are twofold. First, make indexing easier and less confusing for newcomers (and advanced users too); second, improve interoperability with projects such as xray, for which orthogonal/outer-type indexing makes more sense.
I have started working on:
1. A preliminary draft of an NEP you can view at https://github.com/numpy/numpy/pull/6256/files?short_path=01e4dd9#diff-01e4d... or at the end of this mail.
2. A preliminary implementation of the `oindex` attribute with orthogonal/outer style indexing in https://github.com/numpy/numpy/pull/6075 which you can try out by cloning numpy and then running from the source dir:
git fetch upstream pull/6075/head:pr-6075 && git checkout pr-6075; python runtests.py --ipython
This will fetch my PR, switch to the branch and open an interactive ipython shell where you will be able to do arr.oindex[...].
Note that I consider the NEP quite preliminary in many parts, and it may still be very confusing unless you are well versed in current advanced indexing. There are some longer examples comparing the different styles, and another "example" which tries to show a use case going from simpler to more complex indexing operations. Any comments are very welcome, even if it is just "I don't understand a word" :). I know it is probably too short and, at least without examples, not easy to understand.
Best,
Sebastian
==================================================================================
The current NEP draft:
==========================================================
Implementing intuitive and full featured advanced indexing
==========================================================
:Author: Sebastian Berg
:Date: 2015-08-27
:Status: draft
Executive summary
=================
Advanced indexing with multiple array indices is typically confusing to new users of NumPy and, in many cases, even to experienced ones. To avoid this problem and allow for more and clearer features, we propose to:
1. Introduce ``arr.oindex[indices]`` which allows advanced indices, but uses outer indexing logic.
2. Introduce ``arr.vindex[indices]`` which uses the current "vectorized"/broadcasted logic but with two differences from fancy indexing:

   1. Boolean indices always use the outer indexing logic. (Multi-dimensional booleans should be allowed.)
   2. The integer index result dimensions are always the first axes of the result array. No transpose is done, even for a single integer array index.
3. Vanilla indexing on the array will only give warnings and eventually errors either:
* when there is ambiguity between legacy fancy and outer indexing (note that ``arr[[1, 2], :, 0]`` is such a case; an integer can be the "second" integer index array),
* when any integer index array is present (possibly additionally for more than one boolean index array).
These constraints are sufficient for making indexing generally consistent with expectations and providing a less surprising learning curve with ``oindex``.
Note that all things mentioned here apply both to assignment and to subscription.
Understanding these details is *not* easy. The `Examples` section gives code examples, and the hopefully easier `Motivational Example` section provides some motivating use cases for the general ideas; it is likely a good start for anyone not intimately familiar with advanced indexing.
Motivation
==========
Old-style advanced indexing with multiple array (boolean or integer) indices, also called "fancy indexing", tends to be very confusing for new users. While fancy (or legacy) indexing is useful in many cases, one would naively assume that the result of multiple 1-d index arrays is analogous to multiple slices along each dimension (also called "outer indexing").
However, legacy fancy indexing with multiple arrays broadcasts these arrays into a single index over multiple dimensions. There are three main points of confusion when multiple array indices are involved:
1. Most new users will usually expect outer indexing (consistent with slicing); see the short example after this list. This is also the most common way of handling this in other packages or languages.
2. The axes introduced by the array indices are at the front, unless all array indices are consecutive, in which case one can deduce where the user "expects" them to be:
   * `arr[:, [0, 1], :, [0, 1]]` will have the first dimension of length 2.
   * `arr[:, [0, 1], [0, 1]]` will have the second dimension of length 2.
3. When a boolean array index is mixed with another boolean or integer array, the result is very hard to understand (the boolean array is converted to integer array indices and then broadcast), and hardly useful. There is no well-defined broadcast for booleans, so boolean indices are logically always "``outer``" type indices.
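To make the difference concrete, here is a small illustration (in terms of current NumPy) contrasting legacy fancy indexing with the outer/orthogonal behaviour; ``np.ix_`` already provides the outer behaviour for the pure integer-array case::

>>> import numpy as np
>>> arr = np.arange(12).reshape(3, 4)
>>> # Fancy indexing broadcasts the index arrays and picks the pairs (0, 0) and (1, 2):
>>> arr[[0, 1], [0, 2]].shape
(2,)
>>> # The outer reading (what ``arr.oindex[[0, 1], [0, 2]]`` would do) takes every
>>> # combination of rows 0, 1 with columns 0, 2; today ``np.ix_`` expresses this:
>>> arr[np.ix_([0, 1], [0, 2])].shape
(2, 2)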
Proposed rules
==============
From the three problems noted above some expectations for NumPy can be deduced:
1. There should be a prominent outer/orthogonal indexing method such as ``arr.oindex[indices]``.
2. Considering how confusing fancy indexing can be, it should only occur explicitly (e.g. ``arr.vindex[indices]``).
3. A new ``arr.vindex[indices]`` method would not be tied to the confusing transpose rules of fancy indexing (which, for example, one needs to know even for the simple case of a single advanced index). Thus, no transposing should be done; the axes of the advanced indices are always inserted at the front, even for a single index.
4. Boolean indexing is conceptually outer indexing. Broadcasting together with other advanced indices in the manner of legacy "fancy indexing" is generally not helpful or well defined. A user who wishes the "``nonzero``" plus broadcast behaviour can thus be expected to do this manually. Using this rule, a single boolean index can index into multiple dimensions at once.
5. An ``arr.lindex`` or ``arr.findex`` should likely be implemented to allow legacy fancy indexing indefinitely. This also gives a simple way to update fancy indexing code, making deprecations to vanilla indexing easier.
6. Vanilla indexing ``arr[...]`` could return an error for ambiguous cases. For the beginning, this probably means that cases where ``arr[ind]`` and ``arr.oindex[ind]`` return different results give deprecation warnings. However, the exact rules for this (especially the final behaviour) are not quite clear in cases such as ``arr[0, :, index_arr]`` (see the short example below).
All other rules for indexing are identical.
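To make the ambiguity mentioned in rule 6 concrete, here is a small illustration using current NumPy behaviour; the outer reading is emulated in two steps, since ``oindex`` does not exist yet::

>>> arr = np.ones((5, 6, 7, 8))
>>> ind = np.array([1, 2])
>>> # Legacy fancy indexing: the scalar 0 and ``ind`` are separated by a slice,
>>> # so their broadcast result axis is moved to the front of the result:
>>> arr[0, :, ind].shape
(2, 6, 8)
>>> # The outer/orthogonal reading keeps the axis where the index was written:
>>> arr[0][:, ind].shape
(6, 2, 8)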
Open Questions
==============
1. Especially for the new indexing attributes ``oindex`` and ``vindex``, a case could be made to not implicitly add an ``Ellipsis`` index if necessary. This would help to find bugs, since an array with too many dimensions for the given index could be caught. (I am in favor of this, but doubt we should consider it for vanilla indexing.)
2. The names ``oindex`` and ``vindex`` are just suggestions at the time of writing this; another name NumPy has used for something like ``oindex`` is ``np.ix_``. See also below.
3. It would be possible to limit the use of boolean indices in ``vindex``, assuming that they are rare and to some degree special. (This would make implementation simpler, but I do not see a big reason.)
4. ``oindex`` and ``vindex`` could always return copies, even when no array operation occurs. One argument for using the same rules is that this way ``oindex`` can be used as a general index replacement. (There is likely no big reason for this; however, there is one reason: ``arr.vindex[array_scalar, ...]`` can occur, where ``array_scalar`` should be a 0-D array. Copying always "fixes" the possible inconsistency.)
5. The final state to morph indexing into is not fixed in this NEP. It is for example possible that ``arr[index]`` will be equivalent to ``arr.oindex`` at some point in the future. Since such a change will take years, it seems unnecessary to make specific decisions now.
6. Proposed changes to vanilla indexing could be postponed indefinitely or not made at all, in order not to break or force fixing of existing code bases.
7. Possibly the combination of ``vindex`` with boolean indexing could be rethought, or not allowed at all, for simplicity.
Necessary changes to NumPy
==========================
Implement ``arr.oindex`` and ``arr.vindex`` objects to allow these indexing operations, and create warnings for (and eventually deprecate) ambiguous direct indexing operations on arrays.
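As a rough sketch only (the real implementation would live in NumPy's indexing machinery, not in Python), such an attribute could behave like a small helper object. The hypothetical class below emulates ``oindex`` purely for 1-D integer-array indices, using ``np.ix_``::

import numpy as np

class OuterIndexer(object):
    # Hypothetical helper emulating ``arr.oindex`` for the simple case where
    # every index is a 1-D integer sequence (no slices, scalars or booleans).
    def __init__(self, arr):
        self.arr = arr

    def __getitem__(self, key):
        key = key if isinstance(key, tuple) else (key,)
        return self.arr[np.ix_(*key)]

    def __setitem__(self, key, value):
        key = key if isinstance(key, tuple) else (key,)
        self.arr[np.ix_(*key)] = value

arr = np.arange(30).reshape(5, 6)
OuterIndexer(arr)[[0, 2], [1, 3, 4]].shape  # (2, 3), the outer/orthogonal result

A real ``vindex`` helper would be analogous, but it would broadcast the index arrays together instead of building an open mesh.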
Alternative Names
=================
Possible names suggested (more suggestions will be added).
============== ======== =======
**Orthogonal** oindex   oix
**Vectorized** vindex   fix
**Legacy**     l/findex
============== ======== =======
Examples
========
Since the various kinds of indexing are hard to grasp in many cases, these examples hopefully give some more insight. Note that they are all in terms of shape: all original dimensions have length 5 or more, while the dimensions inserted by advanced indexing are shorter. (Note that ``...`` or ``Ellipsis`` mostly inserts as many slices as needed to index the full array.) These examples may be hard to grasp without working knowledge of advanced indexing as of NumPy 1.9.
Example array::
>>> arr = np.ones((5, 6, 7, 8))
Legacy fancy indexing
---------------------
Single index is transposed (this is the same for all indexing types)::
>>> arr[[0], ...].shape
(1, 6, 7, 8)
>>> arr[:, [0], ...].shape
(5, 1, 7, 8)
Multiple indices are transposed *if* consecutive::
>>> arr[:, [0], [0], :].shape  # future error
(5, 1, 8)
>>> arr[:, [0], :, [0]].shape  # future error
(1, 5, 7)
It is important to note that a scalar *is* an integer array index in this sense (and gets broadcast with the other advanced index)::
>>> arr[:, [0], 0, :].shape  # future error (scalar is "fancy")
(5, 1, 8)
>>> arr[:, [0], :, 0].shape  # future error (scalar is "fancy")
(1, 5, 7)
A single boolean index can act on multiple dimensions (even the whole array). Its shape has to match the dimensions it indexes (as of 1.10 a mismatch only gives a deprecation warning). The boolean index is otherwise identical to (multiple consecutive) integer array indices::
>>> # Create boolean index with one True value for the last two dimensions:
>>> bindx = np.zeros((7, 8), dtype=np.bool_)
>>> bindx[0, 0] = True
>>> arr[:, 0, bindx].shape
(5, 1)
>>> arr[0, :, bindx].shape
(1, 6)
The combination with anything that is not a scalar is confusing, e.g.::
>>> arr[[0], :, bindx].shape  # bindx result broadcasts with [0]
(1, 6)
>>> arr[:, [0, 1], bindx]  # IndexError
Outer indexing
--------------
Multiple indices are "orthogonal" and their result axes are inserted at the same place (they are not broadcasted)::
>>> arr.oindex[:, [0], [0, 1], :].shape
(5, 1, 2, 8)
>>> arr.oindex[:, [0], :, [0, 1]].shape
(5, 1, 7, 2)
>>> arr.oindex[:, [0], 0, :].shape
(5, 1, 8)
>>> arr.oindex[:, [0], :, 0].shape
(5, 1, 7)
The result axes of boolean indices are always inserted where the index is::
>>> # Create boolean index with one True value for the last two dimensions:
>>> bindx = np.zeros((7, 8), dtype=np.bool_)
>>> bindx[0, 0] = True
>>> arr.oindex[:, 0, bindx].shape
(5, 1)
>>> arr.oindex[0, :, bindx].shape
(6, 1)
Nothing changes in the presence of other advanced indices::
>>> arr.oindex[[0], :, bindx].shape
(1, 6, 1)
>>> arr.oindex[:, [0, 1], bindx].shape
(5, 2, 1)
Vectorized/inner indexing
-------------------------
Multiple indices are broadcast and iterated as one, as in fancy indexing, but the new axes are always inserted at the front::
>>> arr.vindex[:, [0], [0, 1], :].shape
(2, 5, 8)
>>> arr.vindex[:, [0], :, [0, 1]].shape
(2, 5, 7)
>>> arr.vindex[:, [0], 0, :].shape
(1, 5, 8)
>>> arr.vindex[:, [0], :, 0].shape
(1, 5, 7)
The result axes of boolean indices are always inserted where the index is, exactly as in ``oindex``, given how specific they are to the axes they operate on::
>>> # Create boolean index with one True value for the last two dimensions:
>>> bindx = np.zeros((7, 8), dtype=np.bool_)
>>> bindx[0, 0] = True
>>> arr.vindex[:, 0, bindx].shape
(5, 1)
>>> arr.vindex[0, :, bindx].shape
(6, 1)
But other advanced indices are again transposed to the front::
>>> arr.vindex[[0], :, bindx].shape
(1, 6, 1)
>>> arr.vindex[:, [0, 1], bindx].shape
(2, 5, 1)
Related Questions
=================
There exists a further indexing, or indexing-like, method: the inverse of a command such as ``np.argmin(arr, axis=axis)``, i.e. picking specific elements *along* an axis given an array of (at least typically) the same size.
Doing such a thing with the indexing notation is not quite straightforward, since the axis along which to pick elements has to be supplied. One plausible solution would be to create a function (calling it ``pick`` here for simplicity)::
np.pick(arr, index_arr, axis=axis)
where ``index_arr`` has to be the same shape as ``arr`` except along ``axis``. One could imagine that this can be useful together with other indexing types, but such a function may be sufficient, and the extra information needed seems easier to pass using a function convention. Another option would be to allow an argument such as ``compress_axes=None`` (just to have some name) which maps the axes from the index array to the new array, with ``None`` signaling a new axis. Also ``keepdims`` could be added as a simple default. (Note that the use of ``axis`` is not compatible with ``np.take`` for an ``index_arr`` that is not zero or one dimensional.)
Another solution is to provide functions or features for the ``arg*`` functions to map this to the equivalent ``vindex`` indexing operation.
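As a hedged sketch (``np.pick`` does not exist; the name and signature are only the ones suggested above), such a function can be expressed as exactly the kind of broadcasted indexing operation that ``vindex`` describes::

import numpy as np

def pick(arr, index_arr, axis):
    # Hypothetical ``np.pick``: select one element along ``axis`` for every
    # position of the remaining axes, e.g. to invert ``np.argmin(arr, axis)``.
    # ``index_arr`` must have the shape of ``arr`` with ``axis`` removed
    # (``axis`` is assumed non-negative in this sketch).
    indices = list(np.ogrid[tuple(slice(n) for n in index_arr.shape)])
    indices.insert(axis, index_arr)
    return arr[tuple(indices)]

arr = np.random.random((3, 4))
assert np.array_equal(pick(arr, np.argmin(arr, axis=0), axis=0), arr.min(axis=0))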
Motivational Example
====================
Imagine having data acquisition software that stores ``D`` channels and ``N`` data points along time into an ``(N, D)`` shaped array. During data analysis, we need to fetch a pool of channels, for example to calculate a mean over them.
This data can be faked using::
>>> arr = np.random.random((100, 10))
Now one may remember indexing with an integer array and find the correct code::
>>> group = arr[:, [2, 5]]
>>> mean_value = group.mean()
However, assume that there were some specific time points (first dimension of the data) that need to be specially considered. These time points are already known and given by::
>>> interesting_times = np.array([1, 5, 8, 10], dtype=np.intp)
Now to fetch them, we may try to modify the previous code::
>>> group_at_it = arr[interesting_times, [2, 5]]
IndexError: Ambiguous index, use `.oindex` or `.vindex`
An error such as this will point the user to the indexing documentation. This should make it clear that ``oindex`` behaves more like slicing, so out of the different methods it is the obvious choice (for now, this raises a shape-mismatch error, but that error could possibly also mention ``oindex``)::
>>> group_at_it = arr.oindex[interesting_times, [2, 5]]
Now of course one could also have used ``vindex``, but it is much less obvious how to achieve the right thing::
>>> reshaped_times = interesting_times[:, np.newaxis]
>>> group_at_it = arr.vindex[reshaped_times, [2, 5]]
One may find that, for example, our data is corrupt in some places, so we need to replace these values by zero (or anything else) for those times. The first column may, for example, give the necessary information, so that changing the values becomes easy by remembering boolean indexing::
>>> bad_data = arr[:, 0] > 0.5
>>> arr[bad_data, :] = 0
Again, however, the columns may need to be handled more individually (but in groups), and the ``oindex`` attribute works well::
>>> arr.oindex[bad_data, [2, 5]] = 0
Note that it would be very hard to do this using legacy fancy indexing. The only way would be to create an integer array first::
>>> bad_data_indx = np.nonzero(bad_data)[0]
>>> bad_data_indx_reshaped = bad_data_indx[:, np.newaxis]
>>> arr[bad_data_indx_reshaped, [2, 5]] = 0
In any case, we can use ``oindex`` alone to do all of this without getting into any trouble or being confused by the whole complexity of advanced indexing.
But now some new features are added to the data acquisition: different sensors have to be used depending on the time. Let us assume we have already created an array of indices::
>>> correct_sensors = np.random.randint(10, size=(100, 2))
which lists, for each time point, the two correct sensors in an ``(N, 2)`` array.
A first try to achieve this may be ``arr[:, correct_sensors]``, but this does not do what is needed; it should quickly become clear that slicing cannot achieve the desired thing. Hopefully users will remember that there is ``vindex`` as a more powerful and flexible approach to advanced indexing. One may, if trying ``vindex`` randomly, be confused by::
>>> new_arr = arr.vindex[:, correct_sensors]
which is neither the same, nor the correct result (see the transposing rules)! This is because slicing still works the same in ``vindex``. However, reading the documentation and examples, one can hopefully quickly find the desired solution::
>>> rows = np.arange(len(arr))
>>> rows = rows[:, np.newaxis]  # make shape fit with correct_sensors
>>> new_arr = arr.vindex[rows, correct_sensors]
At this point we have left the straightforward world of ``oindex`` but can do random picking of any element from the array. Note that in the last example a method such as the one mentioned in the ``Related Questions`` section could be more straightforward. But this approach is even more flexible, since ``rows`` does not have to be a simple ``arange``, but could be ``interesting_times``::
>>> correct_sensors_at_it = correct_sensors[interesting_times, :]
>>> interesting_times_reshaped = interesting_times[:, np.newaxis]
>>> new_arr_it = arr[interesting_times_reshaped, correct_sensors_at_it]
A truly complex situation would arise if one were, for example, to pool ``L`` experiments into an array shaped ``(L, N, D)``. But with ``oindex`` this should not result in surprises. ``vindex``, being more powerful, will quite certainly create some confusion in this case, but it also covers pretty much all eventualities.