[Numpy-discussion] numpythonically getting elements with the minimum sum

Tue Jan 29 13:07:03 EST 2013

Lluís  writes:

> Sebastian Berg writes:
>> On Tue, 2013-01-29 at 14:53 +0100, Lluís wrote:
>>> Gregor Thalhammer writes:
>>> 
>>> > Am 28.1.2013 um 23:15 schrieb Lluís:
>>> 
>>> >> Hi,
>>> >> 
>>> >> I have a somewhat convoluted N-dimensional array that contains information about a
>>> >> set of experiments.
>>> >> 
>>> >> The last dimension has as many entries as iterations in the experiment (an
>>> >> iterative application), and the penultimate dimension has as many entries as
>>> >> times I have run that experiment; the rest of the dimensions describe the features
>>> >> of the experiment:
>>> >> 
>>> >> data.shape == (... indefinite amount of dimensions ..., NUM_RUNS, NUM_ITERATIONS)
>>> >> 
>>> >> So, what I want is to get the data for the best run of each experiment:
>>> >> 
>>> >> best.shape == (... indefinite amount of dimensions ..., NUM_ITERATIONS)
>>> >> 
>>> >> by selecting, for each experiment, the run with the lowest total time (sum of
>>> >> the time of all iterations for that experiment).
>>> >> 
>>> >> 
>>> >> So far I've got the trivial part, but not the final indexing into "data":
>>> >> 
>>> >> dsum = data.sum(axis = -1)
>>> >> dmin = dsum.min(axis = -1)
>>> >> best = data[???]
>>> >> 
>>> >> 
>>> >> I'm sure there must be some numpythonic and generic way to get what I want, but
>>> >> fancy indexing is beating me here :)
>>> 
>>> > Did you have a look at the argmin function? It delivers the indices of the minimum values along an axis. Untested guess:
>>> 
>>> > dmin_idx = argmin(dsum, axis = -1)
>>> > best = data[..., dmin_idx, :]
>>> 
>>> Ah, sorry, my example is incorrect. I was actually using 'argmin', but indexing
>>> with it does not exactly work as I expected:
>>> 
>>> >>> d1.shape
>>> (2, 5, 10)
>>> >>> dsum = d1.sum(axis = -1)
>>> >>> dmin = dsum.argmin(axis = -1)
>>> >>> dmin.shape
>>> (2,)
>>> >>> d1_best = d1[...,dmin,:]

>> You need to use fancy indexing. Something like:
>>>>> d1_best = d1[np.arange(2), dmin,:]

>> Because the Ellipsis takes everything from the axis, while you want to
>> pick from multiple axes at the same time. That can be achieved with
>> fancy indexing (indexing with arrays). From another perspective, you
>> want to get rid of two axes in favor of a new one, but a slice/Ellipsis
>> always preserves the axis it works on.
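
To make the difference concrete, here is a minimal sketch with made-up data.
The array contents and the random seed are assumptions for illustration; only
the shapes follow the example quoted above.

```python
import numpy as np

# Hypothetical stand-in for d1: 2 experiments, 5 runs, 10 iterations.
rng = np.random.default_rng(0)
d1 = rng.random((2, 5, 10))

dsum = d1.sum(axis=-1)        # total time per run, shape (2, 5)
dmin = dsum.argmin(axis=-1)   # index of the best run per experiment, shape (2,)

# The Ellipsis keeps the whole first axis, so every entry of dmin is applied
# to every experiment: outer[i, j] is d1[i, dmin[j]].
outer = d1[..., dmin, :]

# Two index arrays are paired element by element: (0, dmin[0]), (1, dmin[1]).
paired = d1[np.arange(2), dmin, :]

print(outer.shape)   # (2, 2, 10)
print(paired.shape)  # (2, 10)
```

The paired result is the "diagonal" of the Ellipsis result, which is why the
latter carries one axis too many.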

> Nice, thanks. That works for this specific example, but I couldn't get it to
> work with "d1.shape == (1, 2, 16, 5, 10)" (thus "dmin.shape == (1, 2, 16)"):

>>>> def get_best_run (data, field):
>     ...     """Returns the best run."""
>     ...     data = data.view(np.ndarray)
>     ...     assert data.ndim >= 2
>     ...     dsum = data[field].sum(axis=-1)
>     ...     dmin = dsum.argmin(axis=-1)
>     ...     idxs  = [ np.arange(dlen) for dlen in data.shape[:-2] ]
>     ...     idxs += [ dmin ]
>     ...     idxs += [ slice(None) ]
>     ...     return data[tuple(idxs)]
>>>> d1.shape   
>     (2, 5, 10)
>>>> get_best_run(d1, "time").shape
>     (2, 10)
>>>> d2.shape
>     (1, 2, 16, 5, 10)
>>>> get_best_run(d2, "time")
>     Traceback (most recent call last):
>       ...
>       File "./plot-user.py", line 89, in get_best_run
>         res = data.view(np.ndarray)[tuple(idxs)]
>     ValueError: shape mismatch: objects cannot be broadcast to a single shape

> After reading the "Advanced indexing" section, my understanding is that the
> elements in "idxs" are not broadcastable to the same shape, but I'm not sure how
> I should build them, or what specific shape they should be broadcastable to.
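
The mismatch seems to come from the ranges themselves: np.arange(1),
np.arange(2) and np.arange(16) have shapes (1,), (2,) and (16,), which cannot
be broadcast against each other or against dmin.shape == (1, 2, 16). One
possible way out, sketched here as an assumption rather than something taken
from this thread, is to give each range its own axis with np.ix_, so that all
index arrays broadcast to dmin.shape. The name best_run_openmesh is
hypothetical, and the sketch uses a plain float array instead of a record
array with a "time" field.

```python
import numpy as np

def best_run_openmesh(data):
    """Pick, per experiment, the run (axis -2) with the lowest sum over axis -1."""
    dmin = data.sum(axis=-1).argmin(axis=-1)   # shape == data.shape[:-2]
    # np.ix_ reshapes the ranges to (n0, 1, ...), (1, n1, ...), ... so that
    # they broadcast against each other and against dmin.
    lead = np.ix_(*[np.arange(n) for n in data.shape[:-2]])
    # All index arrays broadcast to dmin.shape; the slice keeps the iterations.
    return data[lead + (dmin, slice(None))]

# Made-up data with the shape that triggered the error above.
rng = np.random.default_rng(0)
d2 = rng.random((1, 2, 16, 5, 10))
best = best_run_openmesh(d2)
print(best.shape)  # (1, 2, 16, 10)
```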

BTW, here's an equivalent get_best_run that seems to work in all cases, although I
would prefer to avoid control code that fills in the result manually:

    >>> import numpy as np
    >>> def get_best_run (data, field):
    ...     """Returns the best run."""
    ...     data = data.view(np.ndarray)
    ...     assert data.ndim >= 2
    ...     # total time of each run, then the index of the fastest run
    ...     dsum = data[field].sum(axis=-1)
    ...     dmin = dsum.argmin(axis=-1)
    ...  
    ...     # the result has the shape of "data" without the runs axis
    ...     res_shape = list(data.shape)
    ...     del res_shape[-2]
    ...     res = np.empty(res_shape, dtype = data.dtype)
    ...  
    ...     # visit every experiment and copy its best run into the result
    ...     idxs = np.unravel_index(np.arange(dmin.size), dmin.shape)
    ...     for idx in zip(*idxs):
    ...         imin = dmin[idx]
    ...         res[idx] = data[tuple(list(idx) + [imin])]
    ...  
    ...     return res
    >>> d1.shape
    (2, 5, 10)
    >>> get_best_run(d1, "time").shape
    (2, 10)
    >>> d2.shape
    (1, 2, 16, 5, 10)
    >>> get_best_run(d2, "time").shape
    (1, 2, 16, 10)
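
For what it's worth, the explicit loop can probably be avoided altogether. The
following is only a sketch under a few assumptions: it relies on
np.take_along_axis, which exists in NumPy 1.15 and later (newer than this
thread), it works on a plain float array rather than a record array with a
"time" field, and the name best_run_take is made up for illustration.

```python
import numpy as np

def best_run_take(data):
    """Select the run with the lowest total along axis -2, without a Python loop."""
    dmin = data.sum(axis=-1).argmin(axis=-1)   # shape == data.shape[:-2]
    # Add two singleton axes so dmin lines up with the (runs, iterations) axes;
    # the singleton iterations axis broadcasts over every iteration.
    idx = dmin[..., np.newaxis, np.newaxis]
    picked = np.take_along_axis(data, idx, axis=-2)
    return picked.squeeze(axis=-2)             # drop the length-1 runs axis

# Made-up data with the same shape as d2.
rng = np.random.default_rng(0)
d2 = rng.random((1, 2, 16, 5, 10))
print(best_run_take(d2).shape)  # (1, 2, 16, 10)
```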

Thanks,
  Lluis

>>> >>> d1_best.shape
>>> (2, 2, 10)
>>> 
>>> 
>>> Assuming the 1st dimension is the test, the 2nd the run and the 3rd the
>>> iterations, using this previous code with some example values:
>>> 
>>> >>> dmin
>>> [4 3]
>>> >>> d1_best
>>> [[[ ... contents of d1[0,4,:] ...]
>>> [ ... contents of d1[0,3,:] ...]]
>>> [[ ... contents of d1[1,4,:] ...]
>>> [ ... contents of d1[1,3,:] ...]]]
>>> 
>>> 
>>> While I actually want this:
>>> 
>>> [[ ... contents of d1[0,4,:] ...]
>>> [ ... contents of d1[1,3,:] ...]]

-- 
 "And it's much the same thing with knowledge, for whenever you learn
 something new, the whole world becomes that much richer."
 -- The Princess of Pure Reason, as told by Norton Juster in The Phantom
 Tollbooth