[Numpy-discussion] Does a `mergesorted` function make sense?

Jaime Fernández del Río jaime.frio at gmail.com
Wed Aug 27 17:32:37 EDT 2014


Yes, I was aware of that. But the point would be to provide true
vectorization on those operations.

The way I see it, numpy may not have to have a GroupBy implementation, but
it should at least enable implementing one that is fast and efficient over
any axis.


On Wed, Aug 27, 2014 at 12:38 PM, Eelco Hoogendoorn <
hoogendoorn.eelco at gmail.com> wrote:

> i.e, if the grouped axis is small but the other axes are not, you could
> write this, which avoids the python loop over the long axis that
> np.vectorize would otherwise perform.
>
> import numpy as np
> from grouping import group_by
> keys = np.random.randint(0,4,10)
> values = np.random.rand(10,2000)
> for k,g in zip(*group_by(keys)(values)):
>     print k, g.mean(0)
>
>
>
>
> On Wed, Aug 27, 2014 at 9:29 PM, Eelco Hoogendoorn <
> hoogendoorn.eelco at gmail.com> wrote:
>
>> f.i., this works as expected as well (100 keys of 1d int arrays and 100
>> values of 1d float arrays):
>>
>> group_by(randint(0,4,(100,2))).mean(rand(100,2))
>>
>>
>> On Wed, Aug 27, 2014 at 9:27 PM, Eelco Hoogendoorn <
>> hoogendoorn.eelco at gmail.com> wrote:
>>
>>> If I understand you correctly, the current implementation supports these
>>> operations. All reductions over groups (except for median) are performed
>>> through the corresponding ufunc (see GroupBy.reduce). This works on
>>> multidimensional arrays as well, although this broadcasting over the
>>> non-grouping axes is accomplished using np.vectorize. Actual vectorization
>>> only happens over the axis being grouped over, but this is usually a long
>>> axis. If it isn't, it is more efficient to perform a reduction by means of
>>> splitting the array by its groups first, and then map the iterable of
>>> groups over some reduction operation (as noted in the docstring of
>>> GroupBy.reduce).
>>>
>>>
>>> On Wed, Aug 27, 2014 at 8:29 PM, Jaime Fernández del Río <
>>> jaime.frio at gmail.com> wrote:
>>>
>>>> Hi Eelco,
>>>>
>>>> I took a deeper look into your code a couple of weeks back. I don't
>>>> think I have fully grasped what it allows completely, but I agree that some
>>>> form of what you have there is highly desirable. Along the same lines, for
>>>> sometime I have been thinking that the right place for a `groupby` in numpy
>>>> is as a method of ufuncs, so that `np.add.groupby(arr, groups)` would do a
>>>> multidimensional version of `np.bincount(groups, weights=arr)`. You would
>>>> then need a more powerful version of `np.unique` to produce the `groups`,
>>>> but that is something that Joe Kington's old PR was very close to
>>>> achieving, that should probably be resurrected as well. But yes, there
>>>> seems to be material for a NEP here, and some guidance from one of the
>>>> numpy devs would be helpful in getting this somewhere.
>>>>
>>>> Jaime
>>>>
>>>>
>>>> On Wed, Aug 27, 2014 at 10:35 AM, Eelco Hoogendoorn <
>>>> hoogendoorn.eelco at gmail.com> wrote:
>>>>
>>>>> It wouldn't hurt to have this function, but my intuition is that its
>>>>> use will be minimal. If you are already working with sorted arrays, you
>>>>> already have a flop cost on that order of magnitude, and the optimized
>>>>> merge saves you a factor two at the very most. Using numpy means you are
>>>>> sacrificing factors of two and beyond relative to pure C left right and
>>>>> center anyway, so if this kind of thing matters to you, you probably wont
>>>>> be working in numpy in the first place.
>>>>>
>>>>> That said, I share your interest in overhauling arraysetops. I see
>>>>> many opportunities for expanding its functionality. There is a question
>>>>> that amounts to 'how do I do group-by in numpy' on stackoverflow almost
>>>>> every week. That would have my top priority, but also things like extending
>>>>> np.unique to things like graph edges, or other more complex input, is very
>>>>> often useful to me.
>>>>>
>>>>> Ive written up a draft <http://pastebin.com/c5WLWPbp>a while ago
>>>>> which accomplishes all of the above and more. It reimplements functions
>>>>> like np.unique around a common Index object. This index object encapsulates
>>>>> the precomputation (sorting) required for efficient set-ops on different
>>>>> datatypes, and provides a common interface to obtain the kind of
>>>>> information you are talking about (which is used extensively internally in
>>>>> the implementation of group_by, for instance).
>>>>>
>>>>> ie, this functionality allows you to write neat things like
>>>>> group_by(randint(0,9,(100,2))).median(rand(100))
>>>>>
>>>>> But I have the feeling much more could be done in this direction, and
>>>>> I feel this draft could really use a bit of back and forth. If we are going
>>>>> to completely rewrite arraysetops, we might as well do it right.
>>>>>
>>>>>
>>>>> On Wed, Aug 27, 2014 at 7:02 PM, Jaime Fernández del Río <
>>>>> jaime.frio at gmail.com> wrote:
>>>>>
>>>>>> A request was open in github to add a `merge` function to numpy that
>>>>>> would merge two sorted 1d arrays into a single sorted 1d array. I have been
>>>>>> playing around with that idea for a while, and have a branch in my numpy
>>>>>> fork that adds a `mergesorted` function to `numpy.lib`:
>>>>>>
>>>>>>
>>>>>> https://github.com/jaimefrio/numpy/commit/ce5d480afecc989a36e5d2bf4ea1d1ba58a83b0a
>>>>>>
>>>>>> I drew inspiration from C++ STL algorithms, and merged into a single
>>>>>> function what merge, set_union, set_intersection, set_difference and
>>>>>> set_symmetric_difference do there.
>>>>>>
>>>>>> My first thought when implementing this was to not make it a public
>>>>>> function, but use it under the hood to speed-up some of the functions of
>>>>>> `arraysetops.py`, which are now merging two already sorted functions by
>>>>>> doing `np.sort(np.concatenate((a, b)))`. I would need to revisit my
>>>>>> testing, but the speed-ups weren't that great.
>>>>>>
>>>>>> One other thing I saw value in for some of the `arraysetops.py`
>>>>>> functions, but couldn't fully figure out, was in providing extra output
>>>>>> aside from the merged arrays, either in the form of indices, or of boolean
>>>>>> masks, indicating which items of the original arrays made it into the
>>>>>> merged one, and/or where did they end up in it.
>>>>>>
>>>>>> Since there is at least one other person out there that likes it, is
>>>>>> there any more interest in such a function? If yes, any comments on what
>>>>>> the proper interface for extra output should be? Although perhaps the best
>>>>>> is to leave that out for starters and see what use people make of it, if
>>>>>> any.
>>>>>>
>>>>>> Jaime
>>>>>>
>>>>>> --
>>>>>> (\__/)
>>>>>> ( O.o)
>>>>>> ( > <) Este es Conejo. Copia a Conejo en tu firma y ayúdale en sus
>>>>>> planes de dominación mundial.
>>>>>>
>>>>>> _______________________________________________
>>>>>> NumPy-Discussion mailing list
>>>>>> NumPy-Discussion at scipy.org
>>>>>> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>>>>>>
>>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> NumPy-Discussion mailing list
>>>>> NumPy-Discussion at scipy.org
>>>>> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> (\__/)
>>>> ( O.o)
>>>> ( > <) Este es Conejo. Copia a Conejo en tu firma y ayúdale en sus
>>>> planes de dominación mundial.
>>>>
>>>> _______________________________________________
>>>> NumPy-Discussion mailing list
>>>> NumPy-Discussion at scipy.org
>>>> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>>>>
>>>>
>>>
>>
>
> _______________________________________________
> NumPy-Discussion mailing list
> NumPy-Discussion at scipy.org
> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>
>


-- 
(\__/)
( O.o)
( > <) Este es Conejo. Copia a Conejo en tu firma y ayúdale en sus planes
de dominación mundial.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/numpy-discussion/attachments/20140827/cc6cdc49/attachment.html>


More information about the NumPy-Discussion mailing list