[Numpy-discussion] Does a `mergesorted` function make sense?

Wed Aug 27 15:29:49 EDT 2014

f.i., this works as expected as well (100 keys of 1d int arrays and 100
values of 1d float arrays):

group_by(randint(0,4,(100,2))).mean(rand(100,2))

On Wed, Aug 27, 2014 at 9:27 PM, Eelco Hoogendoorn <
hoogendoorn.eelco at gmail.com> wrote:

> If I understand you correctly, the current implementation supports these
> operations. All reductions over groups (except for median) are performed
> through the corresponding ufunc (see GroupBy.reduce). This works on
> multidimensional arrays as well, although this broadcasting over the
> non-grouping axes is accomplished using np.vectorize. Actual vectorization
> only happens over the axis being grouped over, but this is usually a long
> axis. If it isn't, it is more efficient to perform a reduction by means of
> splitting the array by its groups first, and then map the iterable of
> groups over some reduction operation (as noted in the docstring of
> GroupBy.reduce).
>
>
> On Wed, Aug 27, 2014 at 8:29 PM, Jaime Fernández del Río <
> jaime.frio at gmail.com> wrote:
>
>> Hi Eelco,
>>
>> I took a deeper look into your code a couple of weeks back. I don't think
>> I have fully grasped what it allows completely, but I agree that some form
>> of what you have there is highly desirable. Along the same lines, for
>> sometime I have been thinking that the right place for a `groupby` in numpy
>> is as a method of ufuncs, so that `np.add.groupby(arr, groups)` would do a
>> multidimensional version of `np.bincount(groups, weights=arr)`. You would
>> then need a more powerful version of `np.unique` to produce the `groups`,
>> but that is something that Joe Kington's old PR was very close to
>> achieving, that should probably be resurrected as well. But yes, there
>> seems to be material for a NEP here, and some guidance from one of the
>> numpy devs would be helpful in getting this somewhere.
>>
>> Jaime
>>
>>
>> On Wed, Aug 27, 2014 at 10:35 AM, Eelco Hoogendoorn <
>> hoogendoorn.eelco at gmail.com> wrote:
>>
>>> It wouldn't hurt to have this function, but my intuition is that its use
>>> will be minimal. If you are already working with sorted arrays, you already
>>> have a flop cost on that order of magnitude, and the optimized merge saves
>>> you a factor two at the very most. Using numpy means you are sacrificing
>>> factors of two and beyond relative to pure C left right and center anyway,
>>> so if this kind of thing matters to you, you probably wont be working in
>>> numpy in the first place.
>>>
>>> That said, I share your interest in overhauling arraysetops. I see many
>>> opportunities for expanding its functionality. There is a question that
>>> amounts to 'how do I do group-by in numpy' on stackoverflow almost every
>>> week. That would have my top priority, but also things like extending
>>> np.unique to things like graph edges, or other more complex input, is very
>>> often useful to me.
>>>
>>> Ive written up a draft <http://pastebin.com/c5WLWPbp>a while ago which
>>> accomplishes all of the above and more. It reimplements functions like
>>> np.unique around a common Index object. This index object encapsulates the
>>> precomputation (sorting) required for efficient set-ops on different
>>> datatypes, and provides a common interface to obtain the kind of
>>> information you are talking about (which is used extensively internally in
>>> the implementation of group_by, for instance).
>>>
>>> ie, this functionality allows you to write neat things like
>>> group_by(randint(0,9,(100,2))).median(rand(100))
>>>
>>> But I have the feeling much more could be done in this direction, and I
>>> feel this draft could really use a bit of back and forth. If we are going
>>> to completely rewrite arraysetops, we might as well do it right.
>>>
>>>
>>> On Wed, Aug 27, 2014 at 7:02 PM, Jaime Fernández del Río <
>>> jaime.frio at gmail.com> wrote:
>>>
>>>> A request was open in github to add a `merge` function to numpy that
>>>> would merge two sorted 1d arrays into a single sorted 1d array. I have been
>>>> playing around with that idea for a while, and have a branch in my numpy
>>>> fork that adds a `mergesorted` function to `numpy.lib`:
>>>>
>>>>
>>>> https://github.com/jaimefrio/numpy/commit/ce5d480afecc989a36e5d2bf4ea1d1ba58a83b0a
>>>>
>>>> I drew inspiration from C++ STL algorithms, and merged into a single
>>>> function what merge, set_union, set_intersection, set_difference and
>>>> set_symmetric_difference do there.
>>>>
>>>> My first thought when implementing this was to not make it a public
>>>> function, but use it under the hood to speed-up some of the functions of
>>>> `arraysetops.py`, which are now merging two already sorted functions by
>>>> doing `np.sort(np.concatenate((a, b)))`. I would need to revisit my
>>>> testing, but the speed-ups weren't that great.
>>>>
>>>> One other thing I saw value in for some of the `arraysetops.py`
>>>> functions, but couldn't fully figure out, was in providing extra output
>>>> aside from the merged arrays, either in the form of indices, or of boolean
>>>> masks, indicating which items of the original arrays made it into the
>>>> merged one, and/or where did they end up in it.
>>>>
>>>> Since there is at least one other person out there that likes it, is
>>>> there any more interest in such a function? If yes, any comments on what
>>>> the proper interface for extra output should be? Although perhaps the best
>>>> is to leave that out for starters and see what use people make of it, if
>>>> any.
>>>>
>>>> Jaime
>>>>
>>>> --
>>>> (\__/)
>>>> ( O.o)
>>>> ( > <) Este es Conejo. Copia a Conejo en tu firma y ayúdale en sus
>>>> planes de dominación mundial.
>>>>
>>>> _______________________________________________
>>>> NumPy-Discussion mailing list
>>>> NumPy-Discussion at scipy.org
>>>> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>>>>
>>>>
>>>
>>> _______________________________________________
>>> NumPy-Discussion mailing list
>>> NumPy-Discussion at scipy.org
>>> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>>>
>>>
>>
>>
>> --
>> (\__/)
>> ( O.o)
>> ( > <) Este es Conejo. Copia a Conejo en tu firma y ayúdale en sus planes
>> de dominación mundial.
>>
>> _______________________________________________
>> NumPy-Discussion mailing list
>> NumPy-Discussion at scipy.org
>> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>>
>>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/numpy-discussion/attachments/20140827/70d7cdfb/attachment.html>