[Numpy-discussion] Does a `mergesorted` function make sense?

Jaime Fernández del Río jaime.frio at gmail.com
Wed Aug 27 14:29:27 EDT 2014


Hi Eelco,

I took a deeper look into your code a couple of weeks back. I don't think I
have fully grasped what it allows completely, but I agree that some form of
what you have there is highly desirable. Along the same lines, for sometime
I have been thinking that the right place for a `groupby` in numpy is as a
method of ufuncs, so that `np.add.groupby(arr, groups)` would do a
multidimensional version of `np.bincount(groups, weights=arr)`. You would
then need a more powerful version of `np.unique` to produce the `groups`,
but that is something that Joe Kington's old PR was very close to
achieving, that should probably be resurrected as well. But yes, there
seems to be material for a NEP here, and some guidance from one of the
numpy devs would be helpful in getting this somewhere.

Jaime


On Wed, Aug 27, 2014 at 10:35 AM, Eelco Hoogendoorn <
hoogendoorn.eelco at gmail.com> wrote:

> It wouldn't hurt to have this function, but my intuition is that its use
> will be minimal. If you are already working with sorted arrays, you already
> have a flop cost on that order of magnitude, and the optimized merge saves
> you a factor two at the very most. Using numpy means you are sacrificing
> factors of two and beyond relative to pure C left right and center anyway,
> so if this kind of thing matters to you, you probably wont be working in
> numpy in the first place.
>
> That said, I share your interest in overhauling arraysetops. I see many
> opportunities for expanding its functionality. There is a question that
> amounts to 'how do I do group-by in numpy' on stackoverflow almost every
> week. That would have my top priority, but also things like extending
> np.unique to things like graph edges, or other more complex input, is very
> often useful to me.
>
> Ive written up a draft <http://pastebin.com/c5WLWPbp>a while ago which
> accomplishes all of the above and more. It reimplements functions like
> np.unique around a common Index object. This index object encapsulates the
> precomputation (sorting) required for efficient set-ops on different
> datatypes, and provides a common interface to obtain the kind of
> information you are talking about (which is used extensively internally in
> the implementation of group_by, for instance).
>
> ie, this functionality allows you to write neat things like
> group_by(randint(0,9,(100,2))).median(rand(100))
>
> But I have the feeling much more could be done in this direction, and I
> feel this draft could really use a bit of back and forth. If we are going
> to completely rewrite arraysetops, we might as well do it right.
>
>
> On Wed, Aug 27, 2014 at 7:02 PM, Jaime Fernández del Río <
> jaime.frio at gmail.com> wrote:
>
>> A request was open in github to add a `merge` function to numpy that
>> would merge two sorted 1d arrays into a single sorted 1d array. I have been
>> playing around with that idea for a while, and have a branch in my numpy
>> fork that adds a `mergesorted` function to `numpy.lib`:
>>
>>
>> https://github.com/jaimefrio/numpy/commit/ce5d480afecc989a36e5d2bf4ea1d1ba58a83b0a
>>
>> I drew inspiration from C++ STL algorithms, and merged into a single
>> function what merge, set_union, set_intersection, set_difference and
>> set_symmetric_difference do there.
>>
>> My first thought when implementing this was to not make it a public
>> function, but use it under the hood to speed-up some of the functions of
>> `arraysetops.py`, which are now merging two already sorted functions by
>> doing `np.sort(np.concatenate((a, b)))`. I would need to revisit my
>> testing, but the speed-ups weren't that great.
>>
>> One other thing I saw value in for some of the `arraysetops.py`
>> functions, but couldn't fully figure out, was in providing extra output
>> aside from the merged arrays, either in the form of indices, or of boolean
>> masks, indicating which items of the original arrays made it into the
>> merged one, and/or where did they end up in it.
>>
>> Since there is at least one other person out there that likes it, is
>> there any more interest in such a function? If yes, any comments on what
>> the proper interface for extra output should be? Although perhaps the best
>> is to leave that out for starters and see what use people make of it, if
>> any.
>>
>> Jaime
>>
>> --
>> (\__/)
>> ( O.o)
>> ( > <) Este es Conejo. Copia a Conejo en tu firma y ayúdale en sus planes
>> de dominación mundial.
>>
>> _______________________________________________
>> NumPy-Discussion mailing list
>> NumPy-Discussion at scipy.org
>> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>>
>>
>
> _______________________________________________
> NumPy-Discussion mailing list
> NumPy-Discussion at scipy.org
> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>
>


-- 
(\__/)
( O.o)
( > <) Este es Conejo. Copia a Conejo en tu firma y ayúdale en sus planes
de dominación mundial.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/numpy-discussion/attachments/20140827/d42fbec6/attachment.html>


More information about the NumPy-Discussion mailing list