It wouldn't hurt to have this function, but my intuition is that its use will be minimal. If you are already working with sorted arrays, you already have a flop cost on that order of magnitude, and the optimized merge saves you a factor two at the very most. Using numpy means you are sacrificing factors of two and beyond relative to pure C left right and center anyway, so if this kind of thing matters to you, you probably wont be working in numpy in the first place.

That said, I share your interest in overhauling arraysetops. I see many opportunities for expanding its functionality. There is a question that amounts to 'how do I do group-by in numpy' on stackoverflow almost every week. That would have my top priority, but also things like extending np.unique to things like graph edges, or other more complex input, is very often useful to me.

Ive written up a draft a while ago which accomplishes all of the above and more. It reimplements functions like np.unique around a common Index object. This index object encapsulates the precomputation (sorting) required for efficient set-ops on different datatypes, and provides a common interface to obtain the kind of information you are talking about (which is used extensively internally in the implementation of group_by, for instance).

ie, this functionality allows you to write neat things like group_by(randint(0,9,(100,2))).median(rand(100))

But I have the feeling much more could be done in this direction, and I feel this draft could really use a bit of back and forth. If we are going to completely rewrite arraysetops, we might as well do it right.

That said, I share your interest in overhauling arraysetops. I see many opportunities for expanding its functionality. There is a question that amounts to 'how do I do group-by in numpy' on stackoverflow almost every week. That would have my top priority, but also things like extending np.unique to things like graph edges, or other more complex input, is very often useful to me.

Ive written up a draft a while ago which accomplishes all of the above and more. It reimplements functions like np.unique around a common Index object. This index object encapsulates the precomputation (sorting) required for efficient set-ops on different datatypes, and provides a common interface to obtain the kind of information you are talking about (which is used extensively internally in the implementation of group_by, for instance).

ie, this functionality allows you to write neat things like group_by(randint(0,9,(100,2))).median(rand(100))

But I have the feeling much more could be done in this direction, and I feel this draft could really use a bit of back and forth. If we are going to completely rewrite arraysetops, we might as well do it right.

On Wed, Aug 27, 2014 at 7:02 PM, Jaime Fernández del Río <jaime.frio@gmail.com> wrote:

A request was open in github to add a `merge` function to numpy that would merge two sorted 1d arrays into a single sorted 1d array. I have been playing around with that idea for a while, and have a branch in my numpy fork that adds a `mergesorted` function to `numpy.lib`:I drew inspiration from C++ STL algorithms, and merged into a single function what merge, set_union, set_intersection, set_difference and set_symmetric_difference do there.My first thought when implementing this was to not make it a public function, but use it under the hood to speed-up some of the functions of `arraysetops.py`, which are now merging two already sorted functions by doing `np.sort(np.concatenate((a, b)))`. I would need to revisit my testing, but the speed-ups weren't that great.One other thing I saw value in for some of the `arraysetops.py` functions, but couldn't fully figure out, was in providing extra output aside from the merged arrays, either in the form of indices, or of boolean masks, indicating which items of the original arrays made it into the merged one, and/or where did they end up in it.Since there is at least one other person out there that likes it, is there any more interest in such a function? If yes, any comments on what the proper interface for extra output should be? Although perhaps the best is to leave that out for starters and see what use people make of it, if any.Jaime--

(\__/)

( O.o)

( > <) Este es Conejo. Copia a Conejo en tu firma y ayúdale en sus planes de dominación mundial.

_______________________________________________

NumPy-Discussion mailing list

NumPy-Discussion@scipy.org

http://mail.scipy.org/mailman/listinfo/numpy-discussion