Re: [Numpy-discussion] Does a `mergesorted` function make sense?

27 Aug 2014


      If I understand you correctly, the current implementation supports these
operations. All reductions over groups (except for median) are performed
through the corresponding ufunc (see GroupBy.reduce). This works on
multidimensional arrays as well, although this broadcasting over the
non-grouping axes is accomplished using np.vectorize. Actual vectorization
only happens over the axis being grouped over, but this is usually a long
axis. If it isn't, it is more efficient to perform a reduction by means of
splitting the array by its groups first, and then map the iterable of
groups over some reduction operation (as noted in the docstring of
GroupBy.reduce).


On Wed, Aug 27, 2014 at 8:29 PM, Jaime Fernández del Río <
jaime.frio@gmail.com> wrote:
...
Hi Eelco,
I took a deeper look into your code a couple of weeks back. I don't think
I have fully grasped what it allows completely, but I agree that some form
of what you have there is highly desirable. Along the same lines, for
sometime I have been thinking that the right place for a `groupby` in numpy
is as a method of ufuncs, so that `np.add.groupby(arr, groups)` would do a
multidimensional version of `np.bincount(groups, weights=arr)`. You would
then need a more powerful version of `np.unique` to produce the `groups`,
but that is something that Joe Kington's old PR was very close to
achieving, that should probably be resurrected as well. But yes, there
seems to be material for a NEP here, and some guidance from one of the
numpy devs would be helpful in getting this somewhere.
Jaime
On Wed, Aug 27, 2014 at 10:35 AM, Eelco Hoogendoorn <
hoogendoorn.eelco@gmail.com> wrote:
...
It wouldn't hurt to have this function, but my intuition is that its use
will be minimal. If you are already working with sorted arrays, you already
have a flop cost on that order of magnitude, and the optimized merge saves
you a factor two at the very most. Using numpy means you are sacrificing
factors of two and beyond relative to pure C left right and center anyway,
so if this kind of thing matters to you, you probably wont be working in
numpy in the first place.
That said, I share your interest in overhauling arraysetops. I see many
opportunities for expanding its functionality. There is a question that
amounts to 'how do I do group-by in numpy' on stackoverflow almost every
week. That would have my top priority, but also things like extending
np.unique to things like graph edges, or other more complex input, is very
often useful to me.
Ive written up a draft http://pastebin.com/c5WLWPbpa while ago which
accomplishes all of the above and more. It reimplements functions like
np.unique around a common Index object. This index object encapsulates the
precomputation (sorting) required for efficient set-ops on different
datatypes, and provides a common interface to obtain the kind of
information you are talking about (which is used extensively internally in
the implementation of group_by, for instance).
ie, this functionality allows you to write neat things like
group_by(randint(0,9,(100,2))).median(rand(100))
But I have the feeling much more could be done in this direction, and I
feel this draft could really use a bit of back and forth. If we are going
to completely rewrite arraysetops, we might as well do it right.
On Wed, Aug 27, 2014 at 7:02 PM, Jaime Fernández del Río <
jaime.frio@gmail.com> wrote:
...
A request was open in github to add a `merge` function to numpy that
would merge two sorted 1d arrays into a single sorted 1d array. I have been
playing around with that idea for a while, and have a branch in my numpy
fork that adds a `mergesorted` function to `numpy.lib`:
https://github.com/jaimefrio/numpy/commit/ce5d480afecc989a36e5d2bf4ea1d1ba58...
I drew inspiration from C++ STL algorithms, and merged into a single
function what merge, set_union, set_intersection, set_difference and
set_symmetric_difference do there.
My first thought when implementing this was to not make it a public
function, but use it under the hood to speed-up some of the functions of
`arraysetops.py`, which are now merging two already sorted functions by
doing `np.sort(np.concatenate((a, b)))`. I would need to revisit my
testing, but the speed-ups weren't that great.
One other thing I saw value in for some of the `arraysetops.py`
functions, but couldn't fully figure out, was in providing extra output
aside from the merged arrays, either in the form of indices, or of boolean
masks, indicating which items of the original arrays made it into the
merged one, and/or where did they end up in it.
Since there is at least one other person out there that likes it, is
there any more interest in such a function? If yes, any comments on what
the proper interface for extra output should be? Although perhaps the best
is to leave that out for starters and see what use people make of it, if
any.
Jaime
--
(\__/)
( O.o)
( > <) Este es Conejo. Copia a Conejo en tu firma y ayúdale en sus
planes de dominación mundial.
_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion
_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion
--
(\__/)
( O.o)
( > <) Este es Conejo. Copia a Conejo en tu firma y ayúdale en sus planes
de dominación mundial.
_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion