[Numpy-discussion] Does a `mergesorted` function make sense?

Eelco Hoogendoorn hoogendoorn.eelco at gmail.com
Sun Aug 31 15:48:33 EDT 2014


Ive organized all code I had relating to this subject in a github repository
<https://github.com/EelcoHoogendoorn/Numpy_arraysetops_EP>. That should
facilitate shooting around ideas. Ive also added more documentation and
structure to make it easier to see what is going on.

Hopefully we can converge on a common vision, and then improve the
documentation and testing to make it worthy of including in the numpy
master.

Note that there is also a complete rewrite of the classic
numpy.arraysetops, such that they are also generalized to more complex
input, such as finding unique graph edges, and so on.

You mentioned getting the numpy core developers involved; are they not
subscribed to this mailing list? I wouldn't be surprised; youd hope there
is a channel of discussion concerning development with higher signal to
noise....


On Thu, Aug 28, 2014 at 1:49 AM, Eelco Hoogendoorn <
hoogendoorn.eelco at gmail.com> wrote:

> I just checked the docs on ufuncs, and it appears that's a solved problem
> now, since ufunc.reduceat now comes with an axis argument. Or maybe it
> already did when I wrote that, but I simply wasn't paying attention. Either
> way, the code is fully vectorized now, in both grouped and non-grouped
> axes. Its a lot of code, but all that happens for a grouping other than
> some O(1) and O(n) stuff is an argsort of the keys, and then the reduction
> itself, all fully vectorized.
>
> Note that I sort the values first, and then use ufunc.reduceat on the
> groups. It would seem to me that ufunc.at should be more efficient, by
> avoiding this indirection, but testing very much revealed the opposite, for
> reasons unclear to me. Perhaps that's changed now as well.
>
>
> On Wed, Aug 27, 2014 at 11:32 PM, Jaime Fernández del Río <
> jaime.frio at gmail.com> wrote:
>
>> Yes, I was aware of that. But the point would be to provide true
>> vectorization on those operations.
>>
>> The way I see it, numpy may not have to have a GroupBy implementation,
>> but it should at least enable implementing one that is fast and efficient
>> over any axis.
>>
>>
>> On Wed, Aug 27, 2014 at 12:38 PM, Eelco Hoogendoorn <
>> hoogendoorn.eelco at gmail.com> wrote:
>>
>>> i.e, if the grouped axis is small but the other axes are not, you could
>>> write this, which avoids the python loop over the long axis that
>>> np.vectorize would otherwise perform.
>>>
>>> import numpy as np
>>> from grouping import group_by
>>> keys = np.random.randint(0,4,10)
>>> values = np.random.rand(10,2000)
>>> for k,g in zip(*group_by(keys)(values)):
>>>     print k, g.mean(0)
>>>
>>>
>>>
>>>
>>> On Wed, Aug 27, 2014 at 9:29 PM, Eelco Hoogendoorn <
>>> hoogendoorn.eelco at gmail.com> wrote:
>>>
>>>> f.i., this works as expected as well (100 keys of 1d int arrays and 100
>>>> values of 1d float arrays):
>>>>
>>>> group_by(randint(0,4,(100,2))).mean(rand(100,2))
>>>>
>>>>
>>>> On Wed, Aug 27, 2014 at 9:27 PM, Eelco Hoogendoorn <
>>>> hoogendoorn.eelco at gmail.com> wrote:
>>>>
>>>>> If I understand you correctly, the current implementation supports
>>>>> these operations. All reductions over groups (except for median) are
>>>>> performed through the corresponding ufunc (see GroupBy.reduce). This works
>>>>> on multidimensional arrays as well, although this broadcasting over the
>>>>> non-grouping axes is accomplished using np.vectorize. Actual vectorization
>>>>> only happens over the axis being grouped over, but this is usually a long
>>>>> axis. If it isn't, it is more efficient to perform a reduction by means of
>>>>> splitting the array by its groups first, and then map the iterable of
>>>>> groups over some reduction operation (as noted in the docstring of
>>>>> GroupBy.reduce).
>>>>>
>>>>>
>>>>> On Wed, Aug 27, 2014 at 8:29 PM, Jaime Fernández del Río <
>>>>> jaime.frio at gmail.com> wrote:
>>>>>
>>>>>> Hi Eelco,
>>>>>>
>>>>>> I took a deeper look into your code a couple of weeks back. I don't
>>>>>> think I have fully grasped what it allows completely, but I agree that some
>>>>>> form of what you have there is highly desirable. Along the same lines, for
>>>>>> sometime I have been thinking that the right place for a `groupby` in numpy
>>>>>> is as a method of ufuncs, so that `np.add.groupby(arr, groups)` would do a
>>>>>> multidimensional version of `np.bincount(groups, weights=arr)`. You would
>>>>>> then need a more powerful version of `np.unique` to produce the `groups`,
>>>>>> but that is something that Joe Kington's old PR was very close to
>>>>>> achieving, that should probably be resurrected as well. But yes, there
>>>>>> seems to be material for a NEP here, and some guidance from one of the
>>>>>> numpy devs would be helpful in getting this somewhere.
>>>>>>
>>>>>> Jaime
>>>>>>
>>>>>>
>>>>>> On Wed, Aug 27, 2014 at 10:35 AM, Eelco Hoogendoorn <
>>>>>> hoogendoorn.eelco at gmail.com> wrote:
>>>>>>
>>>>>>> It wouldn't hurt to have this function, but my intuition is that its
>>>>>>> use will be minimal. If you are already working with sorted arrays, you
>>>>>>> already have a flop cost on that order of magnitude, and the optimized
>>>>>>> merge saves you a factor two at the very most. Using numpy means you are
>>>>>>> sacrificing factors of two and beyond relative to pure C left right and
>>>>>>> center anyway, so if this kind of thing matters to you, you probably wont
>>>>>>> be working in numpy in the first place.
>>>>>>>
>>>>>>> That said, I share your interest in overhauling arraysetops. I see
>>>>>>> many opportunities for expanding its functionality. There is a question
>>>>>>> that amounts to 'how do I do group-by in numpy' on stackoverflow almost
>>>>>>> every week. That would have my top priority, but also things like extending
>>>>>>> np.unique to things like graph edges, or other more complex input, is very
>>>>>>> often useful to me.
>>>>>>>
>>>>>>> Ive written up a draft <http://pastebin.com/c5WLWPbp>a while ago
>>>>>>> which accomplishes all of the above and more. It reimplements functions
>>>>>>> like np.unique around a common Index object. This index object encapsulates
>>>>>>> the precomputation (sorting) required for efficient set-ops on different
>>>>>>> datatypes, and provides a common interface to obtain the kind of
>>>>>>> information you are talking about (which is used extensively internally in
>>>>>>> the implementation of group_by, for instance).
>>>>>>>
>>>>>>> ie, this functionality allows you to write neat things like
>>>>>>> group_by(randint(0,9,(100,2))).median(rand(100))
>>>>>>>
>>>>>>> But I have the feeling much more could be done in this direction,
>>>>>>> and I feel this draft could really use a bit of back and forth. If we are
>>>>>>> going to completely rewrite arraysetops, we might as well do it right.
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Aug 27, 2014 at 7:02 PM, Jaime Fernández del Río <
>>>>>>> jaime.frio at gmail.com> wrote:
>>>>>>>
>>>>>>>> A request was open in github to add a `merge` function to numpy
>>>>>>>> that would merge two sorted 1d arrays into a single sorted 1d array. I have
>>>>>>>> been playing around with that idea for a while, and have a branch in my
>>>>>>>> numpy fork that adds a `mergesorted` function to `numpy.lib`:
>>>>>>>>
>>>>>>>>
>>>>>>>> https://github.com/jaimefrio/numpy/commit/ce5d480afecc989a36e5d2bf4ea1d1ba58a83b0a
>>>>>>>>
>>>>>>>> I drew inspiration from C++ STL algorithms, and merged into a
>>>>>>>> single function what merge, set_union, set_intersection, set_difference and
>>>>>>>> set_symmetric_difference do there.
>>>>>>>>
>>>>>>>> My first thought when implementing this was to not make it a public
>>>>>>>> function, but use it under the hood to speed-up some of the functions of
>>>>>>>> `arraysetops.py`, which are now merging two already sorted functions by
>>>>>>>> doing `np.sort(np.concatenate((a, b)))`. I would need to revisit my
>>>>>>>> testing, but the speed-ups weren't that great.
>>>>>>>>
>>>>>>>> One other thing I saw value in for some of the `arraysetops.py`
>>>>>>>> functions, but couldn't fully figure out, was in providing extra output
>>>>>>>> aside from the merged arrays, either in the form of indices, or of boolean
>>>>>>>> masks, indicating which items of the original arrays made it into the
>>>>>>>> merged one, and/or where did they end up in it.
>>>>>>>>
>>>>>>>> Since there is at least one other person out there that likes it,
>>>>>>>> is there any more interest in such a function? If yes, any comments on what
>>>>>>>> the proper interface for extra output should be? Although perhaps the best
>>>>>>>> is to leave that out for starters and see what use people make of it, if
>>>>>>>> any.
>>>>>>>>
>>>>>>>> Jaime
>>>>>>>>
>>>>>>>> --
>>>>>>>> (\__/)
>>>>>>>> ( O.o)
>>>>>>>> ( > <) Este es Conejo. Copia a Conejo en tu firma y ayúdale en sus
>>>>>>>> planes de dominación mundial.
>>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>>> NumPy-Discussion mailing list
>>>>>>>> NumPy-Discussion at scipy.org
>>>>>>>> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> NumPy-Discussion mailing list
>>>>>>> NumPy-Discussion at scipy.org
>>>>>>> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> (\__/)
>>>>>> ( O.o)
>>>>>> ( > <) Este es Conejo. Copia a Conejo en tu firma y ayúdale en sus
>>>>>> planes de dominación mundial.
>>>>>>
>>>>>> _______________________________________________
>>>>>> NumPy-Discussion mailing list
>>>>>> NumPy-Discussion at scipy.org
>>>>>> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>> _______________________________________________
>>> NumPy-Discussion mailing list
>>> NumPy-Discussion at scipy.org
>>> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>>>
>>>
>>
>>
>> --
>> (\__/)
>> ( O.o)
>> ( > <) Este es Conejo. Copia a Conejo en tu firma y ayúdale en sus planes
>> de dominación mundial.
>>
>> _______________________________________________
>> NumPy-Discussion mailing list
>> NumPy-Discussion at scipy.org
>> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>>
>>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/numpy-discussion/attachments/20140831/8c07db90/attachment.html>


More information about the NumPy-Discussion mailing list