Does a `mergesorted` function make sense?
A request was open in github to add a `merge` function to numpy that would merge two sorted 1d arrays into a single sorted 1d array. I have been playing around with that idea for a while, and have a branch in my numpy fork that adds a `mergesorted` function to `numpy.lib`:
https://github.com/jaimefrio/numpy/commit/ce5d480afecc989a36e5d2bf4ea1d1ba58...
I drew inspiration from C++ STL algorithms, and merged into a single function what merge, set_union, set_intersection, set_difference and set_symmetric_difference do there.
My first thought when implementing this was to not make it a public function, but use it under the hood to speedup some of the functions of `arraysetops.py`, which are now merging two already sorted functions by doing `np.sort(np.concatenate((a, b)))`. I would need to revisit my testing, but the speedups weren't that great.
One other thing I saw value in for some of the `arraysetops.py` functions, but couldn't fully figure out, was in providing extra output aside from the merged arrays, either in the form of indices, or of boolean masks, indicating which items of the original arrays made it into the merged one, and/or where did they end up in it.
Since there is at least one other person out there that likes it, is there any more interest in such a function? If yes, any comments on what the proper interface for extra output should be? Although perhaps the best is to leave that out for starters and see what use people make of it, if any.
Jaime
It wouldn't hurt to have this function, but my intuition is that its use will be minimal. If you are already working with sorted arrays, you already have a flop cost on that order of magnitude, and the optimized merge saves you a factor two at the very most. Using numpy means you are sacrificing factors of two and beyond relative to pure C left right and center anyway, so if this kind of thing matters to you, you probably wont be working in numpy in the first place.
That said, I share your interest in overhauling arraysetops. I see many opportunities for expanding its functionality. There is a question that amounts to 'how do I do groupby in numpy' on stackoverflow almost every week. That would have my top priority, but also things like extending np.unique to things like graph edges, or other more complex input, is very often useful to me.
Ive written up a draft http://pastebin.com/c5WLWPbpa while ago which accomplishes all of the above and more. It reimplements functions like np.unique around a common Index object. This index object encapsulates the precomputation (sorting) required for efficient setops on different datatypes, and provides a common interface to obtain the kind of information you are talking about (which is used extensively internally in the implementation of group_by, for instance).
ie, this functionality allows you to write neat things like group_by(randint(0,9,(100,2))).median(rand(100))
But I have the feeling much more could be done in this direction, and I feel this draft could really use a bit of back and forth. If we are going to completely rewrite arraysetops, we might as well do it right.
On Wed, Aug 27, 2014 at 7:02 PM, Jaime Fernández del Río < jaime.frio@gmail.com> wrote:
A request was open in github to add a `merge` function to numpy that would merge two sorted 1d arrays into a single sorted 1d array. I have been playing around with that idea for a while, and have a branch in my numpy fork that adds a `mergesorted` function to `numpy.lib`:
https://github.com/jaimefrio/numpy/commit/ce5d480afecc989a36e5d2bf4ea1d1ba58...
I drew inspiration from C++ STL algorithms, and merged into a single function what merge, set_union, set_intersection, set_difference and set_symmetric_difference do there.
My first thought when implementing this was to not make it a public function, but use it under the hood to speedup some of the functions of `arraysetops.py`, which are now merging two already sorted functions by doing `np.sort(np.concatenate((a, b)))`. I would need to revisit my testing, but the speedups weren't that great.
One other thing I saw value in for some of the `arraysetops.py` functions, but couldn't fully figure out, was in providing extra output aside from the merged arrays, either in the form of indices, or of boolean masks, indicating which items of the original arrays made it into the merged one, and/or where did they end up in it.
Since there is at least one other person out there that likes it, is there any more interest in such a function? If yes, any comments on what the proper interface for extra output should be? Although perhaps the best is to leave that out for starters and see what use people make of it, if any.
Jaime
 (__/) ( O.o) ( > <) Este es Conejo. Copia a Conejo en tu firma y ayúdale en sus planes de dominación mundial.
NumPyDiscussion mailing list NumPyDiscussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpydiscussion
Hi Eelco,
I took a deeper look into your code a couple of weeks back. I don't think I have fully grasped what it allows completely, but I agree that some form of what you have there is highly desirable. Along the same lines, for sometime I have been thinking that the right place for a `groupby` in numpy is as a method of ufuncs, so that `np.add.groupby(arr, groups)` would do a multidimensional version of `np.bincount(groups, weights=arr)`. You would then need a more powerful version of `np.unique` to produce the `groups`, but that is something that Joe Kington's old PR was very close to achieving, that should probably be resurrected as well. But yes, there seems to be material for a NEP here, and some guidance from one of the numpy devs would be helpful in getting this somewhere.
Jaime
On Wed, Aug 27, 2014 at 10:35 AM, Eelco Hoogendoorn < hoogendoorn.eelco@gmail.com> wrote:
It wouldn't hurt to have this function, but my intuition is that its use will be minimal. If you are already working with sorted arrays, you already have a flop cost on that order of magnitude, and the optimized merge saves you a factor two at the very most. Using numpy means you are sacrificing factors of two and beyond relative to pure C left right and center anyway, so if this kind of thing matters to you, you probably wont be working in numpy in the first place.
That said, I share your interest in overhauling arraysetops. I see many opportunities for expanding its functionality. There is a question that amounts to 'how do I do groupby in numpy' on stackoverflow almost every week. That would have my top priority, but also things like extending np.unique to things like graph edges, or other more complex input, is very often useful to me.
Ive written up a draft http://pastebin.com/c5WLWPbpa while ago which accomplishes all of the above and more. It reimplements functions like np.unique around a common Index object. This index object encapsulates the precomputation (sorting) required for efficient setops on different datatypes, and provides a common interface to obtain the kind of information you are talking about (which is used extensively internally in the implementation of group_by, for instance).
ie, this functionality allows you to write neat things like group_by(randint(0,9,(100,2))).median(rand(100))
But I have the feeling much more could be done in this direction, and I feel this draft could really use a bit of back and forth. If we are going to completely rewrite arraysetops, we might as well do it right.
On Wed, Aug 27, 2014 at 7:02 PM, Jaime Fernández del Río < jaime.frio@gmail.com> wrote:
A request was open in github to add a `merge` function to numpy that would merge two sorted 1d arrays into a single sorted 1d array. I have been playing around with that idea for a while, and have a branch in my numpy fork that adds a `mergesorted` function to `numpy.lib`:
https://github.com/jaimefrio/numpy/commit/ce5d480afecc989a36e5d2bf4ea1d1ba58...
I drew inspiration from C++ STL algorithms, and merged into a single function what merge, set_union, set_intersection, set_difference and set_symmetric_difference do there.
My first thought when implementing this was to not make it a public function, but use it under the hood to speedup some of the functions of `arraysetops.py`, which are now merging two already sorted functions by doing `np.sort(np.concatenate((a, b)))`. I would need to revisit my testing, but the speedups weren't that great.
One other thing I saw value in for some of the `arraysetops.py` functions, but couldn't fully figure out, was in providing extra output aside from the merged arrays, either in the form of indices, or of boolean masks, indicating which items of the original arrays made it into the merged one, and/or where did they end up in it.
Since there is at least one other person out there that likes it, is there any more interest in such a function? If yes, any comments on what the proper interface for extra output should be? Although perhaps the best is to leave that out for starters and see what use people make of it, if any.
Jaime
 (__/) ( O.o) ( > <) Este es Conejo. Copia a Conejo en tu firma y ayúdale en sus planes de dominación mundial.
NumPyDiscussion mailing list NumPyDiscussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpydiscussion
NumPyDiscussion mailing list NumPyDiscussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpydiscussion
If I understand you correctly, the current implementation supports these operations. All reductions over groups (except for median) are performed through the corresponding ufunc (see GroupBy.reduce). This works on multidimensional arrays as well, although this broadcasting over the nongrouping axes is accomplished using np.vectorize. Actual vectorization only happens over the axis being grouped over, but this is usually a long axis. If it isn't, it is more efficient to perform a reduction by means of splitting the array by its groups first, and then map the iterable of groups over some reduction operation (as noted in the docstring of GroupBy.reduce).
On Wed, Aug 27, 2014 at 8:29 PM, Jaime Fernández del Río < jaime.frio@gmail.com> wrote:
Hi Eelco,
I took a deeper look into your code a couple of weeks back. I don't think I have fully grasped what it allows completely, but I agree that some form of what you have there is highly desirable. Along the same lines, for sometime I have been thinking that the right place for a `groupby` in numpy is as a method of ufuncs, so that `np.add.groupby(arr, groups)` would do a multidimensional version of `np.bincount(groups, weights=arr)`. You would then need a more powerful version of `np.unique` to produce the `groups`, but that is something that Joe Kington's old PR was very close to achieving, that should probably be resurrected as well. But yes, there seems to be material for a NEP here, and some guidance from one of the numpy devs would be helpful in getting this somewhere.
Jaime
On Wed, Aug 27, 2014 at 10:35 AM, Eelco Hoogendoorn < hoogendoorn.eelco@gmail.com> wrote:
It wouldn't hurt to have this function, but my intuition is that its use will be minimal. If you are already working with sorted arrays, you already have a flop cost on that order of magnitude, and the optimized merge saves you a factor two at the very most. Using numpy means you are sacrificing factors of two and beyond relative to pure C left right and center anyway, so if this kind of thing matters to you, you probably wont be working in numpy in the first place.
That said, I share your interest in overhauling arraysetops. I see many opportunities for expanding its functionality. There is a question that amounts to 'how do I do groupby in numpy' on stackoverflow almost every week. That would have my top priority, but also things like extending np.unique to things like graph edges, or other more complex input, is very often useful to me.
Ive written up a draft http://pastebin.com/c5WLWPbpa while ago which accomplishes all of the above and more. It reimplements functions like np.unique around a common Index object. This index object encapsulates the precomputation (sorting) required for efficient setops on different datatypes, and provides a common interface to obtain the kind of information you are talking about (which is used extensively internally in the implementation of group_by, for instance).
ie, this functionality allows you to write neat things like group_by(randint(0,9,(100,2))).median(rand(100))
But I have the feeling much more could be done in this direction, and I feel this draft could really use a bit of back and forth. If we are going to completely rewrite arraysetops, we might as well do it right.
On Wed, Aug 27, 2014 at 7:02 PM, Jaime Fernández del Río < jaime.frio@gmail.com> wrote:
A request was open in github to add a `merge` function to numpy that would merge two sorted 1d arrays into a single sorted 1d array. I have been playing around with that idea for a while, and have a branch in my numpy fork that adds a `mergesorted` function to `numpy.lib`:
https://github.com/jaimefrio/numpy/commit/ce5d480afecc989a36e5d2bf4ea1d1ba58...
I drew inspiration from C++ STL algorithms, and merged into a single function what merge, set_union, set_intersection, set_difference and set_symmetric_difference do there.
My first thought when implementing this was to not make it a public function, but use it under the hood to speedup some of the functions of `arraysetops.py`, which are now merging two already sorted functions by doing `np.sort(np.concatenate((a, b)))`. I would need to revisit my testing, but the speedups weren't that great.
One other thing I saw value in for some of the `arraysetops.py` functions, but couldn't fully figure out, was in providing extra output aside from the merged arrays, either in the form of indices, or of boolean masks, indicating which items of the original arrays made it into the merged one, and/or where did they end up in it.
Since there is at least one other person out there that likes it, is there any more interest in such a function? If yes, any comments on what the proper interface for extra output should be? Although perhaps the best is to leave that out for starters and see what use people make of it, if any.
Jaime
 (__/) ( O.o) ( > <) Este es Conejo. Copia a Conejo en tu firma y ayúdale en sus planes de dominación mundial.
NumPyDiscussion mailing list NumPyDiscussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpydiscussion
NumPyDiscussion mailing list NumPyDiscussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpydiscussion
 (__/) ( O.o) ( > <) Este es Conejo. Copia a Conejo en tu firma y ayúdale en sus planes de dominación mundial.
NumPyDiscussion mailing list NumPyDiscussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpydiscussion
f.i., this works as expected as well (100 keys of 1d int arrays and 100 values of 1d float arrays):
group_by(randint(0,4,(100,2))).mean(rand(100,2))
On Wed, Aug 27, 2014 at 9:27 PM, Eelco Hoogendoorn < hoogendoorn.eelco@gmail.com> wrote:
If I understand you correctly, the current implementation supports these operations. All reductions over groups (except for median) are performed through the corresponding ufunc (see GroupBy.reduce). This works on multidimensional arrays as well, although this broadcasting over the nongrouping axes is accomplished using np.vectorize. Actual vectorization only happens over the axis being grouped over, but this is usually a long axis. If it isn't, it is more efficient to perform a reduction by means of splitting the array by its groups first, and then map the iterable of groups over some reduction operation (as noted in the docstring of GroupBy.reduce).
On Wed, Aug 27, 2014 at 8:29 PM, Jaime Fernández del Río < jaime.frio@gmail.com> wrote:
Hi Eelco,
I took a deeper look into your code a couple of weeks back. I don't think I have fully grasped what it allows completely, but I agree that some form of what you have there is highly desirable. Along the same lines, for sometime I have been thinking that the right place for a `groupby` in numpy is as a method of ufuncs, so that `np.add.groupby(arr, groups)` would do a multidimensional version of `np.bincount(groups, weights=arr)`. You would then need a more powerful version of `np.unique` to produce the `groups`, but that is something that Joe Kington's old PR was very close to achieving, that should probably be resurrected as well. But yes, there seems to be material for a NEP here, and some guidance from one of the numpy devs would be helpful in getting this somewhere.
Jaime
On Wed, Aug 27, 2014 at 10:35 AM, Eelco Hoogendoorn < hoogendoorn.eelco@gmail.com> wrote:
It wouldn't hurt to have this function, but my intuition is that its use will be minimal. If you are already working with sorted arrays, you already have a flop cost on that order of magnitude, and the optimized merge saves you a factor two at the very most. Using numpy means you are sacrificing factors of two and beyond relative to pure C left right and center anyway, so if this kind of thing matters to you, you probably wont be working in numpy in the first place.
That said, I share your interest in overhauling arraysetops. I see many opportunities for expanding its functionality. There is a question that amounts to 'how do I do groupby in numpy' on stackoverflow almost every week. That would have my top priority, but also things like extending np.unique to things like graph edges, or other more complex input, is very often useful to me.
Ive written up a draft http://pastebin.com/c5WLWPbpa while ago which accomplishes all of the above and more. It reimplements functions like np.unique around a common Index object. This index object encapsulates the precomputation (sorting) required for efficient setops on different datatypes, and provides a common interface to obtain the kind of information you are talking about (which is used extensively internally in the implementation of group_by, for instance).
ie, this functionality allows you to write neat things like group_by(randint(0,9,(100,2))).median(rand(100))
But I have the feeling much more could be done in this direction, and I feel this draft could really use a bit of back and forth. If we are going to completely rewrite arraysetops, we might as well do it right.
On Wed, Aug 27, 2014 at 7:02 PM, Jaime Fernández del Río < jaime.frio@gmail.com> wrote:
A request was open in github to add a `merge` function to numpy that would merge two sorted 1d arrays into a single sorted 1d array. I have been playing around with that idea for a while, and have a branch in my numpy fork that adds a `mergesorted` function to `numpy.lib`:
https://github.com/jaimefrio/numpy/commit/ce5d480afecc989a36e5d2bf4ea1d1ba58...
I drew inspiration from C++ STL algorithms, and merged into a single function what merge, set_union, set_intersection, set_difference and set_symmetric_difference do there.
My first thought when implementing this was to not make it a public function, but use it under the hood to speedup some of the functions of `arraysetops.py`, which are now merging two already sorted functions by doing `np.sort(np.concatenate((a, b)))`. I would need to revisit my testing, but the speedups weren't that great.
One other thing I saw value in for some of the `arraysetops.py` functions, but couldn't fully figure out, was in providing extra output aside from the merged arrays, either in the form of indices, or of boolean masks, indicating which items of the original arrays made it into the merged one, and/or where did they end up in it.
Since there is at least one other person out there that likes it, is there any more interest in such a function? If yes, any comments on what the proper interface for extra output should be? Although perhaps the best is to leave that out for starters and see what use people make of it, if any.
Jaime
 (__/) ( O.o) ( > <) Este es Conejo. Copia a Conejo en tu firma y ayúdale en sus planes de dominación mundial.
NumPyDiscussion mailing list NumPyDiscussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpydiscussion
NumPyDiscussion mailing list NumPyDiscussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpydiscussion
 (__/) ( O.o) ( > <) Este es Conejo. Copia a Conejo en tu firma y ayúdale en sus planes de dominación mundial.
NumPyDiscussion mailing list NumPyDiscussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpydiscussion
i.e, if the grouped axis is small but the other axes are not, you could write this, which avoids the python loop over the long axis that np.vectorize would otherwise perform.
import numpy as np from grouping import group_by keys = np.random.randint(0,4,10) values = np.random.rand(10,2000) for k,g in zip(*group_by(keys)(values)): print k, g.mean(0)
On Wed, Aug 27, 2014 at 9:29 PM, Eelco Hoogendoorn < hoogendoorn.eelco@gmail.com> wrote:
f.i., this works as expected as well (100 keys of 1d int arrays and 100 values of 1d float arrays):
group_by(randint(0,4,(100,2))).mean(rand(100,2))
On Wed, Aug 27, 2014 at 9:27 PM, Eelco Hoogendoorn < hoogendoorn.eelco@gmail.com> wrote:
If I understand you correctly, the current implementation supports these operations. All reductions over groups (except for median) are performed through the corresponding ufunc (see GroupBy.reduce). This works on multidimensional arrays as well, although this broadcasting over the nongrouping axes is accomplished using np.vectorize. Actual vectorization only happens over the axis being grouped over, but this is usually a long axis. If it isn't, it is more efficient to perform a reduction by means of splitting the array by its groups first, and then map the iterable of groups over some reduction operation (as noted in the docstring of GroupBy.reduce).
On Wed, Aug 27, 2014 at 8:29 PM, Jaime Fernández del Río < jaime.frio@gmail.com> wrote:
Hi Eelco,
I took a deeper look into your code a couple of weeks back. I don't think I have fully grasped what it allows completely, but I agree that some form of what you have there is highly desirable. Along the same lines, for sometime I have been thinking that the right place for a `groupby` in numpy is as a method of ufuncs, so that `np.add.groupby(arr, groups)` would do a multidimensional version of `np.bincount(groups, weights=arr)`. You would then need a more powerful version of `np.unique` to produce the `groups`, but that is something that Joe Kington's old PR was very close to achieving, that should probably be resurrected as well. But yes, there seems to be material for a NEP here, and some guidance from one of the numpy devs would be helpful in getting this somewhere.
Jaime
On Wed, Aug 27, 2014 at 10:35 AM, Eelco Hoogendoorn < hoogendoorn.eelco@gmail.com> wrote:
It wouldn't hurt to have this function, but my intuition is that its use will be minimal. If you are already working with sorted arrays, you already have a flop cost on that order of magnitude, and the optimized merge saves you a factor two at the very most. Using numpy means you are sacrificing factors of two and beyond relative to pure C left right and center anyway, so if this kind of thing matters to you, you probably wont be working in numpy in the first place.
That said, I share your interest in overhauling arraysetops. I see many opportunities for expanding its functionality. There is a question that amounts to 'how do I do groupby in numpy' on stackoverflow almost every week. That would have my top priority, but also things like extending np.unique to things like graph edges, or other more complex input, is very often useful to me.
Ive written up a draft http://pastebin.com/c5WLWPbpa while ago which accomplishes all of the above and more. It reimplements functions like np.unique around a common Index object. This index object encapsulates the precomputation (sorting) required for efficient setops on different datatypes, and provides a common interface to obtain the kind of information you are talking about (which is used extensively internally in the implementation of group_by, for instance).
ie, this functionality allows you to write neat things like group_by(randint(0,9,(100,2))).median(rand(100))
But I have the feeling much more could be done in this direction, and I feel this draft could really use a bit of back and forth. If we are going to completely rewrite arraysetops, we might as well do it right.
On Wed, Aug 27, 2014 at 7:02 PM, Jaime Fernández del Río < jaime.frio@gmail.com> wrote:
A request was open in github to add a `merge` function to numpy that would merge two sorted 1d arrays into a single sorted 1d array. I have been playing around with that idea for a while, and have a branch in my numpy fork that adds a `mergesorted` function to `numpy.lib`:
https://github.com/jaimefrio/numpy/commit/ce5d480afecc989a36e5d2bf4ea1d1ba58...
I drew inspiration from C++ STL algorithms, and merged into a single function what merge, set_union, set_intersection, set_difference and set_symmetric_difference do there.
My first thought when implementing this was to not make it a public function, but use it under the hood to speedup some of the functions of `arraysetops.py`, which are now merging two already sorted functions by doing `np.sort(np.concatenate((a, b)))`. I would need to revisit my testing, but the speedups weren't that great.
One other thing I saw value in for some of the `arraysetops.py` functions, but couldn't fully figure out, was in providing extra output aside from the merged arrays, either in the form of indices, or of boolean masks, indicating which items of the original arrays made it into the merged one, and/or where did they end up in it.
Since there is at least one other person out there that likes it, is there any more interest in such a function? If yes, any comments on what the proper interface for extra output should be? Although perhaps the best is to leave that out for starters and see what use people make of it, if any.
Jaime
 (__/) ( O.o) ( > <) Este es Conejo. Copia a Conejo en tu firma y ayúdale en sus planes de dominación mundial.
NumPyDiscussion mailing list NumPyDiscussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpydiscussion
NumPyDiscussion mailing list NumPyDiscussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpydiscussion
 (__/) ( O.o) ( > <) Este es Conejo. Copia a Conejo en tu firma y ayúdale en sus planes de dominación mundial.
NumPyDiscussion mailing list NumPyDiscussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpydiscussion
Yes, I was aware of that. But the point would be to provide true vectorization on those operations.
The way I see it, numpy may not have to have a GroupBy implementation, but it should at least enable implementing one that is fast and efficient over any axis.
On Wed, Aug 27, 2014 at 12:38 PM, Eelco Hoogendoorn < hoogendoorn.eelco@gmail.com> wrote:
i.e, if the grouped axis is small but the other axes are not, you could write this, which avoids the python loop over the long axis that np.vectorize would otherwise perform.
import numpy as np from grouping import group_by keys = np.random.randint(0,4,10) values = np.random.rand(10,2000) for k,g in zip(*group_by(keys)(values)): print k, g.mean(0)
On Wed, Aug 27, 2014 at 9:29 PM, Eelco Hoogendoorn < hoogendoorn.eelco@gmail.com> wrote:
f.i., this works as expected as well (100 keys of 1d int arrays and 100 values of 1d float arrays):
group_by(randint(0,4,(100,2))).mean(rand(100,2))
On Wed, Aug 27, 2014 at 9:27 PM, Eelco Hoogendoorn < hoogendoorn.eelco@gmail.com> wrote:
If I understand you correctly, the current implementation supports these operations. All reductions over groups (except for median) are performed through the corresponding ufunc (see GroupBy.reduce). This works on multidimensional arrays as well, although this broadcasting over the nongrouping axes is accomplished using np.vectorize. Actual vectorization only happens over the axis being grouped over, but this is usually a long axis. If it isn't, it is more efficient to perform a reduction by means of splitting the array by its groups first, and then map the iterable of groups over some reduction operation (as noted in the docstring of GroupBy.reduce).
On Wed, Aug 27, 2014 at 8:29 PM, Jaime Fernández del Río < jaime.frio@gmail.com> wrote:
Hi Eelco,
I took a deeper look into your code a couple of weeks back. I don't think I have fully grasped what it allows completely, but I agree that some form of what you have there is highly desirable. Along the same lines, for sometime I have been thinking that the right place for a `groupby` in numpy is as a method of ufuncs, so that `np.add.groupby(arr, groups)` would do a multidimensional version of `np.bincount(groups, weights=arr)`. You would then need a more powerful version of `np.unique` to produce the `groups`, but that is something that Joe Kington's old PR was very close to achieving, that should probably be resurrected as well. But yes, there seems to be material for a NEP here, and some guidance from one of the numpy devs would be helpful in getting this somewhere.
Jaime
On Wed, Aug 27, 2014 at 10:35 AM, Eelco Hoogendoorn < hoogendoorn.eelco@gmail.com> wrote:
It wouldn't hurt to have this function, but my intuition is that its use will be minimal. If you are already working with sorted arrays, you already have a flop cost on that order of magnitude, and the optimized merge saves you a factor two at the very most. Using numpy means you are sacrificing factors of two and beyond relative to pure C left right and center anyway, so if this kind of thing matters to you, you probably wont be working in numpy in the first place.
That said, I share your interest in overhauling arraysetops. I see many opportunities for expanding its functionality. There is a question that amounts to 'how do I do groupby in numpy' on stackoverflow almost every week. That would have my top priority, but also things like extending np.unique to things like graph edges, or other more complex input, is very often useful to me.
Ive written up a draft http://pastebin.com/c5WLWPbpa while ago which accomplishes all of the above and more. It reimplements functions like np.unique around a common Index object. This index object encapsulates the precomputation (sorting) required for efficient setops on different datatypes, and provides a common interface to obtain the kind of information you are talking about (which is used extensively internally in the implementation of group_by, for instance).
ie, this functionality allows you to write neat things like group_by(randint(0,9,(100,2))).median(rand(100))
But I have the feeling much more could be done in this direction, and I feel this draft could really use a bit of back and forth. If we are going to completely rewrite arraysetops, we might as well do it right.
On Wed, Aug 27, 2014 at 7:02 PM, Jaime Fernández del Río < jaime.frio@gmail.com> wrote:
A request was open in github to add a `merge` function to numpy that would merge two sorted 1d arrays into a single sorted 1d array. I have been playing around with that idea for a while, and have a branch in my numpy fork that adds a `mergesorted` function to `numpy.lib`:
https://github.com/jaimefrio/numpy/commit/ce5d480afecc989a36e5d2bf4ea1d1ba58...
I drew inspiration from C++ STL algorithms, and merged into a single function what merge, set_union, set_intersection, set_difference and set_symmetric_difference do there.
My first thought when implementing this was to not make it a public function, but use it under the hood to speedup some of the functions of `arraysetops.py`, which are now merging two already sorted functions by doing `np.sort(np.concatenate((a, b)))`. I would need to revisit my testing, but the speedups weren't that great.
One other thing I saw value in for some of the `arraysetops.py` functions, but couldn't fully figure out, was in providing extra output aside from the merged arrays, either in the form of indices, or of boolean masks, indicating which items of the original arrays made it into the merged one, and/or where did they end up in it.
Since there is at least one other person out there that likes it, is there any more interest in such a function? If yes, any comments on what the proper interface for extra output should be? Although perhaps the best is to leave that out for starters and see what use people make of it, if any.
Jaime
 (__/) ( O.o) ( > <) Este es Conejo. Copia a Conejo en tu firma y ayúdale en sus planes de dominación mundial.
NumPyDiscussion mailing list NumPyDiscussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpydiscussion
NumPyDiscussion mailing list NumPyDiscussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpydiscussion
 (__/) ( O.o) ( > <) Este es Conejo. Copia a Conejo en tu firma y ayúdale en sus planes de dominación mundial.
NumPyDiscussion mailing list NumPyDiscussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpydiscussion
NumPyDiscussion mailing list NumPyDiscussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpydiscussion
I just checked the docs on ufuncs, and it appears that's a solved problem now, since ufunc.reduceat now comes with an axis argument. Or maybe it already did when I wrote that, but I simply wasn't paying attention. Either way, the code is fully vectorized now, in both grouped and nongrouped axes. Its a lot of code, but all that happens for a grouping other than some O(1) and O(n) stuff is an argsort of the keys, and then the reduction itself, all fully vectorized.
Note that I sort the values first, and then use ufunc.reduceat on the groups. It would seem to me that ufunc.at should be more efficient, by avoiding this indirection, but testing very much revealed the opposite, for reasons unclear to me. Perhaps that's changed now as well.
On Wed, Aug 27, 2014 at 11:32 PM, Jaime Fernández del Río < jaime.frio@gmail.com> wrote:
Yes, I was aware of that. But the point would be to provide true vectorization on those operations.
The way I see it, numpy may not have to have a GroupBy implementation, but it should at least enable implementing one that is fast and efficient over any axis.
On Wed, Aug 27, 2014 at 12:38 PM, Eelco Hoogendoorn < hoogendoorn.eelco@gmail.com> wrote:
i.e, if the grouped axis is small but the other axes are not, you could write this, which avoids the python loop over the long axis that np.vectorize would otherwise perform.
import numpy as np from grouping import group_by keys = np.random.randint(0,4,10) values = np.random.rand(10,2000) for k,g in zip(*group_by(keys)(values)): print k, g.mean(0)
On Wed, Aug 27, 2014 at 9:29 PM, Eelco Hoogendoorn < hoogendoorn.eelco@gmail.com> wrote:
f.i., this works as expected as well (100 keys of 1d int arrays and 100 values of 1d float arrays):
group_by(randint(0,4,(100,2))).mean(rand(100,2))
On Wed, Aug 27, 2014 at 9:27 PM, Eelco Hoogendoorn < hoogendoorn.eelco@gmail.com> wrote:
If I understand you correctly, the current implementation supports these operations. All reductions over groups (except for median) are performed through the corresponding ufunc (see GroupBy.reduce). This works on multidimensional arrays as well, although this broadcasting over the nongrouping axes is accomplished using np.vectorize. Actual vectorization only happens over the axis being grouped over, but this is usually a long axis. If it isn't, it is more efficient to perform a reduction by means of splitting the array by its groups first, and then map the iterable of groups over some reduction operation (as noted in the docstring of GroupBy.reduce).
On Wed, Aug 27, 2014 at 8:29 PM, Jaime Fernández del Río < jaime.frio@gmail.com> wrote:
Hi Eelco,
I took a deeper look into your code a couple of weeks back. I don't think I have fully grasped what it allows completely, but I agree that some form of what you have there is highly desirable. Along the same lines, for sometime I have been thinking that the right place for a `groupby` in numpy is as a method of ufuncs, so that `np.add.groupby(arr, groups)` would do a multidimensional version of `np.bincount(groups, weights=arr)`. You would then need a more powerful version of `np.unique` to produce the `groups`, but that is something that Joe Kington's old PR was very close to achieving, that should probably be resurrected as well. But yes, there seems to be material for a NEP here, and some guidance from one of the numpy devs would be helpful in getting this somewhere.
Jaime
On Wed, Aug 27, 2014 at 10:35 AM, Eelco Hoogendoorn < hoogendoorn.eelco@gmail.com> wrote:
It wouldn't hurt to have this function, but my intuition is that its use will be minimal. If you are already working with sorted arrays, you already have a flop cost on that order of magnitude, and the optimized merge saves you a factor two at the very most. Using numpy means you are sacrificing factors of two and beyond relative to pure C left right and center anyway, so if this kind of thing matters to you, you probably wont be working in numpy in the first place.
That said, I share your interest in overhauling arraysetops. I see many opportunities for expanding its functionality. There is a question that amounts to 'how do I do groupby in numpy' on stackoverflow almost every week. That would have my top priority, but also things like extending np.unique to things like graph edges, or other more complex input, is very often useful to me.
Ive written up a draft http://pastebin.com/c5WLWPbpa while ago which accomplishes all of the above and more. It reimplements functions like np.unique around a common Index object. This index object encapsulates the precomputation (sorting) required for efficient setops on different datatypes, and provides a common interface to obtain the kind of information you are talking about (which is used extensively internally in the implementation of group_by, for instance).
ie, this functionality allows you to write neat things like group_by(randint(0,9,(100,2))).median(rand(100))
But I have the feeling much more could be done in this direction, and I feel this draft could really use a bit of back and forth. If we are going to completely rewrite arraysetops, we might as well do it right.
On Wed, Aug 27, 2014 at 7:02 PM, Jaime Fernández del Río < jaime.frio@gmail.com> wrote:
> A request was open in github to add a `merge` function to numpy that > would merge two sorted 1d arrays into a single sorted 1d array. I have been > playing around with that idea for a while, and have a branch in my numpy > fork that adds a `mergesorted` function to `numpy.lib`: > > > https://github.com/jaimefrio/numpy/commit/ce5d480afecc989a36e5d2bf4ea1d1ba58... > > I drew inspiration from C++ STL algorithms, and merged into a single > function what merge, set_union, set_intersection, set_difference and > set_symmetric_difference do there. > > My first thought when implementing this was to not make it a public > function, but use it under the hood to speedup some of the functions of > `arraysetops.py`, which are now merging two already sorted functions by > doing `np.sort(np.concatenate((a, b)))`. I would need to revisit my > testing, but the speedups weren't that great. > > One other thing I saw value in for some of the `arraysetops.py` > functions, but couldn't fully figure out, was in providing extra output > aside from the merged arrays, either in the form of indices, or of boolean > masks, indicating which items of the original arrays made it into the > merged one, and/or where did they end up in it. > > Since there is at least one other person out there that likes it, is > there any more interest in such a function? If yes, any comments on what > the proper interface for extra output should be? Although perhaps the best > is to leave that out for starters and see what use people make of it, if > any. > > Jaime > >  > (__/) > ( O.o) > ( > <) Este es Conejo. Copia a Conejo en tu firma y ayúdale en sus > planes de dominación mundial. > > _______________________________________________ > NumPyDiscussion mailing list > NumPyDiscussion@scipy.org > http://mail.scipy.org/mailman/listinfo/numpydiscussion > >
NumPyDiscussion mailing list NumPyDiscussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpydiscussion
 (__/) ( O.o) ( > <) Este es Conejo. Copia a Conejo en tu firma y ayúdale en sus planes de dominación mundial.
NumPyDiscussion mailing list NumPyDiscussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpydiscussion
NumPyDiscussion mailing list NumPyDiscussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpydiscussion
 (__/) ( O.o) ( > <) Este es Conejo. Copia a Conejo en tu firma y ayúdale en sus planes de dominación mundial.
NumPyDiscussion mailing list NumPyDiscussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpydiscussion
Ive organized all code I had relating to this subject in a github repository https://github.com/EelcoHoogendoorn/Numpy_arraysetops_EP. That should facilitate shooting around ideas. Ive also added more documentation and structure to make it easier to see what is going on.
Hopefully we can converge on a common vision, and then improve the documentation and testing to make it worthy of including in the numpy master.
Note that there is also a complete rewrite of the classic numpy.arraysetops, such that they are also generalized to more complex input, such as finding unique graph edges, and so on.
You mentioned getting the numpy core developers involved; are they not subscribed to this mailing list? I wouldn't be surprised; youd hope there is a channel of discussion concerning development with higher signal to noise....
On Thu, Aug 28, 2014 at 1:49 AM, Eelco Hoogendoorn < hoogendoorn.eelco@gmail.com> wrote:
I just checked the docs on ufuncs, and it appears that's a solved problem now, since ufunc.reduceat now comes with an axis argument. Or maybe it already did when I wrote that, but I simply wasn't paying attention. Either way, the code is fully vectorized now, in both grouped and nongrouped axes. Its a lot of code, but all that happens for a grouping other than some O(1) and O(n) stuff is an argsort of the keys, and then the reduction itself, all fully vectorized.
Note that I sort the values first, and then use ufunc.reduceat on the groups. It would seem to me that ufunc.at should be more efficient, by avoiding this indirection, but testing very much revealed the opposite, for reasons unclear to me. Perhaps that's changed now as well.
On Wed, Aug 27, 2014 at 11:32 PM, Jaime Fernández del Río < jaime.frio@gmail.com> wrote:
Yes, I was aware of that. But the point would be to provide true vectorization on those operations.
The way I see it, numpy may not have to have a GroupBy implementation, but it should at least enable implementing one that is fast and efficient over any axis.
On Wed, Aug 27, 2014 at 12:38 PM, Eelco Hoogendoorn < hoogendoorn.eelco@gmail.com> wrote:
i.e, if the grouped axis is small but the other axes are not, you could write this, which avoids the python loop over the long axis that np.vectorize would otherwise perform.
import numpy as np from grouping import group_by keys = np.random.randint(0,4,10) values = np.random.rand(10,2000) for k,g in zip(*group_by(keys)(values)): print k, g.mean(0)
On Wed, Aug 27, 2014 at 9:29 PM, Eelco Hoogendoorn < hoogendoorn.eelco@gmail.com> wrote:
f.i., this works as expected as well (100 keys of 1d int arrays and 100 values of 1d float arrays):
group_by(randint(0,4,(100,2))).mean(rand(100,2))
On Wed, Aug 27, 2014 at 9:27 PM, Eelco Hoogendoorn < hoogendoorn.eelco@gmail.com> wrote:
If I understand you correctly, the current implementation supports these operations. All reductions over groups (except for median) are performed through the corresponding ufunc (see GroupBy.reduce). This works on multidimensional arrays as well, although this broadcasting over the nongrouping axes is accomplished using np.vectorize. Actual vectorization only happens over the axis being grouped over, but this is usually a long axis. If it isn't, it is more efficient to perform a reduction by means of splitting the array by its groups first, and then map the iterable of groups over some reduction operation (as noted in the docstring of GroupBy.reduce).
On Wed, Aug 27, 2014 at 8:29 PM, Jaime Fernández del Río < jaime.frio@gmail.com> wrote:
Hi Eelco,
I took a deeper look into your code a couple of weeks back. I don't think I have fully grasped what it allows completely, but I agree that some form of what you have there is highly desirable. Along the same lines, for sometime I have been thinking that the right place for a `groupby` in numpy is as a method of ufuncs, so that `np.add.groupby(arr, groups)` would do a multidimensional version of `np.bincount(groups, weights=arr)`. You would then need a more powerful version of `np.unique` to produce the `groups`, but that is something that Joe Kington's old PR was very close to achieving, that should probably be resurrected as well. But yes, there seems to be material for a NEP here, and some guidance from one of the numpy devs would be helpful in getting this somewhere.
Jaime
On Wed, Aug 27, 2014 at 10:35 AM, Eelco Hoogendoorn < hoogendoorn.eelco@gmail.com> wrote:
> It wouldn't hurt to have this function, but my intuition is that its > use will be minimal. If you are already working with sorted arrays, you > already have a flop cost on that order of magnitude, and the optimized > merge saves you a factor two at the very most. Using numpy means you are > sacrificing factors of two and beyond relative to pure C left right and > center anyway, so if this kind of thing matters to you, you probably wont > be working in numpy in the first place. > > That said, I share your interest in overhauling arraysetops. I see > many opportunities for expanding its functionality. There is a question > that amounts to 'how do I do groupby in numpy' on stackoverflow almost > every week. That would have my top priority, but also things like extending > np.unique to things like graph edges, or other more complex input, is very > often useful to me. > > Ive written up a draft http://pastebin.com/c5WLWPbpa while ago > which accomplishes all of the above and more. It reimplements functions > like np.unique around a common Index object. This index object encapsulates > the precomputation (sorting) required for efficient setops on different > datatypes, and provides a common interface to obtain the kind of > information you are talking about (which is used extensively internally in > the implementation of group_by, for instance). > > ie, this functionality allows you to write neat things like > group_by(randint(0,9,(100,2))).median(rand(100)) > > But I have the feeling much more could be done in this direction, > and I feel this draft could really use a bit of back and forth. If we are > going to completely rewrite arraysetops, we might as well do it right. > > > On Wed, Aug 27, 2014 at 7:02 PM, Jaime Fernández del Río < > jaime.frio@gmail.com> wrote: > >> A request was open in github to add a `merge` function to numpy >> that would merge two sorted 1d arrays into a single sorted 1d array. I have >> been playing around with that idea for a while, and have a branch in my >> numpy fork that adds a `mergesorted` function to `numpy.lib`: >> >> >> https://github.com/jaimefrio/numpy/commit/ce5d480afecc989a36e5d2bf4ea1d1ba58... >> >> I drew inspiration from C++ STL algorithms, and merged into a >> single function what merge, set_union, set_intersection, set_difference and >> set_symmetric_difference do there. >> >> My first thought when implementing this was to not make it a public >> function, but use it under the hood to speedup some of the functions of >> `arraysetops.py`, which are now merging two already sorted functions by >> doing `np.sort(np.concatenate((a, b)))`. I would need to revisit my >> testing, but the speedups weren't that great. >> >> One other thing I saw value in for some of the `arraysetops.py` >> functions, but couldn't fully figure out, was in providing extra output >> aside from the merged arrays, either in the form of indices, or of boolean >> masks, indicating which items of the original arrays made it into the >> merged one, and/or where did they end up in it. >> >> Since there is at least one other person out there that likes it, >> is there any more interest in such a function? If yes, any comments on what >> the proper interface for extra output should be? Although perhaps the best >> is to leave that out for starters and see what use people make of it, if >> any. >> >> Jaime >> >>  >> (__/) >> ( O.o) >> ( > <) Este es Conejo. Copia a Conejo en tu firma y ayúdale en sus >> planes de dominación mundial. >> >> _______________________________________________ >> NumPyDiscussion mailing list >> NumPyDiscussion@scipy.org >> http://mail.scipy.org/mailman/listinfo/numpydiscussion >> >> > > _______________________________________________ > NumPyDiscussion mailing list > NumPyDiscussion@scipy.org > http://mail.scipy.org/mailman/listinfo/numpydiscussion > >
 (__/) ( O.o) ( > <) Este es Conejo. Copia a Conejo en tu firma y ayúdale en sus planes de dominación mundial.
NumPyDiscussion mailing list NumPyDiscussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpydiscussion
NumPyDiscussion mailing list NumPyDiscussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpydiscussion
 (__/) ( O.o) ( > <) Este es Conejo. Copia a Conejo en tu firma y ayúdale en sus planes de dominación mundial.
NumPyDiscussion mailing list NumPyDiscussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpydiscussion
On Sun, Aug 31, 2014 at 1:48 PM, Eelco Hoogendoorn < hoogendoorn.eelco@gmail.com> wrote:
Ive organized all code I had relating to this subject in a github repository https://github.com/EelcoHoogendoorn/Numpy_arraysetops_EP. That should facilitate shooting around ideas. Ive also added more documentation and structure to make it easier to see what is going on.
Hopefully we can converge on a common vision, and then improve the documentation and testing to make it worthy of including in the numpy master.
Note that there is also a complete rewrite of the classic numpy.arraysetops, such that they are also generalized to more complex input, such as finding unique graph edges, and so on.
You mentioned getting the numpy core developers involved; are they not subscribed to this mailing list? I wouldn't be surprised; youd hope there is a channel of discussion concerning development with higher signal to noise....
There are only about 2.5 of us at the moment. Those for whom this is an itch that need scratching should hash things out and make a PR. The main question for me is if it belongs in numpy, scipy, or somewhere else.
Chuck
Sure, id like to do the hashing things out, but I would also like some preliminary feedback as to whether this is going in a direction anyone else sees the point of, if it conflicts with other plans, and indeed if we can agree that numpy is the right place for it; a point which I would very much like to defend. If there is some obvious nogo that im missing, I can do without the drudgery of writing proper documentation ;).
As for whether this belongs in numpy: yes, I would say so. There are the extension of functionality to functions already in numpy, which are a nobrainer (it need not cost anything performance wise, and ive needed unique graph edges many many times), and there is the grouping functionality, which is the main novelty.
However, note that the grouping functionality itself is a very small addition, just a few 100 lines of pure python, given that the indexing logic has been factored out of the classic arraysetops. At least from a developers perspective, it very much feels like a logical extension of the same 'thing'.
But also from a conceptual numpy perspective, grouping is really more an 'elementary manipulation of an ndarray' than a 'special purpose algorithm'. It is useful for literally all kinds of programming; hence there is similar functionality in the python standard library (itertools.groupby); so why not have an efficient vectorized equivalent in numpy? It belongs there more than the linalg module, arguably.
Also, from a community perspective, a significant fraction of all stackoverflow numpy questions are (unknowingly) exactly about 'how to do grouping in numpy'.
On Mon, Sep 1, 2014 at 4:36 AM, Charles R Harris charlesr.harris@gmail.com wrote:
On Sun, Aug 31, 2014 at 1:48 PM, Eelco Hoogendoorn < hoogendoorn.eelco@gmail.com> wrote:
Ive organized all code I had relating to this subject in a github repository https://github.com/EelcoHoogendoorn/Numpy_arraysetops_EP. That should facilitate shooting around ideas. Ive also added more documentation and structure to make it easier to see what is going on.
Hopefully we can converge on a common vision, and then improve the documentation and testing to make it worthy of including in the numpy master.
Note that there is also a complete rewrite of the classic numpy.arraysetops, such that they are also generalized to more complex input, such as finding unique graph edges, and so on.
You mentioned getting the numpy core developers involved; are they not subscribed to this mailing list? I wouldn't be surprised; youd hope there is a channel of discussion concerning development with higher signal to noise....
There are only about 2.5 of us at the moment. Those for whom this is an itch that need scratching should hash things out and make a PR. The main question for me is if it belongs in numpy, scipy, or somewhere else.
Chuck
NumPyDiscussion mailing list NumPyDiscussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpydiscussion
On Mon, Sep 1, 2014 at 1:49 AM, Eelco Hoogendoorn < hoogendoorn.eelco@gmail.com> wrote:
Sure, id like to do the hashing things out, but I would also like some preliminary feedback as to whether this is going in a direction anyone else sees the point of, if it conflicts with other plans, and indeed if we can agree that numpy is the right place for it; a point which I would very much like to defend. If there is some obvious nogo that im missing, I can do without the drudgery of writing proper documentation ;).
As for whether this belongs in numpy: yes, I would say so. There are the extension of functionality to functions already in numpy, which are a nobrainer (it need not cost anything performance wise, and ive needed unique graph edges many many times), and there is the grouping functionality, which is the main novelty.
However, note that the grouping functionality itself is a very small addition, just a few 100 lines of pure python, given that the indexing logic has been factored out of the classic arraysetops. At least from a developers perspective, it very much feels like a logical extension of the same 'thing'.
But also from a conceptual numpy perspective, grouping is really more an 'elementary manipulation of an ndarray' than a 'special purpose algorithm'. It is useful for literally all kinds of programming; hence there is similar functionality in the python standard library (itertools.groupby); so why not have an efficient vectorized equivalent in numpy? It belongs there more than the linalg module, arguably.
Also, from a community perspective, a significant fraction of all stackoverflow numpy questions are (unknowingly) exactly about 'how to do grouping in numpy'.
What I'm trying to say is that numpy is a community project. We don't have a central planning committee, the only difference between "developers" and everyone else is activity and commit rights. Which is to say if you develop and push this topic it is likely to go in. There certainly seems to be interest in this functionality. The reason that I brought up scipy is that there are some graph algorithms there that went in a couple of years ago.
Note that the convention on the list is bottom posting.
<snip>
Chuck
On Mon, Sep 1, 2014 at 2:05 PM, Charles R Harris charlesr.harris@gmail.com wrote:
On Mon, Sep 1, 2014 at 1:49 AM, Eelco Hoogendoorn < hoogendoorn.eelco@gmail.com> wrote:
Sure, id like to do the hashing things out, but I would also like some preliminary feedback as to whether this is going in a direction anyone else sees the point of, if it conflicts with other plans, and indeed if we can agree that numpy is the right place for it; a point which I would very much like to defend. If there is some obvious nogo that im missing, I can do without the drudgery of writing proper documentation ;).
As for whether this belongs in numpy: yes, I would say so. There are the extension of functionality to functions already in numpy, which are a nobrainer (it need not cost anything performance wise, and ive needed unique graph edges many many times), and there is the grouping functionality, which is the main novelty.
However, note that the grouping functionality itself is a very small addition, just a few 100 lines of pure python, given that the indexing logic has been factored out of the classic arraysetops. At least from a developers perspective, it very much feels like a logical extension of the same 'thing'.
But also from a conceptual numpy perspective, grouping is really more an 'elementary manipulation of an ndarray' than a 'special purpose algorithm'. It is useful for literally all kinds of programming; hence there is similar functionality in the python standard library (itertools.groupby); so why not have an efficient vectorized equivalent in numpy? It belongs there more than the linalg module, arguably.
Also, from a community perspective, a significant fraction of all stackoverflow numpy questions are (unknowingly) exactly about 'how to do grouping in numpy'.
What I'm trying to say is that numpy is a community project. We don't have a central planning committee, the only difference between "developers" and everyone else is activity and commit rights. Which is to say if you develop and push this topic it is likely to go in. There certainly seems to be interest in this functionality. The reason that I brought up scipy is that there are some graph algorithms there that went in a couple of years ago.
Note that the convention on the list is bottom posting.
<snip>
Chuck
NumPyDiscussion mailing list NumPyDiscussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpydiscussion
I understand that numpy is a community project, so that the decision isn't up to any one particular person; but some early stage feedback from those active in the community would be welcome. I am generally confident that this addition makes sense, but I have not contributed to numpy before, and you don't know what you don't know and all... given that there are multiple suggestions for changing arraysetops, some coordination would be useful I think.
Note that I use graph edges merely as an example; the proposed functionality is much more general than graphing algorithms specifically. The radial reduction https://github.com/EelcoHoogendoorn/Numpy_arraysetops_EP/blob/master/examples.pyexample I included on github is particularly illustrative of the general utility of grouping functionality I think. Operations like radial reductions are rather common, and a custom implementation is quite lengthy, very bug prone, and potentially very slow.
Thanks for the heads up on posting convention; ive always let gmail do my thinking for me, which works well enough for me, but I can see how not following this convention is annoying to others.
On Mon, Sep 1, 2014 at 7:58 AM, Eelco Hoogendoorn < hoogendoorn.eelco@gmail.com> wrote:
On Mon, Sep 1, 2014 at 2:05 PM, Charles R Harris < charlesr.harris@gmail.com> wrote:
On Mon, Sep 1, 2014 at 1:49 AM, Eelco Hoogendoorn < hoogendoorn.eelco@gmail.com> wrote:
Sure, id like to do the hashing things out, but I would also like some preliminary feedback as to whether this is going in a direction anyone else sees the point of, if it conflicts with other plans, and indeed if we can agree that numpy is the right place for it; a point which I would very much like to defend. If there is some obvious nogo that im missing, I can do without the drudgery of writing proper documentation ;).
As for whether this belongs in numpy: yes, I would say so. There are the extension of functionality to functions already in numpy, which are a nobrainer (it need not cost anything performance wise, and ive needed unique graph edges many many times), and there is the grouping functionality, which is the main novelty.
However, note that the grouping functionality itself is a very small addition, just a few 100 lines of pure python, given that the indexing logic has been factored out of the classic arraysetops. At least from a developers perspective, it very much feels like a logical extension of the same 'thing'.
But also from a conceptual numpy perspective, grouping is really more an 'elementary manipulation of an ndarray' than a 'special purpose algorithm'. It is useful for literally all kinds of programming; hence there is similar functionality in the python standard library (itertools.groupby); so why not have an efficient vectorized equivalent in numpy? It belongs there more than the linalg module, arguably.
Also, from a community perspective, a significant fraction of all stackoverflow numpy questions are (unknowingly) exactly about 'how to do grouping in numpy'.
What I'm trying to say is that numpy is a community project. We don't have a central planning committee, the only difference between "developers" and everyone else is activity and commit rights. Which is to say if you develop and push this topic it is likely to go in. There certainly seems to be interest in this functionality. The reason that I brought up scipy is that there are some graph algorithms there that went in a couple of years ago.
Note that the convention on the list is bottom posting.
<snip>
Chuck
NumPyDiscussion mailing list NumPyDiscussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpydiscussion
I understand that numpy is a community project, so that the decision isn't up to any one particular person; but some early stage feedback from those active in the community would be welcome. I am generally confident that this addition makes sense, but I have not contributed to numpy before, and you don't know what you don't know and all... given that there are multiple suggestions for changing arraysetops, some coordination would be useful I think.
Note that I use graph edges merely as an example; the proposed functionality is much more general than graphing algorithms specifically. The radial reduction https://github.com/EelcoHoogendoorn/Numpy_arraysetops_EP/blob/master/examples.pyexample I included on github is particularly illustrative of the general utility of grouping functionality I think. Operations like radial reductions are rather common, and a custom implementation is quite lengthy, very bug prone, and potentially very slow.
Thanks for the heads up on posting convention; ive always let gmail do my thinking for me, which works well enough for me, but I can see how not following this convention is annoying to others.
What do you think about the suggestion of timsort? One would need to concatenate the arrays before sorting, but it should be fairly efficient.
Chuck
On Tue, Sep 2, 2014 at 5:40 PM, Charles R Harris charlesr.harris@gmail.com wrote:
What do you think about the suggestion of timsort? One would need to concatenate the arrays before sorting, but it should be fairly efficient.
Timsort is very cool, and it would definitely be fun to implement in numpy. It is also a lot more work that merging two sorted arrays! I think +1 if someone else does it, but although I would love to be part of it, I am not sure I will be able to find time to look into it seriously in the next couple of months.
From a setops point of view, merging two sorted arrays makes it very
straightforward to compute, together with (or instead of) the result of the merge, index arrays that let you calculate things like `in1d` faster. Although perhaps an `argtimsort` could provide the same functionality, not sure. I will probably wrap up what I have, put a lace on it, and submit it as a PR. Even if it is not destined to be merged, it may serve as a warning to others.
I have also been thinking lately that one of the problems with all these uniquerelated computations may be a case of having a hammer and seeing everything as nails. Perhaps the algorithm that needs to be ported from Python is not the sorting one, but the hash table...
Jaime
On Wed, Sep 3, 2014 at 4:07 AM, Jaime Fernández del Río < jaime.frio@gmail.com> wrote:
On Tue, Sep 2, 2014 at 5:40 PM, Charles R Harris < charlesr.harris@gmail.com> wrote:
What do you think about the suggestion of timsort? One would need to concatenate the arrays before sorting, but it should be fairly efficient.
Timsort is very cool, and it would definitely be fun to implement in numpy. It is also a lot more work that merging two sorted arrays! I think +1 if someone else does it, but although I would love to be part of it, I am not sure I will be able to find time to look into it seriously in the next couple of months.
From a setops point of view, merging two sorted arrays makes it very straightforward to compute, together with (or instead of) the result of the merge, index arrays that let you calculate things like `in1d` faster. Although perhaps an `argtimsort` could provide the same functionality, not sure. I will probably wrap up what I have, put a lace on it, and submit it as a PR. Even if it is not destined to be merged, it may serve as a warning to others.
I have also been thinking lately that one of the problems with all these uniquerelated computations may be a case of having a hammer and seeing everything as nails. Perhaps the algorithm that needs to be ported from Python is not the sorting one, but the hash table...
Jaime
Not sure about the hashing. Indeed one can also build an index of a set by means of a hash table, but its questionable if this leads to improved performance over performing an argsort. Hashing may have better asymptotic time complexity in theory, but many datasets used in practice are very easy to sort (O(N)ish), and the timeconstant of hashing is higher. But more importantly, using a hashtable guarantees poor cache behavior for many operations using this index. By contrast, sorting may (but need not) make one random access pass to build the index, and may (but need not) perform one random access to reorder values for grouping. But insofar as the keys are better behaved than pure random, this coherence will be exploited.
Also, getting the unique values/keys in sorted order is a nice sidebenefit for many applications.
On Wed, Sep 3, 2014 at 6:41 AM, Eelco Hoogendoorn < hoogendoorn.eelco@gmail.com> wrote:
Not sure about the hashing. Indeed one can also build an index of a set by means of a hash table, but its questionable if this leads to improved performance over performing an argsort. Hashing may have better asymptotic time complexity in theory, but many datasets used in practice are very easy to sort (O(N)ish), and the timeconstant of hashing is higher. But more importantly, using a hashtable guarantees poor cache behavior for many operations using this index. By contrast, sorting may (but need not) make one random access pass to build the index, and may (but need not) perform one random access to reorder values for grouping. But insofar as the keys are better behaved than pure random, this coherence will be exploited.
If you want to give it a try, these branch of my numpy fork has hash table based implementations of unique (with no extra indices) and in1d:
https://github.com/jaimefrio/numpy/tree/hashunique
A use cases where the hash table is clearly better:
In [1]: import numpy as np In [2]: from numpy.lib._compiled_base import _unique, _in1d
In [3]: a = np.random.randint(10, size=(10000,)) In [4]: %timeit np.unique(a) 1000 loops, best of 3: 258 us per loop In [5]: %timeit _unique(a) 10000 loops, best of 3: 143 us per loop In [6]: %timeit np.sort(_unique(a)) 10000 loops, best of 3: 149 us per loop
It typically performs between 1.5x and 4x faster than sorting. I haven't profiled it properly to know, but there may be quite a bit of performance to dig out: have type specific comparison functions, optimize the starting hash table size based on the size of the array to avoid reinsertions...
If getting the elements sorted is a necessity, and the array contains very few or no repeated items, then the hash table approach may even perform worse,:
In [8]: a = np.random.randint(10000, size=(5000,)) In [9]: %timeit np.unique(a) 1000 loops, best of 3: 277 us per loop In [10]: %timeit np.sort(_unique(a)) 1000 loops, best of 3: 320 us per loop
But the hash table still wins in extracting the unique items only:
In [11]: %timeit _unique(a) 10000 loops, best of 3: 187 us per loop
Where the hash table shines is in more elaborate situations. If you keep the first index where it was found, and the number of repeats, in the hash table, you can get return_index and return_counts almost for free, which means you are performing an extra 3x faster than with sorting. return_inverse requires a little more trickery, so I won;t attempt to quantify the improvement. But I wouldn't be surprised if, after fine tuning it, there is close to an order of magnitude overall improvement
The sppedup for in1d is also nice:
In [16]: a = np.random.randint(1000, size=(1000,)) In [17]: b = np.random.randint(1000, size=(500,)) In [18]: %timeit np.in1d(a, b) 1000 loops, best of 3: 178 us per loop In [19]: %timeit _in1d(a, b) 10000 loops, best of 3: 30.1 us per loop
Of course, there is no point in
On Wed, Sep 3, 2014 at 9:33 AM, Jaime Fernández del Río < jaime.frio@gmail.com> wrote:
On Wed, Sep 3, 2014 at 6:41 AM, Eelco Hoogendoorn < hoogendoorn.eelco@gmail.com> wrote:
Not sure about the hashing. Indeed one can also build an index of a set by means of a hash table, but its questionable if this leads to improved performance over performing an argsort. Hashing may have better asymptotic time complexity in theory, but many datasets used in practice are very easy to sort (O(N)ish), and the timeconstant of hashing is higher. But more importantly, using a hashtable guarantees poor cache behavior for many operations using this index. By contrast, sorting may (but need not) make one random access pass to build the index, and may (but need not) perform one random access to reorder values for grouping. But insofar as the keys are better behaved than pure random, this coherence will be exploited.
If you want to give it a try, these branch of my numpy fork has hash table based implementations of unique (with no extra indices) and in1d:
https://github.com/jaimefrio/numpy/tree/hashunique
A use cases where the hash table is clearly better:
In [1]: import numpy as np In [2]: from numpy.lib._compiled_base import _unique, _in1d
In [3]: a = np.random.randint(10, size=(10000,)) In [4]: %timeit np.unique(a) 1000 loops, best of 3: 258 us per loop In [5]: %timeit _unique(a) 10000 loops, best of 3: 143 us per loop In [6]: %timeit np.sort(_unique(a)) 10000 loops, best of 3: 149 us per loop
It typically performs between 1.5x and 4x faster than sorting. I haven't profiled it properly to know, but there may be quite a bit of performance to dig out: have type specific comparison functions, optimize the starting hash table size based on the size of the array to avoid reinsertions...
If getting the elements sorted is a necessity, and the array contains very few or no repeated items, then the hash table approach may even perform worse,:
In [8]: a = np.random.randint(10000, size=(5000,)) In [9]: %timeit np.unique(a) 1000 loops, best of 3: 277 us per loop In [10]: %timeit np.sort(_unique(a)) 1000 loops, best of 3: 320 us per loop
But the hash table still wins in extracting the unique items only:
In [11]: %timeit _unique(a) 10000 loops, best of 3: 187 us per loop
Where the hash table shines is in more elaborate situations. If you keep the first index where it was found, and the number of repeats, in the hash table, you can get return_index and return_counts almost for free, which means you are performing an extra 3x faster than with sorting. return_inverse requires a little more trickery, so I won;t attempt to quantify the improvement. But I wouldn't be surprised if, after fine tuning it, there is close to an order of magnitude overall improvement
The sppedup for in1d is also nice:
In [16]: a = np.random.randint(1000, size=(1000,)) In [17]: b = np.random.randint(1000, size=(500,)) In [18]: %timeit np.in1d(a, b) 1000 loops, best of 3: 178 us per loop In [19]: %timeit _in1d(a, b) 10000 loops, best of 3: 30.1 us per loop
Of course, there is no point in
Ooops!!! Hit the send button too quick. Not to extend myself too long: if we are going to rethink all of this, we should approach it with an open mind. Still, and this post is not helping with that either, I am afraid that we are discussing implementation details, but are missing a broader vision of what we want to accomplish and why. That vision of what numpy's grouping functionality, if any, should be, and how it complements or conflicts with what pandas is providing, should precede anything else. I now I haven't, but has anyone looked at how pandas implements grouping? Their documentation on the subject is well worth a read:
http://pandas.pydata.org/pandasdocs/stable/groupby.html
Does numpy need to replicate this? What/why/how can we add to that?
Jaime
On Wed, Sep 3, 2014 at 6:46 PM, Jaime Fernández del Río < jaime.frio@gmail.com> wrote:
On Wed, Sep 3, 2014 at 9:33 AM, Jaime Fernández del Río < jaime.frio@gmail.com> wrote:
On Wed, Sep 3, 2014 at 6:41 AM, Eelco Hoogendoorn < hoogendoorn.eelco@gmail.com> wrote:
Not sure about the hashing. Indeed one can also build an index of a set by means of a hash table, but its questionable if this leads to improved performance over performing an argsort. Hashing may have better asymptotic time complexity in theory, but many datasets used in practice are very easy to sort (O(N)ish), and the timeconstant of hashing is higher. But more importantly, using a hashtable guarantees poor cache behavior for many operations using this index. By contrast, sorting may (but need not) make one random access pass to build the index, and may (but need not) perform one random access to reorder values for grouping. But insofar as the keys are better behaved than pure random, this coherence will be exploited.
If you want to give it a try, these branch of my numpy fork has hash table based implementations of unique (with no extra indices) and in1d:
https://github.com/jaimefrio/numpy/tree/hashunique
A use cases where the hash table is clearly better:
In [1]: import numpy as np In [2]: from numpy.lib._compiled_base import _unique, _in1d
In [3]: a = np.random.randint(10, size=(10000,)) In [4]: %timeit np.unique(a) 1000 loops, best of 3: 258 us per loop In [5]: %timeit _unique(a) 10000 loops, best of 3: 143 us per loop In [6]: %timeit np.sort(_unique(a)) 10000 loops, best of 3: 149 us per loop
It typically performs between 1.5x and 4x faster than sorting. I haven't profiled it properly to know, but there may be quite a bit of performance to dig out: have type specific comparison functions, optimize the starting hash table size based on the size of the array to avoid reinsertions...
If getting the elements sorted is a necessity, and the array contains very few or no repeated items, then the hash table approach may even perform worse,:
In [8]: a = np.random.randint(10000, size=(5000,)) In [9]: %timeit np.unique(a) 1000 loops, best of 3: 277 us per loop In [10]: %timeit np.sort(_unique(a)) 1000 loops, best of 3: 320 us per loop
But the hash table still wins in extracting the unique items only:
In [11]: %timeit _unique(a) 10000 loops, best of 3: 187 us per loop
Where the hash table shines is in more elaborate situations. If you keep the first index where it was found, and the number of repeats, in the hash table, you can get return_index and return_counts almost for free, which means you are performing an extra 3x faster than with sorting. return_inverse requires a little more trickery, so I won;t attempt to quantify the improvement. But I wouldn't be surprised if, after fine tuning it, there is close to an order of magnitude overall improvement
The sppedup for in1d is also nice:
In [16]: a = np.random.randint(1000, size=(1000,)) In [17]: b = np.random.randint(1000, size=(500,)) In [18]: %timeit np.in1d(a, b) 1000 loops, best of 3: 178 us per loop In [19]: %timeit _in1d(a, b) 10000 loops, best of 3: 30.1 us per loop
Of course, there is no point in
Ooops!!! Hit the send button too quick. Not to extend myself too long: if we are going to rethink all of this, we should approach it with an open mind. Still, and this post is not helping with that either, I am afraid that we are discussing implementation details, but are missing a broader vision of what we want to accomplish and why. That vision of what numpy's grouping functionality, if any, should be, and how it complements or conflicts with what pandas is providing, should precede anything else. I now I haven't, but has anyone looked at how pandas implements grouping? Their documentation on the subject is well worth a read:
http://pandas.pydata.org/pandasdocs/stable/groupby.html
Does numpy need to replicate this? What/why/how can we add to that?
Jaime
 (__/) ( O.o) ( > <) Este es Conejo. Copia a Conejo en tu firma y ayúdale en sus planes de dominación mundial.
NumPyDiscussion mailing list NumPyDiscussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpydiscussion
I would certainly not be opposed to having a hashing based indexing
mechanism; I think it would make sense designwise to have a HashIndex class with the same interface as the rest, and use that subclass in those arraysetops where it makes sense. The 'how to' of indexing and its applications are largely orthogonal I think (with some tiny performance compromises which are worth the abstraction imo). For datasets which are not purely random, have many unique items, and which do not fit into cache, I would expect sorting to come out on top, but indeed it depends on the dataset.
Yeah, the question how pandas does grouping, and whether we can do better, is a relevant one.
From what I understand, pandas relies on cython extensions to get
vectorized grouping functionality. This is no longer necessary since the introduction of ufuncs in numpy. I don't know how the implementations compare in terms of performance, but I doubt the difference is huge.
I personally use grouping a lot in my code, and I don't like having to use pandas for it. Most importantly, I don't want to go around creating a dataframe for a single oneline hitandrun association between keys and values. The permanent association of different types of data and their metadata which pandas offers is I think the key difference from numpy, which is all about manipulating just plain ndarrays. Arguably, grouping itself is a pretty elementary manipulating of ndarrays, and adding calls to DataFrame or Series inbetween a statement that could just be simply group_by(keys).mean(values) feels wrong to me. As does including pandas as a dependency just to use this small piece of functionality. Grouping is a more general functionality than any particular method of organizing your data.
In terms of features, adding transformations and filtering might be nice too; I hadn't thought about it, but that is because unlike the currently implemented features, the need has never arose for me. Im only a small sample size, and I don't see any fundamental objection to adding such functionality though. It certainly raises the question as to where to draw the line with pandas; but my rule of thumb is that if you can think of it as an elementary operation on ndarrays, then it probably belongs in numpy.
On Thu, Sep 4, 2014 at 10:31 AM, Eelco Hoogendoorn < hoogendoorn.eelco@gmail.com> wrote:
On Wed, Sep 3, 2014 at 6:46 PM, Jaime Fernández del Río < jaime.frio@gmail.com> wrote:
On Wed, Sep 3, 2014 at 9:33 AM, Jaime Fernández del Río < jaime.frio@gmail.com> wrote:
On Wed, Sep 3, 2014 at 6:41 AM, Eelco Hoogendoorn < hoogendoorn.eelco@gmail.com> wrote:
Not sure about the hashing. Indeed one can also build an index of a set by means of a hash table, but its questionable if this leads to improved performance over performing an argsort. Hashing may have better asymptotic time complexity in theory, but many datasets used in practice are very easy to sort (O(N)ish), and the timeconstant of hashing is higher. But more importantly, using a hashtable guarantees poor cache behavior for many operations using this index. By contrast, sorting may (but need not) make one random access pass to build the index, and may (but need not) perform one random access to reorder values for grouping. But insofar as the keys are better behaved than pure random, this coherence will be exploited.
If you want to give it a try, these branch of my numpy fork has hash table based implementations of unique (with no extra indices) and in1d:
https://github.com/jaimefrio/numpy/tree/hashunique
A use cases where the hash table is clearly better:
In [1]: import numpy as np In [2]: from numpy.lib._compiled_base import _unique, _in1d
In [3]: a = np.random.randint(10, size=(10000,)) In [4]: %timeit np.unique(a) 1000 loops, best of 3: 258 us per loop In [5]: %timeit _unique(a) 10000 loops, best of 3: 143 us per loop In [6]: %timeit np.sort(_unique(a)) 10000 loops, best of 3: 149 us per loop
It typically performs between 1.5x and 4x faster than sorting. I haven't profiled it properly to know, but there may be quite a bit of performance to dig out: have type specific comparison functions, optimize the starting hash table size based on the size of the array to avoid reinsertions...
If getting the elements sorted is a necessity, and the array contains very few or no repeated items, then the hash table approach may even perform worse,:
In [8]: a = np.random.randint(10000, size=(5000,)) In [9]: %timeit np.unique(a) 1000 loops, best of 3: 277 us per loop In [10]: %timeit np.sort(_unique(a)) 1000 loops, best of 3: 320 us per loop
But the hash table still wins in extracting the unique items only:
In [11]: %timeit _unique(a) 10000 loops, best of 3: 187 us per loop
Where the hash table shines is in more elaborate situations. If you keep the first index where it was found, and the number of repeats, in the hash table, you can get return_index and return_counts almost for free, which means you are performing an extra 3x faster than with sorting. return_inverse requires a little more trickery, so I won;t attempt to quantify the improvement. But I wouldn't be surprised if, after fine tuning it, there is close to an order of magnitude overall improvement
The sppedup for in1d is also nice:
In [16]: a = np.random.randint(1000, size=(1000,)) In [17]: b = np.random.randint(1000, size=(500,)) In [18]: %timeit np.in1d(a, b) 1000 loops, best of 3: 178 us per loop In [19]: %timeit _in1d(a, b) 10000 loops, best of 3: 30.1 us per loop
Of course, there is no point in
Ooops!!! Hit the send button too quick. Not to extend myself too long: if we are going to rethink all of this, we should approach it with an open mind. Still, and this post is not helping with that either, I am afraid that we are discussing implementation details, but are missing a broader vision of what we want to accomplish and why. That vision of what numpy's grouping functionality, if any, should be, and how it complements or conflicts with what pandas is providing, should precede anything else. I now I haven't, but has anyone looked at how pandas implements grouping? Their documentation on the subject is well worth a read:
http://pandas.pydata.org/pandasdocs/stable/groupby.html
Does numpy need to replicate this? What/why/how can we add to that?
Jaime
 (__/) ( O.o) ( > <) Este es Conejo. Copia a Conejo en tu firma y ayúdale en sus planes de dominación mundial.
NumPyDiscussion mailing list NumPyDiscussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpydiscussion
I would certainly not be opposed to having a hashing based indexing
mechanism; I think it would make sense designwise to have a HashIndex class with the same interface as the rest, and use that subclass in those arraysetops where it makes sense. The 'how to' of indexing and its applications are largely orthogonal I think (with some tiny performance compromises which are worth the abstraction imo). For datasets which are not purely random, have many unique items, and which do not fit into cache, I would expect sorting to come out on top, but indeed it depends on the dataset.
Yeah, the question how pandas does grouping, and whether we can do better, is a relevant one.
From what I understand, pandas relies on cython extensions to get vectorized grouping functionality. This is no longer necessary since the introduction of ufuncs in numpy. I don't know how the implementations compare in terms of performance, but I doubt the difference is huge.
I personally use grouping a lot in my code, and I don't like having to use pandas for it. Most importantly, I don't want to go around creating a dataframe for a single oneline hitandrun association between keys and values. The permanent association of different types of data and their metadata which pandas offers is I think the key difference from numpy, which is all about manipulating just plain ndarrays. Arguably, grouping itself is a pretty elementary manipulating of ndarrays, and adding calls to DataFrame or Series inbetween a statement that could just be simply group_by(keys).mean(values) feels wrong to me. As does including pandas as a dependency just to use this small piece of functionality. Grouping is a more general functionality than any particular method of organizing your data.
In terms of features, adding transformations and filtering might be nice too; I hadn't thought about it, but that is because unlike the currently implemented features, the need has never arose for me. Im only a small sample size, and I don't see any fundamental objection to adding such functionality though. It certainly raises the question as to where to draw the line with pandas; but my rule of thumb is that if you can think of it as an elementary operation on ndarrays, then it probably belongs in numpy.
Oh I forgot to add: with an indexing mechanism based on sorting, unique values and counts also come 'for free', not counting the O(N) cost of actually creating those arrays. The only time an operating relying on an index incurs another nontrivial amount of overhead is in case its 'rank' or 'inverse' property is used, which invokes another argsort. But for the vast majority of grouping or set operations, these properties are never used.
On Thu, Sep 4, 2014 at 10:39 AM, Eelco Hoogendoorn < hoogendoorn.eelco@gmail.com> wrote:
On Thu, Sep 4, 2014 at 10:31 AM, Eelco Hoogendoorn < hoogendoorn.eelco@gmail.com> wrote:
On Wed, Sep 3, 2014 at 6:46 PM, Jaime Fernández del Río < jaime.frio@gmail.com> wrote:
On Wed, Sep 3, 2014 at 9:33 AM, Jaime Fernández del Río < jaime.frio@gmail.com> wrote:
On Wed, Sep 3, 2014 at 6:41 AM, Eelco Hoogendoorn < hoogendoorn.eelco@gmail.com> wrote:
Not sure about the hashing. Indeed one can also build an index of a set by means of a hash table, but its questionable if this leads to improved performance over performing an argsort. Hashing may have better asymptotic time complexity in theory, but many datasets used in practice are very easy to sort (O(N)ish), and the timeconstant of hashing is higher. But more importantly, using a hashtable guarantees poor cache behavior for many operations using this index. By contrast, sorting may (but need not) make one random access pass to build the index, and may (but need not) perform one random access to reorder values for grouping. But insofar as the keys are better behaved than pure random, this coherence will be exploited.
If you want to give it a try, these branch of my numpy fork has hash table based implementations of unique (with no extra indices) and in1d:
https://github.com/jaimefrio/numpy/tree/hashunique
A use cases where the hash table is clearly better:
In [1]: import numpy as np In [2]: from numpy.lib._compiled_base import _unique, _in1d
In [3]: a = np.random.randint(10, size=(10000,)) In [4]: %timeit np.unique(a) 1000 loops, best of 3: 258 us per loop In [5]: %timeit _unique(a) 10000 loops, best of 3: 143 us per loop In [6]: %timeit np.sort(_unique(a)) 10000 loops, best of 3: 149 us per loop
It typically performs between 1.5x and 4x faster than sorting. I haven't profiled it properly to know, but there may be quite a bit of performance to dig out: have type specific comparison functions, optimize the starting hash table size based on the size of the array to avoid reinsertions...
If getting the elements sorted is a necessity, and the array contains very few or no repeated items, then the hash table approach may even perform worse,:
In [8]: a = np.random.randint(10000, size=(5000,)) In [9]: %timeit np.unique(a) 1000 loops, best of 3: 277 us per loop In [10]: %timeit np.sort(_unique(a)) 1000 loops, best of 3: 320 us per loop
But the hash table still wins in extracting the unique items only:
In [11]: %timeit _unique(a) 10000 loops, best of 3: 187 us per loop
Where the hash table shines is in more elaborate situations. If you keep the first index where it was found, and the number of repeats, in the hash table, you can get return_index and return_counts almost for free, which means you are performing an extra 3x faster than with sorting. return_inverse requires a little more trickery, so I won;t attempt to quantify the improvement. But I wouldn't be surprised if, after fine tuning it, there is close to an order of magnitude overall improvement
The sppedup for in1d is also nice:
In [16]: a = np.random.randint(1000, size=(1000,)) In [17]: b = np.random.randint(1000, size=(500,)) In [18]: %timeit np.in1d(a, b) 1000 loops, best of 3: 178 us per loop In [19]: %timeit _in1d(a, b) 10000 loops, best of 3: 30.1 us per loop
Of course, there is no point in
Ooops!!! Hit the send button too quick. Not to extend myself too long: if we are going to rethink all of this, we should approach it with an open mind. Still, and this post is not helping with that either, I am afraid that we are discussing implementation details, but are missing a broader vision of what we want to accomplish and why. That vision of what numpy's grouping functionality, if any, should be, and how it complements or conflicts with what pandas is providing, should precede anything else. I now I haven't, but has anyone looked at how pandas implements grouping? Their documentation on the subject is well worth a read:
http://pandas.pydata.org/pandasdocs/stable/groupby.html
Does numpy need to replicate this? What/why/how can we add to that?
Jaime
 (__/) ( O.o) ( > <) Este es Conejo. Copia a Conejo en tu firma y ayúdale en sus planes de dominación mundial.
NumPyDiscussion mailing list NumPyDiscussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpydiscussion
I would certainly not be opposed to having a hashing based indexing
mechanism; I think it would make sense designwise to have a HashIndex class with the same interface as the rest, and use that subclass in those arraysetops where it makes sense. The 'how to' of indexing and its applications are largely orthogonal I think (with some tiny performance compromises which are worth the abstraction imo). For datasets which are not purely random, have many unique items, and which do not fit into cache, I would expect sorting to come out on top, but indeed it depends on the dataset.
Yeah, the question how pandas does grouping, and whether we can do better, is a relevant one.
From what I understand, pandas relies on cython extensions to get vectorized grouping functionality. This is no longer necessary since the introduction of ufuncs in numpy. I don't know how the implementations compare in terms of performance, but I doubt the difference is huge.
I personally use grouping a lot in my code, and I don't like having to use pandas for it. Most importantly, I don't want to go around creating a dataframe for a single oneline hitandrun association between keys and values. The permanent association of different types of data and their metadata which pandas offers is I think the key difference from numpy, which is all about manipulating just plain ndarrays. Arguably, grouping itself is a pretty elementary manipulating of ndarrays, and adding calls to DataFrame or Series inbetween a statement that could just be simply group_by(keys).mean(values) feels wrong to me. As does including pandas as a dependency just to use this small piece of functionality. Grouping is a more general functionality than any particular method of organizing your data.
In terms of features, adding transformations and filtering might be nice too; I hadn't thought about it, but that is because unlike the currently implemented features, the need has never arose for me. Im only a small sample size, and I don't see any fundamental objection to adding such functionality though. It certainly raises the question as to where to draw the line with pandas; but my rule of thumb is that if you can think of it as an elementary operation on ndarrays, then it probably belongs in numpy.
Oh I forgot to add: with an indexing mechanism based on sorting, unique values and counts also come 'for free', not counting the O(N) cost of actually creating those arrays. The only time an operating relying on an index incurs another nontrivial amount of overhead is in case its 'rank' or 'inverse' property is used, which invokes another argsort. But for the vast majority of grouping or set operations, these properties are never used.
That extra argsort is now gone from master:
https://github.com/numpy/numpy/pull/5012
Even with this improvement, returning any index typically makes `np.unique` run at least 2x slower:
In [1]: import numpy as np In [2]: a = np.random.randint(100, size=(1000,)) In [3]: %timeit np.unique(a) 10000 loops, best of 3: 37.3 us per loop In [4]: %timeit np.unique(a, return_inverse=True) 10000 loops, best of 3: 62.1 us per loop In [5]: %timeit np.unique(a, return_index=True) 10000 loops, best of 3: 72.8 us per loop In [6]: %timeit np.unique(a, return_counts=True) 10000 loops, best of 3: 56.4 us per loop In [7]: %timeit np.unique(a, return_index=True, return_inverse=True, return_coun ts=True) 10000 loops, best of 3: 112 us per loop
I should clarify: I am speaking about my implementation, I havnt looked at the numpy implementation for a while so im not sure what it is up to. Note that by 'almost free', we are still talking about three passes over the whole array plus temp allocations, but I am assuming a usecase where the various sorts involved are the dominant cost, which I imagine they are, for outofcache sorts. Perhaps this isn't too realistic an assumption about the average use case though, I don't know. Though I suppose its a reasonable guideline to assume that either the dataset is big, or performance isn't that big a concern in the first place.
On Thu, Sep 4, 2014 at 7:55 PM, Jaime Fernández del Río < jaime.frio@gmail.com> wrote:
On Thu, Sep 4, 2014 at 10:39 AM, Eelco Hoogendoorn < hoogendoorn.eelco@gmail.com> wrote:
On Thu, Sep 4, 2014 at 10:31 AM, Eelco Hoogendoorn < hoogendoorn.eelco@gmail.com> wrote:
On Wed, Sep 3, 2014 at 6:46 PM, Jaime Fernández del Río < jaime.frio@gmail.com> wrote:
On Wed, Sep 3, 2014 at 9:33 AM, Jaime Fernández del Río < jaime.frio@gmail.com> wrote:
On Wed, Sep 3, 2014 at 6:41 AM, Eelco Hoogendoorn < hoogendoorn.eelco@gmail.com> wrote:
Not sure about the hashing. Indeed one can also build an index of a set by means of a hash table, but its questionable if this leads to improved performance over performing an argsort. Hashing may have better asymptotic time complexity in theory, but many datasets used in practice are very easy to sort (O(N)ish), and the timeconstant of hashing is higher. But more importantly, using a hashtable guarantees poor cache behavior for many operations using this index. By contrast, sorting may (but need not) make one random access pass to build the index, and may (but need not) perform one random access to reorder values for grouping. But insofar as the keys are better behaved than pure random, this coherence will be exploited.
If you want to give it a try, these branch of my numpy fork has hash table based implementations of unique (with no extra indices) and in1d:
https://github.com/jaimefrio/numpy/tree/hashunique
A use cases where the hash table is clearly better:
In [1]: import numpy as np In [2]: from numpy.lib._compiled_base import _unique, _in1d
In [3]: a = np.random.randint(10, size=(10000,)) In [4]: %timeit np.unique(a) 1000 loops, best of 3: 258 us per loop In [5]: %timeit _unique(a) 10000 loops, best of 3: 143 us per loop In [6]: %timeit np.sort(_unique(a)) 10000 loops, best of 3: 149 us per loop
It typically performs between 1.5x and 4x faster than sorting. I haven't profiled it properly to know, but there may be quite a bit of performance to dig out: have type specific comparison functions, optimize the starting hash table size based on the size of the array to avoid reinsertions...
If getting the elements sorted is a necessity, and the array contains very few or no repeated items, then the hash table approach may even perform worse,:
In [8]: a = np.random.randint(10000, size=(5000,)) In [9]: %timeit np.unique(a) 1000 loops, best of 3: 277 us per loop In [10]: %timeit np.sort(_unique(a)) 1000 loops, best of 3: 320 us per loop
But the hash table still wins in extracting the unique items only:
In [11]: %timeit _unique(a) 10000 loops, best of 3: 187 us per loop
Where the hash table shines is in more elaborate situations. If you keep the first index where it was found, and the number of repeats, in the hash table, you can get return_index and return_counts almost for free, which means you are performing an extra 3x faster than with sorting. return_inverse requires a little more trickery, so I won;t attempt to quantify the improvement. But I wouldn't be surprised if, after fine tuning it, there is close to an order of magnitude overall improvement
The sppedup for in1d is also nice:
In [16]: a = np.random.randint(1000, size=(1000,)) In [17]: b = np.random.randint(1000, size=(500,)) In [18]: %timeit np.in1d(a, b) 1000 loops, best of 3: 178 us per loop In [19]: %timeit _in1d(a, b) 10000 loops, best of 3: 30.1 us per loop
Of course, there is no point in
Ooops!!! Hit the send button too quick. Not to extend myself too long: if we are going to rethink all of this, we should approach it with an open mind. Still, and this post is not helping with that either, I am afraid that we are discussing implementation details, but are missing a broader vision of what we want to accomplish and why. That vision of what numpy's grouping functionality, if any, should be, and how it complements or conflicts with what pandas is providing, should precede anything else. I now I haven't, but has anyone looked at how pandas implements grouping? Their documentation on the subject is well worth a read:
http://pandas.pydata.org/pandasdocs/stable/groupby.html
Does numpy need to replicate this? What/why/how can we add to that?
Jaime
 (__/) ( O.o) ( > <) Este es Conejo. Copia a Conejo en tu firma y ayúdale en sus planes de dominación mundial.
NumPyDiscussion mailing list NumPyDiscussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpydiscussion
I would certainly not be opposed to having a hashing based indexing
mechanism; I think it would make sense designwise to have a HashIndex class with the same interface as the rest, and use that subclass in those arraysetops where it makes sense. The 'how to' of indexing and its applications are largely orthogonal I think (with some tiny performance compromises which are worth the abstraction imo). For datasets which are not purely random, have many unique items, and which do not fit into cache, I would expect sorting to come out on top, but indeed it depends on the dataset.
Yeah, the question how pandas does grouping, and whether we can do better, is a relevant one.
From what I understand, pandas relies on cython extensions to get vectorized grouping functionality. This is no longer necessary since the introduction of ufuncs in numpy. I don't know how the implementations compare in terms of performance, but I doubt the difference is huge.
I personally use grouping a lot in my code, and I don't like having to use pandas for it. Most importantly, I don't want to go around creating a dataframe for a single oneline hitandrun association between keys and values. The permanent association of different types of data and their metadata which pandas offers is I think the key difference from numpy, which is all about manipulating just plain ndarrays. Arguably, grouping itself is a pretty elementary manipulating of ndarrays, and adding calls to DataFrame or Series inbetween a statement that could just be simply group_by(keys).mean(values) feels wrong to me. As does including pandas as a dependency just to use this small piece of functionality. Grouping is a more general functionality than any particular method of organizing your data.
In terms of features, adding transformations and filtering might be nice too; I hadn't thought about it, but that is because unlike the currently implemented features, the need has never arose for me. Im only a small sample size, and I don't see any fundamental objection to adding such functionality though. It certainly raises the question as to where to draw the line with pandas; but my rule of thumb is that if you can think of it as an elementary operation on ndarrays, then it probably belongs in numpy.
Oh I forgot to add: with an indexing mechanism based on sorting, unique values and counts also come 'for free', not counting the O(N) cost of actually creating those arrays. The only time an operating relying on an index incurs another nontrivial amount of overhead is in case its 'rank' or 'inverse' property is used, which invokes another argsort. But for the vast majority of grouping or set operations, these properties are never used.
That extra argsort is now gone from master:
https://github.com/numpy/numpy/pull/5012
Even with this improvement, returning any index typically makes `np.unique` run at least 2x slower:
In [1]: import numpy as np In [2]: a = np.random.randint(100, size=(1000,)) In [3]: %timeit np.unique(a) 10000 loops, best of 3: 37.3 us per loop In [4]: %timeit np.unique(a, return_inverse=True) 10000 loops, best of 3: 62.1 us per loop In [5]: %timeit np.unique(a, return_index=True) 10000 loops, best of 3: 72.8 us per loop In [6]: %timeit np.unique(a, return_counts=True) 10000 loops, best of 3: 56.4 us per loop In [7]: %timeit np.unique(a, return_index=True, return_inverse=True, return_coun ts=True) 10000 loops, best of 3: 112 us per loop
 (__/) ( O.o) ( > <) Este es Conejo. Copia a Conejo en tu firma y ayúdale en sus planes de dominación mundial.
NumPyDiscussion mailing list NumPyDiscussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpydiscussion
On Thu, Sep 4, 2014 at 8:14 PM, Eelco Hoogendoorn < hoogendoorn.eelco@gmail.com> wrote:
I should clarify: I am speaking about my implementation, I havnt looked at the numpy implementation for a while so im not sure what it is up to. Note that by 'almost free', we are still talking about three passes over the whole array plus temp allocations, but I am assuming a usecase where the various sorts involved are the dominant cost, which I imagine they are, for outofcache sorts. Perhaps this isn't too realistic an assumption about the average use case though, I don't know. Though I suppose its a reasonable guideline to assume that either the dataset is big, or performance isn't that big a concern in the first place.
On Thu, Sep 4, 2014 at 7:55 PM, Jaime Fernández del Río < jaime.frio@gmail.com> wrote:
On Thu, Sep 4, 2014 at 10:39 AM, Eelco Hoogendoorn < hoogendoorn.eelco@gmail.com> wrote:
On Thu, Sep 4, 2014 at 10:31 AM, Eelco Hoogendoorn < hoogendoorn.eelco@gmail.com> wrote:
On Wed, Sep 3, 2014 at 6:46 PM, Jaime Fernández del Río < jaime.frio@gmail.com> wrote:
On Wed, Sep 3, 2014 at 9:33 AM, Jaime Fernández del Río < jaime.frio@gmail.com> wrote:
On Wed, Sep 3, 2014 at 6:41 AM, Eelco Hoogendoorn < hoogendoorn.eelco@gmail.com> wrote:
> Not sure about the hashing. Indeed one can also build an index of a > set by means of a hash table, but its questionable if this leads to > improved performance over performing an argsort. Hashing may have better > asymptotic time complexity in theory, but many datasets used in practice > are very easy to sort (O(N)ish), and the timeconstant of hashing is > higher. But more importantly, using a hashtable guarantees poor cache > behavior for many operations using this index. By contrast, sorting may > (but need not) make one random access pass to build the index, and may (but > need not) perform one random access to reorder values for grouping. But > insofar as the keys are better behaved than pure random, this coherence > will be exploited. >
If you want to give it a try, these branch of my numpy fork has hash table based implementations of unique (with no extra indices) and in1d:
https://github.com/jaimefrio/numpy/tree/hashunique
A use cases where the hash table is clearly better:
In [1]: import numpy as np In [2]: from numpy.lib._compiled_base import _unique, _in1d
In [3]: a = np.random.randint(10, size=(10000,)) In [4]: %timeit np.unique(a) 1000 loops, best of 3: 258 us per loop In [5]: %timeit _unique(a) 10000 loops, best of 3: 143 us per loop In [6]: %timeit np.sort(_unique(a)) 10000 loops, best of 3: 149 us per loop
It typically performs between 1.5x and 4x faster than sorting. I haven't profiled it properly to know, but there may be quite a bit of performance to dig out: have type specific comparison functions, optimize the starting hash table size based on the size of the array to avoid reinsertions...
If getting the elements sorted is a necessity, and the array contains very few or no repeated items, then the hash table approach may even perform worse,:
In [8]: a = np.random.randint(10000, size=(5000,)) In [9]: %timeit np.unique(a) 1000 loops, best of 3: 277 us per loop In [10]: %timeit np.sort(_unique(a)) 1000 loops, best of 3: 320 us per loop
But the hash table still wins in extracting the unique items only:
In [11]: %timeit _unique(a) 10000 loops, best of 3: 187 us per loop
Where the hash table shines is in more elaborate situations. If you keep the first index where it was found, and the number of repeats, in the hash table, you can get return_index and return_counts almost for free, which means you are performing an extra 3x faster than with sorting. return_inverse requires a little more trickery, so I won;t attempt to quantify the improvement. But I wouldn't be surprised if, after fine tuning it, there is close to an order of magnitude overall improvement
The sppedup for in1d is also nice:
In [16]: a = np.random.randint(1000, size=(1000,)) In [17]: b = np.random.randint(1000, size=(500,)) In [18]: %timeit np.in1d(a, b) 1000 loops, best of 3: 178 us per loop In [19]: %timeit _in1d(a, b) 10000 loops, best of 3: 30.1 us per loop
Of course, there is no point in
Ooops!!! Hit the send button too quick. Not to extend myself too long: if we are going to rethink all of this, we should approach it with an open mind. Still, and this post is not helping with that either, I am afraid that we are discussing implementation details, but are missing a broader vision of what we want to accomplish and why. That vision of what numpy's grouping functionality, if any, should be, and how it complements or conflicts with what pandas is providing, should precede anything else. I now I haven't, but has anyone looked at how pandas implements grouping? Their documentation on the subject is well worth a read:
http://pandas.pydata.org/pandasdocs/stable/groupby.html
Does numpy need to replicate this? What/why/how can we add to that?
Jaime
 (__/) ( O.o) ( > <) Este es Conejo. Copia a Conejo en tu firma y ayúdale en sus planes de dominación mundial.
NumPyDiscussion mailing list NumPyDiscussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpydiscussion
I would certainly not be opposed to having a hashing based indexing
mechanism; I think it would make sense designwise to have a HashIndex class with the same interface as the rest, and use that subclass in those arraysetops where it makes sense. The 'how to' of indexing and its applications are largely orthogonal I think (with some tiny performance compromises which are worth the abstraction imo). For datasets which are not purely random, have many unique items, and which do not fit into cache, I would expect sorting to come out on top, but indeed it depends on the dataset.
Yeah, the question how pandas does grouping, and whether we can do better, is a relevant one.
From what I understand, pandas relies on cython extensions to get vectorized grouping functionality. This is no longer necessary since the introduction of ufuncs in numpy. I don't know how the implementations compare in terms of performance, but I doubt the difference is huge.
I personally use grouping a lot in my code, and I don't like having to use pandas for it. Most importantly, I don't want to go around creating a dataframe for a single oneline hitandrun association between keys and values. The permanent association of different types of data and their metadata which pandas offers is I think the key difference from numpy, which is all about manipulating just plain ndarrays. Arguably, grouping itself is a pretty elementary manipulating of ndarrays, and adding calls to DataFrame or Series inbetween a statement that could just be simply group_by(keys).mean(values) feels wrong to me. As does including pandas as a dependency just to use this small piece of functionality. Grouping is a more general functionality than any particular method of organizing your data.
In terms of features, adding transformations and filtering might be nice too; I hadn't thought about it, but that is because unlike the currently implemented features, the need has never arose for me. Im only a small sample size, and I don't see any fundamental objection to adding such functionality though. It certainly raises the question as to where to draw the line with pandas; but my rule of thumb is that if you can think of it as an elementary operation on ndarrays, then it probably belongs in numpy.
Oh I forgot to add: with an indexing mechanism based on sorting, unique values and counts also come 'for free', not counting the O(N) cost of actually creating those arrays. The only time an operating relying on an index incurs another nontrivial amount of overhead is in case its 'rank' or 'inverse' property is used, which invokes another argsort. But for the vast majority of grouping or set operations, these properties are never used.
That extra argsort is now gone from master:
https://github.com/numpy/numpy/pull/5012
Even with this improvement, returning any index typically makes `np.unique` run at least 2x slower:
In [1]: import numpy as np In [2]: a = np.random.randint(100, size=(1000,)) In [3]: %timeit np.unique(a) 10000 loops, best of 3: 37.3 us per loop In [4]: %timeit np.unique(a, return_inverse=True) 10000 loops, best of 3: 62.1 us per loop In [5]: %timeit np.unique(a, return_index=True) 10000 loops, best of 3: 72.8 us per loop In [6]: %timeit np.unique(a, return_counts=True) 10000 loops, best of 3: 56.4 us per loop In [7]: %timeit np.unique(a, return_index=True, return_inverse=True, return_coun ts=True) 10000 loops, best of 3: 112 us per loop
 (__/) ( O.o) ( > <) Este es Conejo. Copia a Conejo en tu firma y ayúdale en sus planes de dominación mundial.
NumPyDiscussion mailing list NumPyDiscussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpydiscussion
Yeah, I looked at the numpy implementation, and it seems these speed differences are simply the result of the extra O(N) costs involved, so my implementation would have the same characteristics. If another array copy or two has meaningful impact on performance, then you are done as far as optimization within numpy is concerned, id say. You could fuse those loops on the C level, but as you said I think its better to think about these kind of optimizations once we have a more complete picture of the functionality we want.
Good call on removing the extra argsort, that hadn't occurred to me.
FYI pandas DOES use a very performant hash table impl for unique (and value_counts). Sorted state IS maintained by underlying Index implmentation. https://github.com/pydata/pandas/blob/master/pandas/hashtable.pyx
In [8]: a = np.random.randint(10, size=(10000,))
In [9]: %timeit np.unique(a) 1000 loops, best of 3: 284 µs per loop
In [10]: %timeit Series(a).unique() 10000 loops, best of 3: 161 µs per loop
In [11]: s = Series(a)
# without the creation overhead In [12]: %timeit s.unique() 10000 loops, best of 3: 75.3 µs per loop
On Thu, Sep 4, 2014 at 2:29 PM, Eelco Hoogendoorn < hoogendoorn.eelco@gmail.com> wrote:
On Thu, Sep 4, 2014 at 8:14 PM, Eelco Hoogendoorn < hoogendoorn.eelco@gmail.com> wrote:
I should clarify: I am speaking about my implementation, I havnt looked at the numpy implementation for a while so im not sure what it is up to. Note that by 'almost free', we are still talking about three passes over the whole array plus temp allocations, but I am assuming a usecase where the various sorts involved are the dominant cost, which I imagine they are, for outofcache sorts. Perhaps this isn't too realistic an assumption about the average use case though, I don't know. Though I suppose its a reasonable guideline to assume that either the dataset is big, or performance isn't that big a concern in the first place.
On Thu, Sep 4, 2014 at 7:55 PM, Jaime Fernández del Río < jaime.frio@gmail.com> wrote:
On Thu, Sep 4, 2014 at 10:39 AM, Eelco Hoogendoorn < hoogendoorn.eelco@gmail.com> wrote:
On Thu, Sep 4, 2014 at 10:31 AM, Eelco Hoogendoorn < hoogendoorn.eelco@gmail.com> wrote:
On Wed, Sep 3, 2014 at 6:46 PM, Jaime Fernández del Río < jaime.frio@gmail.com> wrote:
On Wed, Sep 3, 2014 at 9:33 AM, Jaime Fernández del Río < jaime.frio@gmail.com> wrote:
> On Wed, Sep 3, 2014 at 6:41 AM, Eelco Hoogendoorn < > hoogendoorn.eelco@gmail.com> wrote: > >> Not sure about the hashing. Indeed one can also build an index of >> a set by means of a hash table, but its questionable if this leads to >> improved performance over performing an argsort. Hashing may have better >> asymptotic time complexity in theory, but many datasets used in practice >> are very easy to sort (O(N)ish), and the timeconstant of hashing is >> higher. But more importantly, using a hashtable guarantees poor cache >> behavior for many operations using this index. By contrast, sorting may >> (but need not) make one random access pass to build the index, and may (but >> need not) perform one random access to reorder values for grouping. But >> insofar as the keys are better behaved than pure random, this coherence >> will be exploited. >> > > If you want to give it a try, these branch of my numpy fork has hash > table based implementations of unique (with no extra indices) and in1d: > > https://github.com/jaimefrio/numpy/tree/hashunique > > A use cases where the hash table is clearly better: > > In [1]: import numpy as np > In [2]: from numpy.lib._compiled_base import _unique, _in1d > > In [3]: a = np.random.randint(10, size=(10000,)) > In [4]: %timeit np.unique(a) > 1000 loops, best of 3: 258 us per loop > In [5]: %timeit _unique(a) > 10000 loops, best of 3: 143 us per loop > In [6]: %timeit np.sort(_unique(a)) > 10000 loops, best of 3: 149 us per loop > > It typically performs between 1.5x and 4x faster than sorting. I > haven't profiled it properly to know, but there may be quite a bit of > performance to dig out: have type specific comparison functions, optimize > the starting hash table size based on the size of the array to avoid > reinsertions... > > If getting the elements sorted is a necessity, and the array > contains very few or no repeated items, then the hash table approach may > even perform worse,: > > In [8]: a = np.random.randint(10000, size=(5000,)) > In [9]: %timeit np.unique(a) > 1000 loops, best of 3: 277 us per loop > In [10]: %timeit np.sort(_unique(a)) > 1000 loops, best of 3: 320 us per loop > > But the hash table still wins in extracting the unique items only: > > In [11]: %timeit _unique(a) > 10000 loops, best of 3: 187 us per loop > > Where the hash table shines is in more elaborate situations. If you > keep the first index where it was found, and the number of repeats, in the > hash table, you can get return_index and return_counts almost for free, > which means you are performing an extra 3x faster than with sorting. > return_inverse requires a little more trickery, so I won;t attempt to > quantify the improvement. But I wouldn't be surprised if, after fine tuning > it, there is close to an order of magnitude overall improvement > > The sppedup for in1d is also nice: > > In [16]: a = np.random.randint(1000, size=(1000,)) > In [17]: b = np.random.randint(1000, size=(500,)) > In [18]: %timeit np.in1d(a, b) > 1000 loops, best of 3: 178 us per loop > In [19]: %timeit _in1d(a, b) > 10000 loops, best of 3: 30.1 us per loop > > Of course, there is no point in >
Ooops!!! Hit the send button too quick. Not to extend myself too long: if we are going to rethink all of this, we should approach it with an open mind. Still, and this post is not helping with that either, I am afraid that we are discussing implementation details, but are missing a broader vision of what we want to accomplish and why. That vision of what numpy's grouping functionality, if any, should be, and how it complements or conflicts with what pandas is providing, should precede anything else. I now I haven't, but has anyone looked at how pandas implements grouping? Their documentation on the subject is well worth a read:
http://pandas.pydata.org/pandasdocs/stable/groupby.html
Does numpy need to replicate this? What/why/how can we add to that?
Jaime
 (__/) ( O.o) ( > <) Este es Conejo. Copia a Conejo en tu firma y ayúdale en sus planes de dominación mundial.
NumPyDiscussion mailing list NumPyDiscussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpydiscussion
I would certainly not be opposed to having a hashing based indexing
mechanism; I think it would make sense designwise to have a HashIndex class with the same interface as the rest, and use that subclass in those arraysetops where it makes sense. The 'how to' of indexing and its applications are largely orthogonal I think (with some tiny performance compromises which are worth the abstraction imo). For datasets which are not purely random, have many unique items, and which do not fit into cache, I would expect sorting to come out on top, but indeed it depends on the dataset.
Yeah, the question how pandas does grouping, and whether we can do better, is a relevant one.
From what I understand, pandas relies on cython extensions to get vectorized grouping functionality. This is no longer necessary since the introduction of ufuncs in numpy. I don't know how the implementations compare in terms of performance, but I doubt the difference is huge.
I personally use grouping a lot in my code, and I don't like having to use pandas for it. Most importantly, I don't want to go around creating a dataframe for a single oneline hitandrun association between keys and values. The permanent association of different types of data and their metadata which pandas offers is I think the key difference from numpy, which is all about manipulating just plain ndarrays. Arguably, grouping itself is a pretty elementary manipulating of ndarrays, and adding calls to DataFrame or Series inbetween a statement that could just be simply group_by(keys).mean(values) feels wrong to me. As does including pandas as a dependency just to use this small piece of functionality. Grouping is a more general functionality than any particular method of organizing your data.
In terms of features, adding transformations and filtering might be nice too; I hadn't thought about it, but that is because unlike the currently implemented features, the need has never arose for me. Im only a small sample size, and I don't see any fundamental objection to adding such functionality though. It certainly raises the question as to where to draw the line with pandas; but my rule of thumb is that if you can think of it as an elementary operation on ndarrays, then it probably belongs in numpy.
Oh I forgot to add: with an indexing mechanism based on sorting, unique values and counts also come 'for free', not counting the O(N) cost of actually creating those arrays. The only time an operating relying on an index incurs another nontrivial amount of overhead is in case its 'rank' or 'inverse' property is used, which invokes another argsort. But for the vast majority of grouping or set operations, these properties are never used.
That extra argsort is now gone from master:
https://github.com/numpy/numpy/pull/5012
Even with this improvement, returning any index typically makes `np.unique` run at least 2x slower:
In [1]: import numpy as np In [2]: a = np.random.randint(100, size=(1000,)) In [3]: %timeit np.unique(a) 10000 loops, best of 3: 37.3 us per loop In [4]: %timeit np.unique(a, return_inverse=True) 10000 loops, best of 3: 62.1 us per loop In [5]: %timeit np.unique(a, return_index=True) 10000 loops, best of 3: 72.8 us per loop In [6]: %timeit np.unique(a, return_counts=True) 10000 loops, best of 3: 56.4 us per loop In [7]: %timeit np.unique(a, return_index=True, return_inverse=True, return_coun ts=True) 10000 loops, best of 3: 112 us per loop
 (__/) ( O.o) ( > <) Este es Conejo. Copia a Conejo en tu firma y ayúdale en sus planes de dominación mundial.
NumPyDiscussion mailing list NumPyDiscussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpydiscussion
Yeah, I looked at the numpy implementation, and it seems these speed differences are simply the result of the extra O(N) costs involved, so my implementation would have the same characteristics. If another array copy or two has meaningful impact on performance, then you are done as far as optimization within numpy is concerned, id say. You could fuse those loops on the C level, but as you said I think its better to think about these kind of optimizations once we have a more complete picture of the functionality we want.
Good call on removing the extra argsort, that hadn't occurred to me.
NumPyDiscussion mailing list NumPyDiscussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpydiscussion
Naturally, youd want to avoid redoing the indexing where you can, which is another good reason to factor out the indexing mechanisms into separate classes. A factor two performance difference does not get me too excited; again, I think it would be the other way around for an outofcache dataset being grouped. But this by itself is ofcourse another argument for factoring out the indexing behind a uniform interface, so we can play around with those implementation details later, and specialize the indexing to serve different scenarios. Also, it really helps with code maintainability; most arraysetops are almost trivial to implement once you have abstracted away the indexing machinery.
On Thu, Sep 4, 2014 at 8:36 PM, Jeff Reback jeffreback@gmail.com wrote:
FYI pandas DOES use a very performant hash table impl for unique (and value_counts). Sorted state IS maintained by underlying Index implmentation. https://github.com/pydata/pandas/blob/master/pandas/hashtable.pyx
In [8]: a = np.random.randint(10, size=(10000,))
In [9]: %timeit np.unique(a) 1000 loops, best of 3: 284 µs per loop
In [10]: %timeit Series(a).unique() 10000 loops, best of 3: 161 µs per loop
In [11]: s = Series(a)
# without the creation overhead In [12]: %timeit s.unique() 10000 loops, best of 3: 75.3 µs per loop
On Thu, Sep 4, 2014 at 2:29 PM, Eelco Hoogendoorn < hoogendoorn.eelco@gmail.com> wrote:
On Thu, Sep 4, 2014 at 8:14 PM, Eelco Hoogendoorn < hoogendoorn.eelco@gmail.com> wrote:
I should clarify: I am speaking about my implementation, I havnt looked at the numpy implementation for a while so im not sure what it is up to. Note that by 'almost free', we are still talking about three passes over the whole array plus temp allocations, but I am assuming a usecase where the various sorts involved are the dominant cost, which I imagine they are, for outofcache sorts. Perhaps this isn't too realistic an assumption about the average use case though, I don't know. Though I suppose its a reasonable guideline to assume that either the dataset is big, or performance isn't that big a concern in the first place.
On Thu, Sep 4, 2014 at 7:55 PM, Jaime Fernández del Río < jaime.frio@gmail.com> wrote:
On Thu, Sep 4, 2014 at 10:39 AM, Eelco Hoogendoorn < hoogendoorn.eelco@gmail.com> wrote:
On Thu, Sep 4, 2014 at 10:31 AM, Eelco Hoogendoorn < hoogendoorn.eelco@gmail.com> wrote:
On Wed, Sep 3, 2014 at 6:46 PM, Jaime Fernández del Río < jaime.frio@gmail.com> wrote:
> On Wed, Sep 3, 2014 at 9:33 AM, Jaime Fernández del Río < > jaime.frio@gmail.com> wrote: > >> On Wed, Sep 3, 2014 at 6:41 AM, Eelco Hoogendoorn < >> hoogendoorn.eelco@gmail.com> wrote: >> >>> Not sure about the hashing. Indeed one can also build an index of >>> a set by means of a hash table, but its questionable if this leads to >>> improved performance over performing an argsort. Hashing may have better >>> asymptotic time complexity in theory, but many datasets used in practice >>> are very easy to sort (O(N)ish), and the timeconstant of hashing is >>> higher. But more importantly, using a hashtable guarantees poor cache >>> behavior for many operations using this index. By contrast, sorting may >>> (but need not) make one random access pass to build the index, and may (but >>> need not) perform one random access to reorder values for grouping. But >>> insofar as the keys are better behaved than pure random, this coherence >>> will be exploited. >>> >> >> If you want to give it a try, these branch of my numpy fork has >> hash table based implementations of unique (with no extra indices) and in1d: >> >> https://github.com/jaimefrio/numpy/tree/hashunique >> >> A use cases where the hash table is clearly better: >> >> In [1]: import numpy as np >> In [2]: from numpy.lib._compiled_base import _unique, _in1d >> >> In [3]: a = np.random.randint(10, size=(10000,)) >> In [4]: %timeit np.unique(a) >> 1000 loops, best of 3: 258 us per loop >> In [5]: %timeit _unique(a) >> 10000 loops, best of 3: 143 us per loop >> In [6]: %timeit np.sort(_unique(a)) >> 10000 loops, best of 3: 149 us per loop >> >> It typically performs between 1.5x and 4x faster than sorting. I >> haven't profiled it properly to know, but there may be quite a bit of >> performance to dig out: have type specific comparison functions, optimize >> the starting hash table size based on the size of the array to avoid >> reinsertions... >> >> If getting the elements sorted is a necessity, and the array >> contains very few or no repeated items, then the hash table approach may >> even perform worse,: >> >> In [8]: a = np.random.randint(10000, size=(5000,)) >> In [9]: %timeit np.unique(a) >> 1000 loops, best of 3: 277 us per loop >> In [10]: %timeit np.sort(_unique(a)) >> 1000 loops, best of 3: 320 us per loop >> >> But the hash table still wins in extracting the unique items only: >> >> In [11]: %timeit _unique(a) >> 10000 loops, best of 3: 187 us per loop >> >> Where the hash table shines is in more elaborate situations. If you >> keep the first index where it was found, and the number of repeats, in the >> hash table, you can get return_index and return_counts almost for free, >> which means you are performing an extra 3x faster than with sorting. >> return_inverse requires a little more trickery, so I won;t attempt to >> quantify the improvement. But I wouldn't be surprised if, after fine tuning >> it, there is close to an order of magnitude overall improvement >> >> The sppedup for in1d is also nice: >> >> In [16]: a = np.random.randint(1000, size=(1000,)) >> In [17]: b = np.random.randint(1000, size=(500,)) >> In [18]: %timeit np.in1d(a, b) >> 1000 loops, best of 3: 178 us per loop >> In [19]: %timeit _in1d(a, b) >> 10000 loops, best of 3: 30.1 us per loop >> >> Of course, there is no point in >> > > Ooops!!! Hit the send button too quick. Not to extend myself too > long: if we are going to rethink all of this, we should approach it with an > open mind. Still, and this post is not helping with that either, I am > afraid that we are discussing implementation details, but are missing a > broader vision of what we want to accomplish and why. That vision of what > numpy's grouping functionality, if any, should be, and how it complements > or conflicts with what pandas is providing, should precede anything else. I > now I haven't, but has anyone looked at how pandas implements grouping? > Their documentation on the subject is well worth a read: > > http://pandas.pydata.org/pandasdocs/stable/groupby.html > > Does numpy need to replicate this? What/why/how can we add to that? > > Jaime > >  > (__/) > ( O.o) > ( > <) Este es Conejo. Copia a Conejo en tu firma y ayúdale en sus > planes de dominación mundial. > > _______________________________________________ > NumPyDiscussion mailing list > NumPyDiscussion@scipy.org > http://mail.scipy.org/mailman/listinfo/numpydiscussion > > I would certainly not be opposed to having a hashing based indexing mechanism; I think it would make sense designwise to have a HashIndex class with the same interface as the rest, and use that subclass in those arraysetops where it makes sense. The 'how to' of indexing and its applications are largely orthogonal I think (with some tiny performance compromises which are worth the abstraction imo). For datasets which are not purely random, have many unique items, and which do not fit into cache, I would expect sorting to come out on top, but indeed it depends on the dataset.
Yeah, the question how pandas does grouping, and whether we can do better, is a relevant one.
From what I understand, pandas relies on cython extensions to get vectorized grouping functionality. This is no longer necessary since the introduction of ufuncs in numpy. I don't know how the implementations compare in terms of performance, but I doubt the difference is huge.
I personally use grouping a lot in my code, and I don't like having to use pandas for it. Most importantly, I don't want to go around creating a dataframe for a single oneline hitandrun association between keys and values. The permanent association of different types of data and their metadata which pandas offers is I think the key difference from numpy, which is all about manipulating just plain ndarrays. Arguably, grouping itself is a pretty elementary manipulating of ndarrays, and adding calls to DataFrame or Series inbetween a statement that could just be simply group_by(keys).mean(values) feels wrong to me. As does including pandas as a dependency just to use this small piece of functionality. Grouping is a more general functionality than any particular method of organizing your data.
In terms of features, adding transformations and filtering might be nice too; I hadn't thought about it, but that is because unlike the currently implemented features, the need has never arose for me. Im only a small sample size, and I don't see any fundamental objection to adding such functionality though. It certainly raises the question as to where to draw the line with pandas; but my rule of thumb is that if you can think of it as an elementary operation on ndarrays, then it probably belongs in numpy.
Oh I forgot to add: with an indexing mechanism based on sorting, unique values and counts also come 'for free', not counting the O(N) cost of actually creating those arrays. The only time an operating relying on an index incurs another nontrivial amount of overhead is in case its 'rank' or 'inverse' property is used, which invokes another argsort. But for the vast majority of grouping or set operations, these properties are never used.
That extra argsort is now gone from master:
https://github.com/numpy/numpy/pull/5012
Even with this improvement, returning any index typically makes `np.unique` run at least 2x slower:
In [1]: import numpy as np In [2]: a = np.random.randint(100, size=(1000,)) In [3]: %timeit np.unique(a) 10000 loops, best of 3: 37.3 us per loop In [4]: %timeit np.unique(a, return_inverse=True) 10000 loops, best of 3: 62.1 us per loop In [5]: %timeit np.unique(a, return_index=True) 10000 loops, best of 3: 72.8 us per loop In [6]: %timeit np.unique(a, return_counts=True) 10000 loops, best of 3: 56.4 us per loop In [7]: %timeit np.unique(a, return_index=True, return_inverse=True, return_coun ts=True) 10000 loops, best of 3: 112 us per loop
 (__/) ( O.o) ( > <) Este es Conejo. Copia a Conejo en tu firma y ayúdale en sus planes de dominación mundial.
NumPyDiscussion mailing list NumPyDiscussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpydiscussion
Yeah, I looked at the numpy implementation, and it seems these speed differences are simply the result of the extra O(N) costs involved, so my implementation would have the same characteristics. If another array copy or two has meaningful impact on performance, then you are done as far as optimization within numpy is concerned, id say. You could fuse those loops on the C level, but as you said I think its better to think about these kind of optimizations once we have a more complete picture of the functionality we want.
Good call on removing the extra argsort, that hadn't occurred to me.
NumPyDiscussion mailing list NumPyDiscussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpydiscussion
NumPyDiscussion mailing list NumPyDiscussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpydiscussion
Re: [Numpydiscussion] Does a `mergesorted` function make sense?
On 02/09/2014 8:40 PM, Charles R Harris wrote:
On Mon, Sep 1, 2014 at 7:58 AM, Eelco Hoogendoorn <hoogendoorn.eelco@gmail.com mailto:hoogendoorn.eelco@gmail.com> wrote:
On Mon, Sep 1, 2014 at 2:05 PM, Charles R Harris <charlesr.harris@gmail.com <mailto:charlesr.harris@gmail.com>> wrote: On Mon, Sep 1, 2014 at 1:49 AM, Eelco Hoogendoorn <hoogendoorn.eelco@gmail.com <mailto:hoogendoorn.eelco@gmail.com>> wrote: Sure, id like to do the hashing things out, but I would also like some preliminary feedback as to whether this is going in a direction anyone else sees the point of, if it conflicts with other plans, and indeed if we can agree that numpy is the right place for it; a point which I would very much like to defend. If there is some obvious nogo that im missing, I can do without the drudgery of writing proper documentation ;).
These are good issues, that need to be discussed and resolved. Python has the benefit of having a BDFL. Numpy has no similar arrangement. In the postnumarray period, Travis Oliphant took that role and advanced the package in many ways.
It seems that Travis is no longer available. There is a need for someone, or maybe some group of three, who would be guided by the sort of thinking Charles Harris sets out above, who would make the final decision.
Colin W.
On Wed, Sep 3, 2014 at 5:26 AM, cjw cjw@ncf.ca wrote:
These are good issues, that need to be discussed and resolved. Python has the benefit of having a BDFL. Numpy has no similar arrangement. In the postnumarray period, Travis Oliphant took that role and advanced the package in many ways.
It seems that Travis is no longer available. There is a need for someone, or maybe some group of three, who would be guided by the sort of thinking Charles Harris sets out above, who would make the final decision.
We should crown Charles philosopher king, and let him wisely rule over us with the aid of his aristocracy of devs with merge rights. We would make Plato proud! :)
Jaime
On Wed, Sep 3, 2014 at 10:53 AM, Jaime Fernández del Río < jaime.frio@gmail.com> wrote:
On Wed, Sep 3, 2014 at 5:26 AM, cjw cjw@ncf.ca wrote:
These are good issues, that need to be discussed and resolved. Python has the benefit of having a BDFL. Numpy has no similar arrangement. In the postnumarray period, Travis Oliphant took that role and advanced the package in many ways.
It seems that Travis is no longer available. There is a need for someone, or maybe some group of three, who would be guided by the sort of thinking Charles Harris sets out above, who would make the final decision.
We should crown Charles philosopher king, and let him wisely rule over us with the aid of his aristocracy of devs with merge rights. We would make Plato proud! :)
Charles I needs an heir in case of execution by the Parliamentary forces.
Chuck
On Mon, Sep 1, 2014 at 8:49 AM, Eelco Hoogendoorn hoogendoorn.eelco@gmail.com wrote:
Sure, id like to do the hashing things out, but I would also like some preliminary feedback as to whether this is going in a direction anyone else sees the point of, if it conflicts with other plans, and indeed if we can agree that numpy is the right place for it; a point which I would very much like to defend. If there is some obvious nogo that im missing, I can do without the drudgery of writing proper documentation ;).
As for whether this belongs in numpy: yes, I would say so. There are the extension of functionality to functions already in numpy, which are a nobrainer (it need not cost anything performance wise, and ive needed unique graph edges many many times), and there is the grouping functionality, which is the main novelty.
However, note that the grouping functionality itself is a very small addition, just a few 100 lines of pure python, given that the indexing logic has been factored out of the classic arraysetops. At least from a developers perspective, it very much feels like a logical extension of the same 'thing'.
My 2 cents: I definitely agree that this is very useful fundamental functionality, and it would be great if numpy had a solution for it out of the box. My main concern is that this is a fairly complicated set of functionality and there are a lot of small decisions to be made in setting up the API for it. IME it's very hard to just read through an API like this and reason out the best way to do it by pure logic; usually it needs to get banged on for a bit in real uses before it becomes clear what the right set of tradeoffs is. And numpy itself is not a great environment these kinds of iterations. So, IMO the main challenge is: how do we get the functionality into a state where we can convince ourselves that it'll be supportable in numpy indefinitely, and not need to be replaced in a year or two?
Some things that might help with this convincing:  releasing it as a small standalone package on pypi and getting some real users to bang on it  any real code written against the APIs  feedback from the pandas community since they've spent a lot of time working on these issues  ...?
n
On 27 August 2014 19:02, Jaime Fernández del Río jaime.frio@gmail.com wrote:
Since there is at least one other person out there that likes it, is there any more interest in such a function? If yes, any comments on what the proper interface for extra output should be? Although perhaps the best is to leave that out for starters and see what use people make of it, if any.
I think a perhaps more useful thing would be to implement timsort. I understand it is capable of take full advantage of the partially sorted arrays, with the extra safety of not making the assumption that the individual arrays are sorted. This will also be useful for other real world cases where the data is already partially sorted.
participants (7)

Charles R Harris

cjw

Daπid

Eelco Hoogendoorn

Jaime Fernández del Río

Jeff Reback

Nathaniel Smith