Hi All,

I've been contemplating new functions that could be added to numpy and thought I'd run them by folks to see if there is any interest.

1) Modified sort/argsort functions that return the maximum k values. This is easy to do with heapsort and almost as easy with mergesort.

2) Ufunc fadd (nanadd?) that treats nan as zero in addition. Should make a faster version of nansum possible.

3) Fast medians.

Other suggestions are welcome. Most of these are of the low hanging fruit variety and shouldn't be too much work.

Chuck
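As a rough illustration of the top-k idea (not Chuck's implementation), here is a minimal sketch using np.argpartition, which NumPy only grew later; the helper name top_k is purely illustrative:

import numpy as np

def top_k(a, k):
    # Illustrative only: return the k largest values of a 1-d array, largest first.
    # np.argpartition does an O(n) partial selection; only the k winners get sorted.
    a = np.asarray(a)
    idx = np.argpartition(a, -k)[-k:]        # indices of the k largest, in no particular order
    return a[idx[np.argsort(a[idx])[::-1]]]  # sort just those k, largest first

top_k(np.array([3, 1, 7, 5, 9, 2]), 3)       # -> array([9, 7, 5])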
On Tue, May 31, 2011 at 20:08, Charles R Harris wrote:
+3 -- Robert Kern "I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth." -- Umberto Eco
On Tue, May 31, 2011 at 8:08 PM, Charles R Harris wrote:
While you're at it, how about a function that finds both the max and min in one pass? (Mentioned previously in this thread: http://mail.scipy.org/pipermail/numpy-discussion/2010-June/051072.html)
On Tue, May 31, 2011 at 7:18 PM, Warren Weckesser <warren.weckesser@enthought.com> wrote:
What should it be called? minmax? This also suggests a function that returns both mean and standard deviation, something that could also be implemented using a more stable algorithm than the current one.
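For illustration, a pure-Python sketch of what a one-pass minmax would do (a real version would be a C loop over the array buffer); minmax is not an existing NumPy function:

import numpy as np

def minmax(a):
    # Illustrative only: one sweep instead of separate min() and max() passes.
    a = np.asarray(a).ravel()
    lo = hi = a[0]                # assumes a is non-empty
    for v in a[1:]:
        if v < lo:
            lo = v
        elif v > hi:
            hi = v
    return lo, hi

minmax(np.array([3, 1, 7, 5, 9, 2]))   # -> (1, 9)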
On Tue, May 31, 2011 at 8:26 PM, Charles R Harris wrote:

> What should it be called? minmax?

That's what it is called in IDL. Don't know of any other tool that does anything similar.

> This also suggests a function that returns both mean and standard deviation, something that could also be implemented using a more stable algorithm than the current one.

Yeah, it does. An implementation I did once for a project of mine in Matlab was to have a function that took an index indicating that I wanted first/second/third/etc. order moments, and it used that to figure out that it needed to save the sum, sum of squares, sum of cubes, etc. in order to calculate the mean, std, kurtosis, and so on. Just a suggestion on how to implement a useful function for stats. Of course, there are all sorts of issues with summation here, but my arrays were small enough not to worry about that in that project.
Ben Root
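A sketch of the "more stable algorithm" alluded to above: Welford's one-pass update for mean and variance, written in plain Python for clarity (a real version would be a compiled loop). The names here are illustrative, not an existing API:

import numpy as np

def mean_std_onepass(x):
    # Welford's method: numerically stable single-pass mean/variance.
    mean = 0.0
    m2 = 0.0
    n = 0
    for v in np.asarray(x, dtype=float).ravel():   # assumes x is non-empty
        n += 1
        delta = v - mean
        mean += delta / n
        m2 += delta * (v - mean)
    return mean, np.sqrt(m2 / n)   # population std, matching np.std's default

mean_std_onepass(np.zeros(4000, dtype='f4') + 500)   # -> (500.0, 0.0)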
On Tue, May 31, 2011 at 8:18 PM, Warren Weckesser <warren.weckesser@enthought.com> wrote:
+1 from myself and probably just about anybody in matplotlib. If both the maxs and mins are searched during the same run through an array, I would imagine that would result in a noticeable speedup with automatic range finding. Ben Root
On Tue, May 31, 2011 at 9:31 PM, Benjamin Root wrote:
I don't know if it's one pass off the top of my head, but I've used percentile for interpercentile ranges.

In [1]: X = np.random.random(1000)

In [2]: np.percentile(X,[0,100])
Out[2]: [0.00016535235312509222, 0.99961513543316571]

In [3]: X.min(), X.max()
Out[3]: (0.00016535235312509222, 0.99961513543316571)

Skipper
On Tue, May 31, 2011 at 8:36 PM, Skipper Seabold wrote:
percentile() isn't one pass; using percentile like that is much slower:

In [25]: %timeit np.percentile(X,[0,100])
10000 loops, best of 3: 103 us per loop

In [26]: %timeit X.min(), X.max()
100000 loops, best of 3: 11.8 us per loop

Warren
On Tue, May 31, 2011 at 9:53 PM, Warren Weckesser wrote:
Probably should've checked that before opening my mouth. Never actually used it for a minmax, but it is faster than two calls to scipy.stats.scoreatpercentile. Guess I'm +1 to fast order statistics. Skipper
On Tue, May 31, 2011 at 8:00 PM, Skipper Seabold wrote:
So far the biggest interest seems to be in order statistics of various sorts, so to speak.

*Order Statistics*
minmax
median
k'th element
largest/smallest k elements

*Other Statistics*
mean/std

*Nan functions*
nanadd

Chuck
On Tue, May 31, 2011 at 9:26 PM, Charles R Harris wrote:
How about including all or some of Keith's Bottleneck package? He has tried to include some of the discussed functions and tried to make them very fast. Also, the Wikipedia article "Algorithms for calculating variance" (http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance) has some basic info on calculating the variance as well as higher-order moments. However, there are probably more efficient algorithms available. Bruce
On Tue, May 31, 2011 at 8:50 PM, Bruce Southey wrote:
I don't think they are sufficiently general, as they are limited to 2 dimensions. However, I think the moving filters should go into scipy, either in ndimage or maybe signals. Some of the others we can still speed up significantly, for instance nanmedian, by using the new functionality in numpy, i.e., numpy sort has worked with nans for a while now. It looks like call overhead dominates the nanmax times for small arrays, and this might improve if the ufunc machinery is cleaned up a bit more; I don't know how far Mark got with that. One bit of infrastructure that could be generally helpful is low-level support for masked arrays, but that is a larger topic. Chuck
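A rough sketch of the nanmedian speedup being described, relying on the fact that np.sort places nans at the end of a float array; the function name is illustrative, not an existing numpy API:

import numpy as np

def nanmedian_1d(a):
    # Illustrative only: median ignoring nans, using a single sort.
    s = np.sort(np.asarray(a, dtype=float))      # nans sort to the end
    n = s.size - np.count_nonzero(np.isnan(s))   # number of valid entries
    if n == 0:
        return np.nan
    mid = n // 2
    return s[mid] if n % 2 else 0.5 * (s[mid - 1] + s[mid])

nanmedian_1d([1.0, np.nan, 3.0, 2.0])   # -> 2.0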
On Tue, May 31, 2011 at 8:41 PM, Charles R Harris wrote:
Currently Bottleneck accelerates 1d, 2d, and 3d input. Anything else falls back to a slower, non-cython version of the function. The same goes for int32, int64, float32, float64.

It should not be difficult to extend to higher nd and more dtypes since everything is generated from template. The problem is that there would be a LOT of cython auto-generated C code since there is a separate function for each ndim, dtype, axis combination.

Each of the ndim, dtype, axis functions currently has its own copy of the algorithm (such as median). Pulling that out and reusing it should save a lot of trees by reducing the auto-generated C code size.

I recently added a partsort and argpartsort.
I'd love to see something like a "count_unique" function included. The numpy.unique function is handy, but it can be a little awkward to efficiently go back and get counts of each unique value after the fact.
-Mark
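One way such a count_unique could be written on top of a sort, shown here only as a sketch (later NumPy versions added np.unique(a, return_counts=True), which covers this directly); the name is illustrative:

import numpy as np

def count_unique(a):
    # Illustrative only: unique values and how many times each occurs.
    a = np.sort(np.asarray(a).ravel())           # assumes a is non-empty
    starts = np.flatnonzero(np.concatenate(([True], a[1:] != a[:-1])))
    counts = np.diff(np.concatenate((starts, [a.size])))
    return a[starts], counts

count_unique([1, 500, 1000, 1, 1])   # -> (array([   1,  500, 1000]), array([3, 1, 1]))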
Short-circuiting find would be nice. Right now, to 'find' something you first make a bool array, then iterate over it. If all you want is the first index where x[i] == e, that's not very efficient. What I just described is a find with an '==' predicate; not sure if it's worthwhile to consider other predicates. Maybe call it 'find_first'.
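A toy sketch of the behaviour being asked for (find_first is just the name suggested above, not an existing function); the point is that a C implementation could stop at the first hit instead of building a full bool array:

import numpy as np

def find_first(x, e):
    # Illustrative only: index of the first element equal to e, or -1 if absent.
    for i, v in enumerate(np.asarray(x).ravel()):
        if v == e:
            return i            # short-circuit here
    return -1

find_first(np.array([4, 2, 7, 2]), 2)   # -> 1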
On Wed, Jun 1, 2011 at 11:31 AM, Mark Miller wrote:
Does bincount do what you're after?

In [1]: x = np.random.randint(1,6,5)

In [2]: x
Out[2]: array([1, 3, 4, 5, 1])

In [3]: np.bincount(x)
Out[3]: array([0, 2, 0, 1, 1, 1])

Skipper
Not quite. Bincount is fine if you have a set of approximately sequential numbers. But if you don't....
>>> a = numpy.array((1,500,1000))
>>> a
array([   1,  500, 1000])
>>> b = numpy.bincount(a)
>>> b
array([0, 1, 0, ..., 0, 0, 1])
>>> len(b)
1001
-Mark
On Thu, Jun 2, 2011 at 1:49 AM, Mark Miller wrote:
Even worse, it fails miserably if you have sequential numbers but with a high shift:

np.bincount([100000001, 100000002])  # will take a lot of memory

Doing bincount with a dict is faster in those cases.

cheers,
David
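A minimal sketch of the dict-based counting being described (memory scales with the number of distinct values rather than with their magnitude); nothing here is an existing numpy API:

from collections import defaultdict

def bincount_dict(values):
    # Illustrative only: counts per value, independent of how large the values are.
    counts = defaultdict(int)
    for v in values:
        counts[v] += 1
    return dict(counts)

bincount_dict([100000001, 100000002, 100000001])   # -> {100000001: 2, 100000002: 1}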
Since this discussion has turned to shortcomings of bincount, may I ask why np.bincount([]) is not an empty array? Even more puzzling, why is np.bincount([], minlength=6) not a 6-array of zeros? Use case: bincount of infected individuals by number of contacts. (In some periods there may be no infections.) Thank you, Alan Isaac PS A collections.Counter works pretty nicely for Mark and David's cases, aside from the fact that the keys are not sorted.
On Wed, Jun 1, 2011 at 10:10 PM, Alan G Isaac wrote:
Just looks like it wasn't coded that way, but it's low-hanging fruit. Any objections to adding this behavior? This commit should take care of it. Tests pass. Comments welcome, as I'm just getting my feet wet here. https://github.com/jseabold/numpy/commit/133148880bba5fa3a11dfbb95cefb3da4f7... Skipper
On Thu, Jun 2, 2011 at 12:08, Skipper Seabold wrote:
I would use np.zeros(5, dtype=int) in test_empty_with_minlength(), but otherwise, it looks good. -- Robert Kern "I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth." -- Umberto Eco
On Thu, Jun 2, 2011 at 1:11 PM, Robert Kern wrote:
Ok, thanks. Made the change and removed the old test that asserted failure on empty input. Pull request: https://github.com/numpy/numpy/pull/84 Skipper
On Wed, Jun 1, 2011 at 9:35 PM, David Cournapeau wrote:
Same with negative numbers, but in these cases I just subtract the min and we are back to the fast bincount case.

cheers,
Josef
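A short sketch of the subtract-the-min trick for offset (or negative) integers; the array values are made up for illustration:

import numpy as np

x = np.array([100000001, 100000003, 100000001])
counts = np.bincount(x - x.min())   # shift so the smallest value lands in bin 0
# counts[k] is then the count of the value x.min() + k
# -> array([2, 0, 1])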
On Thu, Jun 2, 2011 at 5:42 PM, josef.pktd@gmail.com wrote:
Indeed, you can also deal with large numbers which are not consecutive by using a lookup table. All those methods are quite error-prone in general, and really, there is no reason why bincount could not handle the general case.

cheers,
David
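One reading of the lookup-table approach, sketched with np.unique(return_inverse=True) mapping the large values down to small codes before bincount; the data are made up for illustration:

import numpy as np

x = np.array([1, 500, 1000, 1, 500])
vals, inv = np.unique(x, return_inverse=True)   # inv holds small codes 0..len(vals)-1
counts = np.bincount(inv)                       # counts aligned with vals
# vals   -> array([   1,  500, 1000])
# counts -> array([2, 2, 1])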
On 06/01/2011 10:08 AM, Charles R Harris wrote:
+1 for fast median as well, and more generally fast "linear" (O(kN)) order statistics would be nice. cheers, David
On Tue, May 31, 2011 at 7:33 PM, David wrote:
OK, noob question. What are order statistics? Chuck
On 06/01/2011 10:34 AM, Charles R Harris wrote:
In statistics, order statistics are statistics based on sorted samples, median, min and max being the most common: http://en.wikipedia.org/wiki/Order_statistic Concretely here, I meant a fast way to compute any rank of a given data set, e.g. with the select algorithm. I have wanted to do that for some time, but never took the time for it. David
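For concreteness, a small Python sketch of the select (quickselect) idea: the k-th smallest element in expected O(n) time without a full sort. It copies sub-arrays at each step, so it only illustrates the algorithm, not an efficient implementation:

import numpy as np

def select(a, k):
    # Illustrative only: k-th smallest value (0-based) of a 1-d array.
    a = np.asarray(a).ravel()
    while True:
        pivot = a[len(a) // 2]
        less, equal, greater = a[a < pivot], a[a == pivot], a[a > pivot]
        if k < len(less):
            a = less
        elif k < len(less) + len(equal):
            return pivot
        else:
            k -= len(less) + len(equal)
            a = greater

select(np.array([7, 1, 5, 3, 9]), 2)   # -> 5, the median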
Would anyone object to fixing the numpy mean and std functions, so that they always used a 64-bit value to track sums, or so that they used a running calculation? That way

np.mean(np.zeros([4000,4000],'f4')+500)

would not equal 511.493408?
On Wed, Jun 1, 2011 at 10:44, Craig Yoshioka wrote:
Yes, I object. You can set the accumulator dtype explicitly if you need it: np.mean(arr, dtype=np.float64) -- Robert Kern "I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth." -- Umberto Eco
On 06/01/2011 11:01 AM, Robert Kern wrote:
Sure, but that fails to address that the output dtype of mean in this case is np.float64, which one would naively assume is also the dtype used for the accumulation:

>>> np.mean(np.zeros([4000,4000],'f4')+500).dtype
dtype('float64')

Thus, we have Tickets 465 and 518:
http://projects.scipy.org/numpy/ticket/465
http://projects.scipy.org/numpy/ticket/518

Various threads as well, such as:
http://mail.scipy.org/pipermail/numpy-discussion/2010-December/054253.html

Bruce
On Wed, Jun 1, 2011 at 11:11, Bruce Southey wrote:
Of course my statement fails to address that because that wasn't what I was responding to. -- Robert Kern "I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth." -- Umberto Eco
Yes, and it's probably slower to boot. A quick benchmark on my computer shows that:

a = np.zeros([4000,4000],'f4')+500
np.mean(a)                       takes 0.02 secs
np.mean(a, dtype=np.float64)     takes 0.1 secs
np.mean(a.astype(np.float64))    takes 0.06 secs

so casting the whole array is almost 40% faster than setting the accumulator type! I would imagine having the type-casting done on the fly during the mean computation would be even faster.
On 5/31/11 6:08 PM, Charles R Harris wrote:
> 2) Ufunc fadd (nanadd?) Treats nan as zero in addition.
so:

In [53]: a
Out[53]: array([ 1., 2., nan])

In [54]: b
Out[54]: array([0, 1, 2])

In [55]: a + b
Out[55]: array([ 1., 3., nan])

and nanadd(a,b) would yield:

array([ 1., 3., 2.])

I don't see how that is particularly useful, at least not any more useful than nanprod, nandiv, etc, etc...

What am I missing?

-Chris

--
Christopher Barker, Ph.D.
Oceanographer
Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959 voice
7600 Sand Point Way NE   (206) 526-6329 fax
Seattle, WA 98115        (206) 526-6317 main reception
Chris.Barker@noaa.gov
On Wed, Jun 1, 2011 at 12:30, Christopher Barker wrote:
It's mostly useful for its .reduce() and .accumulate() form. -- Robert Kern "I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth." -- Umberto Eco
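A small sketch of what that buys: with a nanadd ufunc, nanadd.reduce would be nansum and nanadd.accumulate a nan-ignoring cumulative sum. Shown here with equivalents written in existing NumPy (nanadd itself does not exist):

import numpy as np

a = np.array([1.0, np.nan, 2.0, np.nan, 4.0])

np.nansum(a)                            # 7.0, what nanadd.reduce(a) would return
np.where(np.isnan(a), 0.0, a).cumsum()  # array([1., 1., 3., 3., 7.]), like nanadd.accumulate(a)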
My favorite missing extension to numpy functions: np.bincount with 2 (or more) dimensional weights, for fast calculation of group statistics. Josef
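A sketch of the group-statistics use case, done column by column with today's 1-d weights (the wished-for feature would handle all columns of data in one call); the arrays are made up for illustration:

import numpy as np

groups = np.array([0, 1, 0, 2, 1])            # group label per observation
data = np.arange(15.0).reshape(5, 3)          # 5 observations, 3 variables
counts = np.bincount(groups)
sums = np.column_stack([np.bincount(groups, weights=data[:, j])
                        for j in range(data.shape[1])])
group_means = sums / counts[:, None]          # per-group mean of each column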
On May 31, 2011, at 8:08 PM, Charles R Harris wrote:
> 1) Modified sort/argsort functions that return the maximum k values. This is easy to do with heapsort and almost as easy with mergesort.
+1
> 2) Ufunc fadd (nanadd?) Treats nan as zero in addition. Should make a faster version of nansum possible.
+0 --- Some discussion at the data array summit led to the view that supporting nan-enabled dtypes (nanfloat64, nanfloat32, nanint64) might be a better approach for nan-handling. This needs more discussion, but useful to mention in this context.
> 3) Fast medians.
Definitely order-statistics would be the right thing to handle generally. -Travis
---
Travis Oliphant
Enthought, Inc.
oliphant@enthought.com
1-512-536-1057
http://www.enthought.com
On Wed, Jun 1, 2011 at 22:06, Travis Oliphant wrote:
Actually, these are completely orthogonal to Chuck's proposal. The NA-enabled dtypes (*not* NaN-enabled dtypes) would have (x + NA) == NA, just like R. fadd() would be useful for other things. -- Robert Kern "I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth." -- Umberto Eco
participants (16):
- Alan G Isaac
- Benjamin Root
- Bruce Southey
- Charles R Harris
- Christopher Barker
- Craig Yoshioka
- David
- David Cournapeau
- josef.pktd@gmail.com
- Keith Goodman
- Mark Miller
- Neal Becker
- Robert Kern
- Skipper Seabold
- Travis Oliphant
- Warren Weckesser