get best few of many: argsort( few= ) using std::partial_sort ?
Folks, to get the best few of a large number of objects, e.g. vectors near a given one, or small distances in spatial.distance.cdist or .pdist, argsort( bigArray )[: a few ] is not so hot. It would be nice if argsort( bigArray, few= ) did this -- faster, save mem too. Would anyone else find this useful ? I recently stumbled across partial_sort in stl; fwiw, std:partial_sort( A, A + sqrt(N), A + N ) is ~ 10 times faster than std:sort on my old mac ppc, even for N 100. Also fwiw, nth_element alone is ~ twice as slow as partial_sort -- odd. cheers -- denis
On Thu, Jul 8, 2010 at 10:17 AM, denis <denis-bz-gg@t-online.de> wrote:
Folks, to get the best few of a large number of objects, e.g. vectors near a given one, or small distances in spatial.distance.cdist or .pdist, argsort( bigArray )[: a few ] is not so hot. It would be nice if argsort( bigArray, few= ) did this -- faster, save mem too. Would anyone else find this useful ?
I recently stumbled across partial_sort in stl; fwiw, std:partial_sort( A, A + sqrt(N), A + N ) is ~ 10 times faster than std:sort on my old mac ppc, even for N 100. Also fwiw, nth_element alone is ~ twice as slow as partial_sort -- odd.
I think a lot of people would like a partial sort. It comes up on the list now and then. There's a ticket with a cython partial sort: http://projects.scipy.org/numpy/ticket/1213 Here's the docstring: def select(a, k, inplace=False): ''' Wirth's version of Hoare's quick select Parameters ---------- a : array_like k : integer inplace : boolean The partial sort is done inplace if a is a contiguous ndarray and inplace=True. Default: False. Returns ------- out : ndarray Partially sorted a such that out[k] is the k largest element. Elements smaller than out[k] are unsorted in out[:k]. Elements larger than out[k] are unsorted in out[k:]. '''
Just to complicate the issue somewhat, it might be valuable to have both k-largest and k-smallest; also, one should think about what happens to NaNs and infinities. Further, it might be worth including a fast median calculation (i.e. find the median element(s) without completely sorting the array). Anne On 8 July 2010 13:30, Keith Goodman <kwgoodman@gmail.com> wrote:
On Thu, Jul 8, 2010 at 10:17 AM, denis <denis-bz-gg@t-online.de> wrote:
Folks, to get the best few of a large number of objects, e.g. vectors near a given one, or small distances in spatial.distance.cdist or .pdist, argsort( bigArray )[: a few ] is not so hot. It would be nice if argsort( bigArray, few= ) did this -- faster, save mem too. Would anyone else find this useful ?
I recently stumbled across partial_sort in stl; fwiw, std:partial_sort( A, A + sqrt(N), A + N ) is ~ 10 times faster than std:sort on my old mac ppc, even for N 100. Also fwiw, nth_element alone is ~ twice as slow as partial_sort -- odd.
I think a lot of people would like a partial sort. It comes up on the list now and then. There's a ticket with a cython partial sort:
http://projects.scipy.org/numpy/ticket/1213
Here's the docstring:
def select(a, k, inplace=False): ''' Wirth's version of Hoare's quick select
Parameters ---------- a : array_like k : integer inplace : boolean The partial sort is done inplace if a is a contiguous ndarray and inplace=True. Default: False.
Returns ------- out : ndarray Partially sorted a such that out[k] is the k largest element. Elements smaller than out[k] are unsorted in out[:k]. Elements larger than out[k] are unsorted in out[k:].
''' _______________________________________________ SciPy-User mailing list SciPy-User@scipy.org http://mail.scipy.org/mailman/listinfo/scipy-user
On Thu, Jul 8, 2010 at 11:14 AM, Anne Archibald <aarchiba@physics.mcgill.ca> wrote:
Just to complicate the issue somewhat, it might be valuable to have both k-largest and k-smallest; also, one should think about what happens to NaNs and infinities. Further, it might be worth including a fast median calculation (i.e. find the median element(s) without completely sorting the array).
I should have mentioned the name of the ticket: median in average O(n) time.
On Thu, Jul 8, 2010 at 12:19 PM, Keith Goodman <kwgoodman@gmail.com> wrote:
On Thu, Jul 8, 2010 at 11:14 AM, Anne Archibald <aarchiba@physics.mcgill.ca> wrote:
Just to complicate the issue somewhat, it might be valuable to have both k-largest and k-smallest; also, one should think about what happens to NaNs and infinities. Further, it might be worth including a fast median calculation (i.e. find the median element(s) without completely sorting the array).
I should have mentioned the name of the ticket: median in average O(n) time.
Yeah, I've got that. I was thinking of adding it as a generalized ufunc but haven't gotten around to it. Chuck
On Thu, Jul 8, 2010 at 11:29 AM, Charles R Harris <charlesr.harris@gmail.com> wrote:
On Thu, Jul 8, 2010 at 12:19 PM, Keith Goodman <kwgoodman@gmail.com> wrote:
On Thu, Jul 8, 2010 at 11:14 AM, Anne Archibald <aarchiba@physics.mcgill.ca> wrote:
Just to complicate the issue somewhat, it might be valuable to have both k-largest and k-smallest; also, one should think about what happens to NaNs and infinities. Further, it might be worth including a fast median calculation (i.e. find the median element(s) without completely sorting the array).
I should have mentioned the name of the ticket: median in average O(n) time.
Yeah, I've got that. I was thinking of adding it as a generalized ufunc but haven't gotten around to it.
You'll be a hero when you do it. There's got to be some ticker tape around here somewhere...
Folks, you've probably seen http://www.sgi.com/tech/stl/partial_sort.html -- sort uses the introsort algorithm and partial_sort uses heapsort introsort is usually faster by a factor of 2 to 5 Heapsort fans: the normal pivot selector (without medians) is blind; trying for a pivot near a small k/n percentile is fun. Any comments on argsort( few= ) vs a separate partial_sort() ? cheers -- denis
On Fri, Jul 9, 2010 at 8:33 AM, denis <denis-bz-gg@t-online.de> wrote:
Folks, you've probably seen http://www.sgi.com/tech/stl/partial_sort.html -- sort uses the introsort algorithm and partial_sort uses heapsort introsort is usually faster by a factor of 2 to 5
Sounds like partial_sort just sets up the heap and then pulls off the needed elements. That would make it about twice as fast as the normal heapsort for a small number of values, or about the same as a full quicksort.
Heapsort fans: the normal pivot selector (without medians) is blind; trying for a pivot near a small k/n percentile is fun.
You mean quicksort?
Any comments on argsort( few= ) vs a separate partial_sort() ?
Chuck
On Jul 9, 5:46 pm, Charles R Harris <charlesr.har...@gmail.com> wrote:
On Fri, Jul 9, 2010 at 8:33 AM, denis <denis-bz...@t-online.de> wrote:
Folks, you've probably seenhttp://www.sgi.com/tech/stl/partial_sort.html -- sort uses the introsort algorithm and partial_sort uses heapsort introsort is usually faster by a factor of 2 to 5
Sounds like partial_sort just sets up the heap and then pulls off the needed elements. That would make it about twice as fast as the normal heapsort for a small number of values, or about the same as a full quicksort.
Yes, at least in STLport _algo.c -- not so hot. What with factors of 2 from various *sort, arch, cache, inline operator< vs lessf() the possible gain of partial_sort is melting away ...
Heapsort fans: the normal pivot selector (without medians) is blind; trying for a pivot near a small k/n percentile is fun.
You mean quicksort? yes, my mistake
On 2010-07-08, at 1:17 PM, denis wrote:
argsort( bigArray )[: a few ] is not so hot. It would be nice if argsort( bigArray, few= ) did this -- faster, save mem too. Would anyone else find this useful ?
Quite likely. I've seen these sorts of algorithms written up elsewhere and they indeed have favourable time complexity, but I've never had sufficient need to go and implement them. I don't think we'd want to introduce C++ dependencies in NumPy though, one would probably want to modify the existing sort/argsort machinery. David
On Thu, Jul 8, 2010 at 10:17 AM, denis <denis-bz-gg@t-online.de> wrote:
Folks, to get the best few of a large number of objects, e.g. vectors near a given one, or small distances in spatial.distance.cdist or .pdist, argsort( bigArray )[: a few ] is not so hot. It would be nice if argsort( bigArray, few= ) did this -- faster, save mem too. Would anyone else find this useful ?
I recently stumbled across partial_sort in stl; fwiw, std:partial_sort( A, A + sqrt(N), A + N ) is ~ 10 times faster than std:sort on my old mac ppc, even for N 100. Also fwiw, nth_element alone is ~ twice as slow as partial_sort -- odd.
I recently added partial sorting to bottleneck (0.5.0dev): partsort() and argpartsort(). Make an array: >> import bottleneck as bn >> a = np.random.rand(1000000) Timing: >> timeit np.sort(a) 10 loops, best of 3: 114 ms per loop >> timeit bn.partsort(a, n=1000) 100 loops, best of 3: 3.95 ms per loop >> timeit np.argsort(a) 10 loops, best of 3: 161 ms per loop >> timeit bn.argpartsort(a, n=1000) 100 loops, best of 3: 11.4 ms per loop The speed up is more like a factor 2 to 5 for 2d input and n = a.shape[axis] / 2. You mentioned memory usage. I can reduce that from my current implementation by one copy of the input array if it is a problem. bn.partsort doc string: partsort(arr, n, axis=-1) Partial sorting of array elements along given axis. A partially sorted array is one in which the `n` smallest values appear (in any order) in the first `n` elements. The remaining largest elements are also unordered. Due to the algorithm used (Wirth's method), the nth smallest element is in its sorted position (at index `n-1`). Shuffling the input array may change the output. The only guarantee is that the first `n` elements will be the `n` smallest and the remaining element will appear in the remainder of the output. This functions is not protected against NaN. Therefore, you may get unexpected results if the input contains NaN. Parameters ---------- arr : array_like Input array. If `arr` is not an array, a conversion is attempted. n : int The `n` smallest elements will appear (unordered) in the first `n` elements of the output array. axis : {int, None}, optional Axis along which the partial sort is performed. The default (axis=-1) is to sort along the last axis. Returns ------- y : ndarray A partially sorted copy of the input array where the `n` smallest elements will appear (unordered) in the first `n` elements. See Also -------- bottleneck.argpartsort: Indices that would partially sort an array Notes ----- Unexpected results may occur if the input array contains NaN. Examples -------- Create a numpy array: >>> a = np.array([1, 0, 3, 4, 2]) Partially sort array so that the first 3 elements are the smallest 3 elements (note, as in this example, that the smallest 3 elements may not be sorted): >>> bn.partsort(a, n=3) array([1, 0, 2, 4, 3])
participants (5)
-
Anne Archibald -
Charles R Harris -
David Warde-Farley -
denis -
Keith Goodman