
cat numtst.py import Numeric
We have a speed problem with Numeric.sort on large arrays with only a few different values. Here is my example -- snip -- print Numeric.__version__ class timer: def __init__(self): import time self.start = time.time() def stop(self): import time print "%.3f" % (time.time() - self.start) from RandomArray import random from Numeric import sort, Int r=random((71400,)) t = timer() ; p=sort(r) ; t.stop() r=(random((71400,))*70000).astype(Int) t = timer() ; p=sort(r) ; t.stop() r=(random((71400,))*70).astype(Int) t = timer() ; p=sort(r) ; t.stop() r=(random((71400,))*7).astype(Int) t = timer() ; p=sort(r) ; t.stop() 16:27 hoel@seeve:hoel 2>python numtst.py 17.3.0 0.185 0.148 2.053 21.668 -- snip -- So the less different values are contained in the array the longer takes the sorting. Is this also the case with newer versions of Numeric (But this is Python 1.5.2)? Why is sorting of these arrays so slow? Thanks Berthold -- email: hoel@GermanLloyd.org ) tel. : +49 (40) 3 61 49 - 73 74 ( C[_] These opinions might be mine, but never those of my employer.

Hi Berthold, I tested your code on Win95 using Numeric 20.0.0 and got essentially the same pattern of times. So, upgrading your version of Numeric is not likely to help. Curious, I checked out the code for sort and found that it justs calls qsort from the C library. I suspect that the problem is related to quicksort having bad worst case behaviour. Not being a computer scientest, I can't tell you under what situations the bad behaviour is triggered, although I know it doesn't like presorted lists. Anyway, if you're using 1D arrays, one workaround would be to use list.sort. Python's sorting routines has, I understand, lots of gimmicks to avoid the problems that quicksort sometimes encounters. I tried it and this: r=(random((71400,))*7).astype(Int) l = r.tolist() l.sort() p = array(l) runs about 40 times faster than this: r=(random((71400,))*7).astype(Int) p = r.sort() If you don't need to convert to and from a array, this approach is 60 times faster. Even if you're dealing with a multidimensional array, this approach (in a loop) might be signifigantly faster assuming you're sorting along the long axis. It makes one wonder if using the python sort rather than qsort for Numeric.sort would be a profitable change. No time to investigate it right now though. Hope that's useful... -tim

Well, this is what I get with Python 2.1, NumPy 20.0.0, Linux on a 450Mhz PIII [cbarker@waves junk]$ python numtst.py 20.0.0 0.438 0.154 0.148 0.139 So whatever it was may have been fixed (or be a strange platform dependence). Note, you do want to be a bit careful about using time.time, as it measures real time, so if you have another process hogging resources, it will not be a fair measure. You can use time.closck instead, although I'm sure it has its issues as well. This is what I get with clock: [cbarker@waves junk]$ python numtst.py 20.0.0 0.440 0.160 0.140 0.140 There's not much going on on my machine right now, so little difference. -Chris -- Christopher Barker, Ph.D. ChrisHBarker@home.net --- --- --- http://members.home.net/barkerlohmann ---@@ -----@@ -----@@ ------@@@ ------@@@ ------@@@ Oil Spill Modeling ------ @ ------ @ ------ @ Water Resources Engineering ------- --------- -------- Coastal and Fluvial Hydrodynamics -------------------------------------- ------------------------------------------------------------------------

Chris Barker writes: [He doesn't get a slowdown with Numeric 20.0.0 on Linux]
It must just be the cleverness of your platform's qsort. Numeric delegates sorting to qsort and running qsort on Windows 2000 with Numeric 20.0.0 I get similar values to Bethold's. Also some rather naive calculations for the complexity of quicksort on a list of n items with k distinct values gives me O((log(n/k) + (n/k)) * n). That's probably not exactly right, but it matches the timings I get pretty well. -tim

I just took a look at the release notes for version 20 at: http://sourceforge.net/project/shownotes.php?release_id=31875 And it says: Release Name: 20 Notes: Requires Python 2.0 Shouldn't that be 2.1? or 2.0 or greater? -Chris -- Christopher Barker, Ph.D. ChrisHBarker@home.net --- --- --- http://members.home.net/barkerlohmann ---@@ -----@@ -----@@ ------@@@ ------@@@ ------@@@ Oil Spill Modeling ------ @ ------ @ ------ @ Water Resources Engineering ------- --------- -------- Coastal and Fluvial Hydrodynamics -------------------------------------- ------------------------------------------------------------------------

Hi Berthold, I tested your code on Win95 using Numeric 20.0.0 and got essentially the same pattern of times. So, upgrading your version of Numeric is not likely to help. Curious, I checked out the code for sort and found that it justs calls qsort from the C library. I suspect that the problem is related to quicksort having bad worst case behaviour. Not being a computer scientest, I can't tell you under what situations the bad behaviour is triggered, although I know it doesn't like presorted lists. Anyway, if you're using 1D arrays, one workaround would be to use list.sort. Python's sorting routines has, I understand, lots of gimmicks to avoid the problems that quicksort sometimes encounters. I tried it and this: r=(random((71400,))*7).astype(Int) l = r.tolist() l.sort() p = array(l) runs about 40 times faster than this: r=(random((71400,))*7).astype(Int) p = r.sort() If you don't need to convert to and from a array, this approach is 60 times faster. Even if you're dealing with a multidimensional array, this approach (in a loop) might be signifigantly faster assuming you're sorting along the long axis. It makes one wonder if using the python sort rather than qsort for Numeric.sort would be a profitable change. No time to investigate it right now though. Hope that's useful... -tim

Well, this is what I get with Python 2.1, NumPy 20.0.0, Linux on a 450Mhz PIII [cbarker@waves junk]$ python numtst.py 20.0.0 0.438 0.154 0.148 0.139 So whatever it was may have been fixed (or be a strange platform dependence). Note, you do want to be a bit careful about using time.time, as it measures real time, so if you have another process hogging resources, it will not be a fair measure. You can use time.closck instead, although I'm sure it has its issues as well. This is what I get with clock: [cbarker@waves junk]$ python numtst.py 20.0.0 0.440 0.160 0.140 0.140 There's not much going on on my machine right now, so little difference. -Chris -- Christopher Barker, Ph.D. ChrisHBarker@home.net --- --- --- http://members.home.net/barkerlohmann ---@@ -----@@ -----@@ ------@@@ ------@@@ ------@@@ Oil Spill Modeling ------ @ ------ @ ------ @ Water Resources Engineering ------- --------- -------- Coastal and Fluvial Hydrodynamics -------------------------------------- ------------------------------------------------------------------------

Chris Barker writes: [He doesn't get a slowdown with Numeric 20.0.0 on Linux]
It must just be the cleverness of your platform's qsort. Numeric delegates sorting to qsort and running qsort on Windows 2000 with Numeric 20.0.0 I get similar values to Bethold's. Also some rather naive calculations for the complexity of quicksort on a list of n items with k distinct values gives me O((log(n/k) + (n/k)) * n). That's probably not exactly right, but it matches the timings I get pretty well. -tim

I just took a look at the release notes for version 20 at: http://sourceforge.net/project/shownotes.php?release_id=31875 And it says: Release Name: 20 Notes: Requires Python 2.0 Shouldn't that be 2.1? or 2.0 or greater? -Chris -- Christopher Barker, Ph.D. ChrisHBarker@home.net --- --- --- http://members.home.net/barkerlohmann ---@@ -----@@ -----@@ ------@@@ ------@@@ ------@@@ Oil Spill Modeling ------ @ ------ @ ------ @ Water Resources Engineering ------- --------- -------- Coastal and Fluvial Hydrodynamics -------------------------------------- ------------------------------------------------------------------------
participants (3)
-
Chris Barker
-
hoel@germanlloyd.org
-
Tim Hochberg