[Tutor] sorting the list

Thu Feb 27 16:29:01 2003

Don Arnold wrote:

>>>>>mysort() - 1.982278 elapsed
>>>>>DSU      - 3.472156 elapsed
>>>>>          
>>>>>
>
>It looks like using DSU is actually slower than providing your own sort
>method. This agrees with my
>intuition that the overhead of creating the decorated list then undecorating
>it can be sizable, but we
>all know intuition can often be wrong. ; )  Is this really the case, or is
>there a problem with my
>implementation?
>  
>

I think it's partially a matter of the specifics of this case.  Because 
this data requires an extra int() call for every element, the time taken 
by creating all those ints starts to swamp the time taken by the extra 
cmp() function calls.  Typically, DSU is used to sort by the nth 
(existing) element of a sequence, rather than having to create new 
objects to populate the decorated list.  Also, given the fact that (for 
the test data used) every comparison is being resolved by the second 
element, the DSU requires more comparisons than your sort.  I suspect 
that if the test data included many different values for the initial 
letter, the times would be much closer.  

More importantly, though, you're doing DSU wrong.  Instead of decorating 
and undecorating, you're decomposing and recreating.  You should tack 
the original item as the final element of the tuple, and then simply 
extract that element from each item in the list when you undecorate. 
 Here's my tests.

 >>> import time
 >>> def timetest(func, testlist)
...     starttime = time.clock()
...     func(testlist)
...     endtime = time.clock()
...     print "%10s -- %06f elapsed" % (func.__name__, endtime - starttime)
...    
 >>> def mysort(lhs, rhs):
...     if lhs[0] != rhs[0]:
...         return cmp(lhs[0], rhs[0])
...     else:
...         return cmp(int(lhs[1:]), int(rhs[1:]))
...    
 >>> def test_mysort(testlist):
...     testlist.sort(mysort)
...    
 >>> def test_DSU(testlist):
...     D = [(e[0], int(e[1:]), e) for e in testlist]
...     D.sort()
...     testlist = [item[-1] for item in D]
...    
 >>> a = ['a' + str(i) for i in xrange(100000,-1,-1)]
 >>> timetest(test_mysort, a)
test_mysort -- 1.468711 elapsed
 >>> a = ['a' + str(i) for i in xrange(100000,-1,-1)]
 >>> timetest(test_DSU, a)
  test_DSU -- 1.708165 elapsed
 >>>

So, simply by fixing the DSU algorithm, we've made them fairly 
comparable.  Now, let's try it with a wider assortment of initial letters:

 >>> import string
 >>> a = [string.lowercase[i % 26] + str(i) for i in xrange(100000,-1,-1)]
 >>> timetest(test_mysort, a)
test_mysort -- 19.758318 elapsed
 >>> a = [string.lowercase[i % 26] + str(i) for i in xrange(100000,-1,-1)]
 >>> timetest(test_DSU, a)
test_DSU -- 4.045212 elapsed

Now, *there's* a significant difference!  So it looks like the main 
advantage of your sort routine was that it had some short-circuiting 
built in, which happened to be advantageous with the data set you were 
testing but isn't always going to make much difference.  DSU is pretty 
close to equal even in that special case, but is well ahead in the more 
general case.

Jeff Shannon
Technician/Programmer
Credit International