[Tutor] sorting algorithm

Wed Mar 3 16:25:27 CET 2010

C.T. Matsumoto wrote:
> Hello,
>
> This is follow up on a question I had about algorithms. In the thread 
> it was suggested I make my own sorting algorithm.
>
> Here are my results.
>
> #!/usr/bin/python
>
> def sort_(list_):
>    for item1 in list_:
>        pos1 = list_.index(item1)
>        pos2 = pos1 + 1
>        try:
>            item2 = list_[pos2]
>        except IndexError:
>            pass
>
>        if item1 >= item2:
>            try:
>                list_.pop(pos2)
>                list_.insert(pos1, item2)
>                return True
>            except IndexError:
>                pass
>
> def mysorter(list_):
>    while sort_(list_) is True:
>        sort_(list_)
>
> I found this to be a great exercise. In doing the exercise, I got 
> pretty stuck. I consulted another programmer (my dad) who described 
> how to go about sorting. As it turned out the description he described 
> was the Bubble sort algorithm. Since coding the solution I know the 
> Bubble sort is inefficient because of repeated iterations over the 
> entire list. This shed light on the quick sort algorithm which I'd 
> like to have a go at.
>
> Something I haven't tried is sticking in really large lists. I was 
> told that with really large list you break down the input list into 
> smaller lists. Sort each list, then go back and use the same swapping 
> procedure for each of the different lists. My question is, at what 
> point to you start breaking things up? Is that based on list elements 
> or is it based on memory(?) resources python is using?
>
> One thing I'm not pleased about is the while loop and I'd like to 
> replace it with a for loop.
>
> Thanks,
>
> T
>
>
There are lots of references on the web about Quicksort, including a 
video at:
     http://www.youtube.com/watch?v=y_G9BkAm6B8

which I think illustrates it pretty well.  It would be a great learning 
exercise to implement Python code directly from that description, 
without using the sample C++ code available.

(Incidentally, there are lots of variants of Quicksort, so I'm not going 
to quibble about whether this is the "right" one to be called that.)

I don't know what your earlier thread was, since you don't mention the 
subject line, but there are a number of possible reasons you might not 
have wanted to use the built-in sort.  The best one is for educational 
purposes.  I've done my own sort for various reasons in the past, even 
though I had a library function, since the library function had some 
limits.  One time I recall, the situation was that the library sort was 
limited to 64k of total data, and I had to work with much larger arrays 
(this was in 16bit C++, in "large" model).  I solved the size problem by 
using the  C++ sort library on 16k subsets (because a pointer was 2*2 
bytes).  Then I merged the results of the sorts.  At the time, and in 
the circumstances involved, there were seldom more than a dozen or so 
sublists to merge, so this approach worked well enough.

Generally, it's better for both your development time and the efficiency 
and reliabilty of the end code, to base a new sort mechanism on the 
existing one.  In my case above, I was replacing what amounted to an 
insertion sort, and achieved a 50* improvement for a real customer.  It 
was fast enough that other factors completely dominated his running time.

But for learning purposes?  Great plan.  So now I'll respond to your 
other questions, and comment on your present algorithm.

It would be useful to understand about algorithmic complexity, the so 
called Order Function.  In a bubble sort, if you double the size of the 
array, you quadruple the number of comparisons and swaps.  It's order 
N-squared or O(n*n).   So what works well for an array of size 10 might 
take a very long time for an array of size 10000 (like a million times 
as long).  You can do much better by sorting smaller lists, and then 
combining them together.  Such an algorithm can  be O(n*log(n)).

You ask at what point you consider sublists?  In a language like C, the 
answer is when the list is size 3 or more.  For anything larger than 2, 
you divide into sublists, and work on them.

Now, if I may comment on your code.  You're modifying a list while 
you're iterating through it in a for loop.  In the most general case, 
that's undefined.  I think it's safe in this case, but I would avoid it 
anyway, by just using xrange(len(list_)-1) to iterate through it.  You 
use the index function to find something you would already know -- the 
index function is slow.  And the first try/except isn't needed if you 
use a -1 in the xrange argument, as I do above.

You use pop() and push() to exchange two adjacent items in the list.  
Both operations copy the remainder of the list, so they're rather slow.  
Since you're exchanging two items in the list, you can simply do that:
     list[pos1], list[pos2] = list[pos2], list[pos1]

That also eliminates the need for the second try/except.

You mention being bothered by the while loop.  You could replace it with 
a simple for loop with xrange(len(list_)), since you know that N passes 
will always be enough.  But if the list is partially sorted, your 
present scheme will end sooner.  And if it's fully sorted, it'll only 
take one pass over the data.

There are many refinements you could do.  For example, you don't have to 
stop the inner loop after the first swap.  You could finish the buffer, 
swapping any other pairs that are out of order.  You'd then be saving a 
flag indicating if you did any swaps.  You could keep a index pointing 
to the last pair you swapped on the previous pass, and use that for a 
limit next time.  Then you just terminate the outer loop when that limit 
value is 1.  You could even keep two limit values, and bubble back and 
forth between them, as they gradually close into the median of the 
list.  You quit when they collide in the middle.

The resultant function should be much faster for medium-sized lists, but 
it still will slow down quadratically as the list size increases.  You 
still need to divide and conquer, and quicksort is just one way of doing 
that.

DaveA