[Tutor] sorting algorithm

Thu Mar 11 21:09:56 CET 2010

C.T. Matsumoto wrote:
> Dave Angel wrote:
>> (You forgot to do a Reply-All, so your message went to just me, 
>> rather than to me and the list )
>>
>>
>> C.T. Matsumoto wrote:
>>> Dave Angel wrote:
>>>> C.T. Matsumoto wrote:
>>>>> Hello,
>>>>>
>>>>> This is follow up on a question I had about algorithms. In the 
>>>>> thread it was suggested I make my own sorting algorithm.
>>>>>
>>>>> Here are my results.
>>>>>
>>>>> #!/usr/bin/python
>>>>>
>>>>> def sort_(list_):
>>>>>    for item1 in list_:
>>>>>        pos1 = list_.index(item1)
>>>>>        pos2 = pos1 + 1
>>>>>        try:
>>>>>            item2 = list_[pos2]
>>>>>        except IndexError:
>>>>>            pass
>>>>>
>>>>>        if item1 >= item2:
>>>>>            try:
>>>>>                list_.pop(pos2)
>>>>>                list_.insert(pos1, item2)
>>>>>                return True
>>>>>            except IndexError:
>>>>>                pass
>>>>>
>>>>> def mysorter(list_):
>>>>>    while sort_(list_) is True:
>>>>>        sort_(list_)
>>>>>
>>>>> I found this to be a great exercise. In doing the exercise, I got 
>>>>> pretty stuck. I consulted another programmer (my dad) who 
>>>>> described how to go about sorting. As it turned out the 
>>>>> description he described was the Bubble sort algorithm. Since 
>>>>> coding the solution I know the Bubble sort is inefficient because 
>>>>> of repeated iterations over the entire list. This shed light on 
>>>>> the quick sort algorithm which I'd like to have a go at.
>>>>>
>>>>> Something I haven't tried is sticking in really large lists. I was 
>>>>> told that with really large list you break down the input list 
>>>>> into smaller lists. Sort each list, then go back and use the same 
>>>>> swapping procedure for each of the different lists. My question 
>>>>> is, at what point to you start breaking things up? Is that based 
>>>>> on list elements or is it based on memory(?) resources python is 
>>>>> using?
>>>>>
>>>>> One thing I'm not pleased about is the while loop and I'd like to 
>>>>> replace it with a for loop.
>>>>>
>>>>> Thanks,
>>>>>
>>>>> T
>>>>>
>>>>>
>>>> There are lots of references on the web about Quicksort, including 
>>>> a video at:
>>>>     http://www.youtube.com/watch?v=y_G9BkAm6B8
>>>>
>>>> which I think illustrates it pretty well.  It would be a great 
>>>> learning exercise to implement Python code directly from that 
>>>> description, without using the sample C++ code available.
>>>>
>>>> (Incidentally, there are lots of variants of Quicksort, so I'm not 
>>>> going to quibble about whether this is the "right" one to be called 
>>>> that.)
>>>>
>>>> I don't know what your earlier thread was, since you don't mention 
>>>> the subject line, but there are a number of possible reasons you 
>>>> might not have wanted to use the built-in sort.  The best one is 
>>>> for educational purposes.  I've done my own sort for various 
>>>> reasons in the past, even though I had a library function, since 
>>>> the library function had some limits.  One time I recall, the 
>>>> situation was that the library sort was limited to 64k of total 
>>>> data, and I had to work with much larger arrays (this was in 16bit 
>>>> C++, in "large" model).  I solved the size problem by using the  
>>>> C++ sort library on 16k subsets (because a pointer was 2*2 bytes).  
>>>> Then I merged the results of the sorts.  At the time, and in the 
>>>> circumstances involved, there were seldom more than a dozen or so 
>>>> sublists to merge, so this approach worked well enough.
>>>>
>>>> Generally, it's better for both your development time and the 
>>>> efficiency and reliabilty of the end code, to base a new sort 
>>>> mechanism on the existing one.  In my case above, I was replacing 
>>>> what amounted to an insertion sort, and achieved a 50* improvement 
>>>> for a real customer.  It was fast enough that other factors 
>>>> completely dominated his running time.
>>>>
>>>> But for learning purposes?  Great plan.  So now I'll respond to 
>>>> your other questions, and comment on your present algorithm.
>>>>
>>>> It would be useful to understand about algorithmic complexity, the 
>>>> so called Order Function.  In a bubble sort, if you double the size 
>>>> of the array, you quadruple the number of comparisons and swaps.  
>>>> It's order N-squared or O(n*n).   So what works well for an array 
>>>> of size 10 might take a very long time for an array of size 10000 
>>>> (like a million times as long).  You can do much better by sorting 
>>>> smaller lists, and then combining them together.  Such an algorithm 
>>>> can  be O(n*log(n)).
>>>>
>>>>
>>>> You ask at what point you consider sublists?  In a language like C, 
>>>> the answer is when the list is size 3 or more.  For anything larger 
>>>> than 2, you divide into sublists, and work on them.
>>>>
>>>> Now, if I may comment on your code.  You're modifying a list while 
>>>> you're iterating through it in a for loop.  In the most general 
>>>> case, that's undefined.  I think it's safe in this case, but I 
>>>> would avoid it anyway, by just using xrange(len(list_)-1) to 
>>>> iterate through it.  You use the index function to find something 
>>>> you would already know -- the index function is slow.  And the 
>>>> first try/except isn't needed if you use a -1 in the xrange 
>>>> argument, as I do above.
>>>>
>>>> You use pop() and push() to exchange two adjacent items in the 
>>>> list.  Both operations copy the remainder of the list, so they're 
>>>> rather slow.  Since you're exchanging two items in the list, you 
>>>> can simply do that:
>>>>     list[pos1], list[pos2] = list[pos2], list[pos1]
>>>>
>>>> That also eliminates the need for the second try/except.
>>>>
>>>> You mention being bothered by the while loop.  You could replace it 
>>>> with a simple for loop with xrange(len(list_)), since you know that 
>>>> N passes will always be enough.  But if the list is partially 
>>>> sorted, your present scheme will end sooner.  And if it's fully 
>>>> sorted, it'll only take one pass over the data.
>>>>
>>>> There are many refinements you could do.  For example, you don't 
>>>> have to stop the inner loop after the first swap.  You could finish 
>>>> the buffer, swapping any other pairs that are out of order.  You'd 
>>>> then be saving a flag indicating if you did any swaps.  You could 
>>>> keep a index pointing to the last pair you swapped on the previous 
>>>> pass, and use that for a limit next time.  Then you just terminate 
>>>> the outer loop when that limit value is 1.  You could even keep two 
>>>> limit values, and bubble back and forth between them, as they 
>>>> gradually close into the median of the list.  You quit when they 
>>>> collide in the middle.
>>>>
>>>> The resultant function should be much faster for medium-sized 
>>>> lists, but it still will slow down quadratically as the list size 
>>>> increases.  You still need to divide and conquer, and quicksort is 
>>>> just one way of doing that.
>>>>
>>>> DaveA
>>>>
>>> Thanks a lot Dave,
>>>
>>> Sorry the original thread is called 'Python and algorithms'.
>>>
>>> Yes, I think it's best to use what python provides and build on top 
>>> of that. I got to asking my original question based on trying to 
>>> learn more about algorithms in general, through python. Of late many 
>>> people have been asking me how well I can 'build' algorithms, and 
>>> this prompted me to start the thread. This is for learning purposes 
>>> (which the original thread will give you and indication where I'm 
>>> coming from).
>>>
>>> The refactored code looks like this. I have tackled a couple items. 
>>> First the sub-listing (which I'll wait till I can get the full sort 
>>> working), then the last couple of paragraphs about refinements. 
>>> Starting with the first refinement, I'm not sure how *not* to stop 
>>> the inner loop?
>>>
>>> def s2(list_):
>>>    for pos1 in xrange(len(list_)-1):
>>>        item1 = list_[pos1]
>>>        pos2 = pos1 + 1
>>>        item2 = list_[pos2]
>>>        if item1 >= item2:
>>>            list_[pos1], list_[pos2] = list_[pos2], list_[pos1]
>>>            return True
>>>
>>> def mysorter(list_):
>>>    # This is the outer loop?
>>>    while s2(list_) is True:
>>>        # Calling s2 kicks off the inner loop?
>>>        s2(list_)
>>>
>>> if __name__ == '__main__':
>>>    from random import shuffle
>>>    foo = range(10)
>>>    shuffle(foo)
>>>    mysorter(foo)
>>>
>>>
>>> Thanks again.
>>>
>> As before, I'm not actually trying this code, so there may be typos.  
>> But assuming your code here works, the next refinement would be:
>>
>> In s2() function, add a flag variable, initially False.  Then instead 
>> of the return True, just say flag=True
>> Then at the end of the function, return flag
>>
>> About the while loop.  No need to say 'is True'  just use while 
>> s2(list_):  And no need to call s2() a second time.
>>
>> while s2(list_):
>>     pass
> Okay up to here I follow. This all makes sense.
>
> def s2(list_):
>     flag = False
>     for pos1 in xrange(len(list_)-1):
>         item1 = list_[pos1]
>         pos2 = pos1 + 1
>         item2 = list_[pos2]
>         if item1 >= item2:
>             list_[pos1], list_[pos2] = list_[pos2], list_[pos1]
>             flag = True       
>             return flag
>
> def mysorter(list_):
>     while s2(list_):
>         pass
>>
>> Before you can refine the upper limit, you need a way to preserve it 
>> between calls.  Simplest way to do that is to combine the two 
>> functions, as a nested loop.  Then, instead of flag, you can have a 
>> value "limit" which indicates what index was last swapped.  And the 
>> inner loop uses that as an upper limit on its xrange.
> Where I start to get confused is refining the 'upper limit'. What is 
> the upper limit defining? I'm guessing it is the last position processed.
>
> T
>
The upper limit is the index of the last swap you had to make.  That'd 
be either pos1 or pos2, I forget which.  Anyway, the inner loop goes to 
that limit, instead of to the len(_list)  The idea is that anything 
beyond that point is already sorted, and already contains the highest 
elements, so you don't need to visit it again.

This is all from memory, so I think I'm right, but not positive.  It's 
been many years since I had to actually code a sort loop by hand.  
Probably 20 or so.

DaveA