[Tutor] sorting algorithm

Fri Mar 12 13:20:55 CET 2010

Dave Angel wrote:
> C.T. Matsumoto wrote:
>> Dave Angel wrote:
>>> (You forgot to do a Reply-All, so your message went to just me, 
>>> rather than to me and the list )
>>>
>>>
>>> C.T. Matsumoto wrote:
>>>> Dave Angel wrote:
>>>>> C.T. Matsumoto wrote:
>>>>>> Hello,
>>>>>>
>>>>>> This is follow up on a question I had about algorithms. In the 
>>>>>> thread it was suggested I make my own sorting algorithm.
>>>>>>
>>>>>> Here are my results.
>>>>>>
>>>>>> #!/usr/bin/python
>>>>>>
>>>>>> def sort_(list_):
>>>>>>    for item1 in list_:
>>>>>>        pos1 = list_.index(item1)
>>>>>>        pos2 = pos1 + 1
>>>>>>        try:
>>>>>>            item2 = list_[pos2]
>>>>>>        except IndexError:
>>>>>>            pass
>>>>>>
>>>>>>        if item1 >= item2:
>>>>>>            try:
>>>>>>                list_.pop(pos2)
>>>>>>                list_.insert(pos1, item2)
>>>>>>                return True
>>>>>>            except IndexError:
>>>>>>                pass
>>>>>>
>>>>>> def mysorter(list_):
>>>>>>    while sort_(list_) is True:
>>>>>>        sort_(list_)
>>>>>>
>>>>>> I found this to be a great exercise. In doing the exercise, I got 
>>>>>> pretty stuck. I consulted another programmer (my dad) who 
>>>>>> described how to go about sorting. As it turned out the 
>>>>>> description he described was the Bubble sort algorithm. Since 
>>>>>> coding the solution I know the Bubble sort is inefficient because 
>>>>>> of repeated iterations over the entire list. This shed light on 
>>>>>> the quick sort algorithm which I'd like to have a go at.
>>>>>>
>>>>>> Something I haven't tried is sticking in really large lists. I was 
>>>>>> told that with really large list you break down the input list 
>>>>>> into smaller lists. Sort each list, then go back and use the same 
>>>>>> swapping procedure for each of the different lists. My question 
>>>>>> is, at what point to you start breaking things up? Is that based 
>>>>>> on list elements or is it based on memory(?) resources python is 
>>>>>> using?
>>>>>>
>>>>>> One thing I'm not pleased about is the while loop and I'd like to 
>>>>>> replace it with a for loop.
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> T
>>>>>>
>>>>>>
>>>>> There are lots of references on the web about Quicksort, including 
>>>>> a video at:
>>>>>     http://www.youtube.com/watch?v=y_G9BkAm6B8
>>>>>
>>>>> which I think illustrates it pretty well.  It would be a great 
>>>>> learning exercise to implement Python code directly from that 
>>>>> description, without using the sample C++ code available.
>>>>>
>>>>> (Incidentally, there are lots of variants of Quicksort, so I'm not 
>>>>> going to quibble about whether this is the "right" one to be called 
>>>>> that.)
>>>>>
>>>>> I don't know what your earlier thread was, since you don't mention 
>>>>> the subject line, but there are a number of possible reasons you 
>>>>> might not have wanted to use the built-in sort.  The best one is 
>>>>> for educational purposes.  I've done my own sort for various 
>>>>> reasons in the past, even though I had a library function, since 
>>>>> the library function had some limits.  One time I recall, the 
>>>>> situation was that the library sort was limited to 64k of total 
>>>>> data, and I had to work with much larger arrays (this was in 16bit 
>>>>> C++, in "large" model).  I solved the size problem by using the  
>>>>> C++ sort library on 16k subsets (because a pointer was 2*2 bytes).  
>>>>> Then I merged the results of the sorts.  At the time, and in the 
>>>>> circumstances involved, there were seldom more than a dozen or so 
>>>>> sublists to merge, so this approach worked well enough.
>>>>>
>>>>> Generally, it's better for both your development time and the 
>>>>> efficiency and reliabilty of the end code, to base a new sort 
>>>>> mechanism on the existing one.  In my case above, I was replacing 
>>>>> what amounted to an insertion sort, and achieved a 50* improvement 
>>>>> for a real customer.  It was fast enough that other factors 
>>>>> completely dominated his running time.
>>>>>
>>>>> But for learning purposes?  Great plan.  So now I'll respond to 
>>>>> your other questions, and comment on your present algorithm.
>>>>>
>>>>> It would be useful to understand about algorithmic complexity, the 
>>>>> so called Order Function.  In a bubble sort, if you double the size 
>>>>> of the array, you quadruple the number of comparisons and swaps.  
>>>>> It's order N-squared or O(n*n).   So what works well for an array 
>>>>> of size 10 might take a very long time for an array of size 10000 
>>>>> (like a million times as long).  You can do much better by sorting 
>>>>> smaller lists, and then combining them together.  Such an algorithm 
>>>>> can  be O(n*log(n)).
>>>>>
>>>>>
>>>>> You ask at what point you consider sublists?  In a language like C, 
>>>>> the answer is when the list is size 3 or more.  For anything larger 
>>>>> than 2, you divide into sublists, and work on them.
>>>>>
>>>>> Now, if I may comment on your code.  You're modifying a list while 
>>>>> you're iterating through it in a for loop.  In the most general 
>>>>> case, that's undefined.  I think it's safe in this case, but I 
>>>>> would avoid it anyway, by just using xrange(len(list_)-1) to 
>>>>> iterate through it.  You use the index function to find something 
>>>>> you would already know -- the index function is slow.  And the 
>>>>> first try/except isn't needed if you use a -1 in the xrange 
>>>>> argument, as I do above.
>>>>>
>>>>> You use pop() and push() to exchange two adjacent items in the 
>>>>> list.  Both operations copy the remainder of the list, so they're 
>>>>> rather slow.  Since you're exchanging two items in the list, you 
>>>>> can simply do that:
>>>>>     list[pos1], list[pos2] = list[pos2], list[pos1]
>>>>>
>>>>> That also eliminates the need for the second try/except.
>>>>>
>>>>> You mention being bothered by the while loop.  You could replace it 
>>>>> with a simple for loop with xrange(len(list_)), since you know that 
>>>>> N passes will always be enough.  But if the list is partially 
>>>>> sorted, your present scheme will end sooner.  And if it's fully 
>>>>> sorted, it'll only take one pass over the data.
>>>>>
>>>>> There are many refinements you could do.  For example, you don't 
>>>>> have to stop the inner loop after the first swap.  You could finish 
>>>>> the buffer, swapping any other pairs that are out of order.  You'd 
>>>>> then be saving a flag indicating if you did any swaps.  You could 
>>>>> keep a index pointing to the last pair you swapped on the previous 
>>>>> pass, and use that for a limit next time.  Then you just terminate 
>>>>> the outer loop when that limit value is 1.  You could even keep two 
>>>>> limit values, and bubble back and forth between them, as they 
>>>>> gradually close into the median of the list.  You quit when they 
>>>>> collide in the middle.
>>>>>
>>>>> The resultant function should be much faster for medium-sized 
>>>>> lists, but it still will slow down quadratically as the list size 
>>>>> increases.  You still need to divide and conquer, and quicksort is 
>>>>> just one way of doing that.
>>>>>
>>>>> DaveA
>>>>>
>>>> Thanks a lot Dave,
>>>>
>>>> Sorry the original thread is called 'Python and algorithms'.
>>>>
>>>> Yes, I think it's best to use what python provides and build on top 
>>>> of that. I got to asking my original question based on trying to 
>>>> learn more about algorithms in general, through python. Of late many 
>>>> people have been asking me how well I can 'build' algorithms, and 
>>>> this prompted me to start the thread. This is for learning purposes 
>>>> (which the original thread will give you and indication where I'm 
>>>> coming from).
>>>>
>>>> The refactored code looks like this. I have tackled a couple items. 
>>>> First the sub-listing (which I'll wait till I can get the full sort 
>>>> working), then the last couple of paragraphs about refinements. 
>>>> Starting with the first refinement, I'm not sure how *not* to stop 
>>>> the inner loop?
>>>>
>>>> def s2(list_):
>>>>    for pos1 in xrange(len(list_)-1):
>>>>        item1 = list_[pos1]
>>>>        pos2 = pos1 + 1
>>>>        item2 = list_[pos2]
>>>>        if item1 >= item2:
>>>>            list_[pos1], list_[pos2] = list_[pos2], list_[pos1]
>>>>            return True
>>>>
>>>> def mysorter(list_):
>>>>    # This is the outer loop?
>>>>    while s2(list_) is True:
>>>>        # Calling s2 kicks off the inner loop?
>>>>        s2(list_)
>>>>
>>>> if __name__ == '__main__':
>>>>    from random import shuffle
>>>>    foo = range(10)
>>>>    shuffle(foo)
>>>>    mysorter(foo)
>>>>
>>>>
>>>> Thanks again.
>>>>
>>> As before, I'm not actually trying this code, so there may be typos.  
>>> But assuming your code here works, the next refinement would be:
>>>
>>> In s2() function, add a flag variable, initially False.  Then instead 
>>> of the return True, just say flag=True
>>> Then at the end of the function, return flag
>>>
>>> About the while loop.  No need to say 'is True'  just use while 
>>> s2(list_):  And no need to call s2() a second time.
>>>
>>> while s2(list_):
>>>     pass
>> Okay up to here I follow. This all makes sense.
>>
>> def s2(list_):
>>     flag = False
>>     for pos1 in xrange(len(list_)-1):
>>         item1 = list_[pos1]
>>         pos2 = pos1 + 1
>>         item2 = list_[pos2]
>>         if item1 >= item2:
>>             list_[pos1], list_[pos2] = list_[pos2], list_[pos1]
>>             flag = True                   return flag
>>
>> def mysorter(list_):
>>     while s2(list_):
>>         pass
>>>
>>> Before you can refine the upper limit, you need a way to preserve it 
>>> between calls.  Simplest way to do that is to combine the two 
>>> functions, as a nested loop.  Then, instead of flag, you can have a 
>>> value "limit" which indicates what index was last swapped.  And the 
>>> inner loop uses that as an upper limit on its xrange.
>> Where I start to get confused is refining the 'upper limit'. What is 
>> the upper limit defining? I'm guessing it is the last position processed.
>>
>> T
>>
> The upper limit is the index of the last swap you had to make.  That'd 
> be either pos1 or pos2, I forget which.  Anyway, the inner loop goes to 
> that limit, instead of to the len(_list)  The idea is that anything 
> beyond that point is already sorted, and already contains the highest 
> elements, so you don't need to visit it again.
> 
> This is all from memory, so I think I'm right, but not positive.  It's 
> been many years since I had to actually code a sort loop by hand.  
> Probably 20 or so.
> 
> DaveA
> 
> 
Alright, I got the idea of the upper limit and feeding it to xrange, 
which can take a start argument. The inner loop code looks like this:

while <not sure>:
     for pos1 in xrange(limit, len(list_)-1):
         pos2 = pos1 + 1
         item1 = list_[pos1]
         item2 = list_[pos2]
         if item1 > item2:
             list_[pos1], list_[pos2] = list_[pos2], list_[pos1]
             limit = pos2
             return limit

what is the initial value of limit? and what does while use as a truth 
value?

BTW thanks for looking back, much appreciated.

T