[Tutor] sorting algorithm

Thu Mar 11 13:01:53 CET 2010

Dave Angel wrote:
> (You forgot to do a Reply-All, so your message went to just me, rather 
> than to me and the list )
> 
> 
> C.T. Matsumoto wrote:
>> Dave Angel wrote:
>>> C.T. Matsumoto wrote:
>>>> Hello,
>>>>
>>>> This is follow up on a question I had about algorithms. In the 
>>>> thread it was suggested I make my own sorting algorithm.
>>>>
>>>> Here are my results.
>>>>
>>>> #!/usr/bin/python
>>>>
>>>> def sort_(list_):
>>>>    for item1 in list_:
>>>>        pos1 = list_.index(item1)
>>>>        pos2 = pos1 + 1
>>>>        try:
>>>>            item2 = list_[pos2]
>>>>        except IndexError:
>>>>            pass
>>>>
>>>>        if item1 >= item2:
>>>>            try:
>>>>                list_.pop(pos2)
>>>>                list_.insert(pos1, item2)
>>>>                return True
>>>>            except IndexError:
>>>>                pass
>>>>
>>>> def mysorter(list_):
>>>>    while sort_(list_) is True:
>>>>        sort_(list_)
>>>>
>>>> I found this to be a great exercise. In doing the exercise, I got 
>>>> pretty stuck. I consulted another programmer (my dad) who described 
>>>> how to go about sorting. As it turned out the description he 
>>>> described was the Bubble sort algorithm. Since coding the solution I 
>>>> know the Bubble sort is inefficient because of repeated iterations 
>>>> over the entire list. This shed light on the quick sort algorithm 
>>>> which I'd like to have a go at.
>>>>
>>>> Something I haven't tried is sticking in really large lists. I was 
>>>> told that with really large list you break down the input list into 
>>>> smaller lists. Sort each list, then go back and use the same 
>>>> swapping procedure for each of the different lists. My question is, 
>>>> at what point to you start breaking things up? Is that based on list 
>>>> elements or is it based on memory(?) resources python is using?
>>>>
>>>> One thing I'm not pleased about is the while loop and I'd like to 
>>>> replace it with a for loop.
>>>>
>>>> Thanks,
>>>>
>>>> T
>>>>
>>>>
>>> There are lots of references on the web about Quicksort, including a 
>>> video at:
>>>     http://www.youtube.com/watch?v=y_G9BkAm6B8
>>>
>>> which I think illustrates it pretty well.  It would be a great 
>>> learning exercise to implement Python code directly from that 
>>> description, without using the sample C++ code available.
>>>
>>> (Incidentally, there are lots of variants of Quicksort, so I'm not 
>>> going to quibble about whether this is the "right" one to be called 
>>> that.)
>>>
>>> I don't know what your earlier thread was, since you don't mention 
>>> the subject line, but there are a number of possible reasons you 
>>> might not have wanted to use the built-in sort.  The best one is for 
>>> educational purposes.  I've done my own sort for various reasons in 
>>> the past, even though I had a library function, since the library 
>>> function had some limits.  One time I recall, the situation was that 
>>> the library sort was limited to 64k of total data, and I had to work 
>>> with much larger arrays (this was in 16bit C++, in "large" model).  I 
>>> solved the size problem by using the  C++ sort library on 16k subsets 
>>> (because a pointer was 2*2 bytes).  Then I merged the results of the 
>>> sorts.  At the time, and in the circumstances involved, there were 
>>> seldom more than a dozen or so sublists to merge, so this approach 
>>> worked well enough.
>>>
>>> Generally, it's better for both your development time and the 
>>> efficiency and reliabilty of the end code, to base a new sort 
>>> mechanism on the existing one.  In my case above, I was replacing 
>>> what amounted to an insertion sort, and achieved a 50* improvement 
>>> for a real customer.  It was fast enough that other factors 
>>> completely dominated his running time.
>>>
>>> But for learning purposes?  Great plan.  So now I'll respond to your 
>>> other questions, and comment on your present algorithm.
>>>
>>> It would be useful to understand about algorithmic complexity, the so 
>>> called Order Function.  In a bubble sort, if you double the size of 
>>> the array, you quadruple the number of comparisons and swaps.  It's 
>>> order N-squared or O(n*n).   So what works well for an array of size 
>>> 10 might take a very long time for an array of size 10000 (like a 
>>> million times as long).  You can do much better by sorting smaller 
>>> lists, and then combining them together.  Such an algorithm can  be 
>>> O(n*log(n)).
>>>
>>>
>>> You ask at what point you consider sublists?  In a language like C, 
>>> the answer is when the list is size 3 or more.  For anything larger 
>>> than 2, you divide into sublists, and work on them.
>>>
>>> Now, if I may comment on your code.  You're modifying a list while 
>>> you're iterating through it in a for loop.  In the most general case, 
>>> that's undefined.  I think it's safe in this case, but I would avoid 
>>> it anyway, by just using xrange(len(list_)-1) to iterate through it.  
>>> You use the index function to find something you would already know 
>>> -- the index function is slow.  And the first try/except isn't needed 
>>> if you use a -1 in the xrange argument, as I do above.
>>>
>>> You use pop() and push() to exchange two adjacent items in the list.  
>>> Both operations copy the remainder of the list, so they're rather 
>>> slow.  Since you're exchanging two items in the list, you can simply 
>>> do that:
>>>     list[pos1], list[pos2] = list[pos2], list[pos1]
>>>
>>> That also eliminates the need for the second try/except.
>>>
>>> You mention being bothered by the while loop.  You could replace it 
>>> with a simple for loop with xrange(len(list_)), since you know that N 
>>> passes will always be enough.  But if the list is partially sorted, 
>>> your present scheme will end sooner.  And if it's fully sorted, it'll 
>>> only take one pass over the data.
>>>
>>> There are many refinements you could do.  For example, you don't have 
>>> to stop the inner loop after the first swap.  You could finish the 
>>> buffer, swapping any other pairs that are out of order.  You'd then 
>>> be saving a flag indicating if you did any swaps.  You could keep a 
>>> index pointing to the last pair you swapped on the previous pass, and 
>>> use that for a limit next time.  Then you just terminate the outer 
>>> loop when that limit value is 1.  You could even keep two limit 
>>> values, and bubble back and forth between them, as they gradually 
>>> close into the median of the list.  You quit when they collide in the 
>>> middle.
>>>
>>> The resultant function should be much faster for medium-sized lists, 
>>> but it still will slow down quadratically as the list size 
>>> increases.  You still need to divide and conquer, and quicksort is 
>>> just one way of doing that.
>>>
>>> DaveA
>>>
>> Thanks a lot Dave,
>>
>> Sorry the original thread is called 'Python and algorithms'.
>>
>> Yes, I think it's best to use what python provides and build on top of 
>> that. I got to asking my original question based on trying to learn 
>> more about algorithms in general, through python. Of late many people 
>> have been asking me how well I can 'build' algorithms, and this 
>> prompted me to start the thread. This is for learning purposes (which 
>> the original thread will give you and indication where I'm coming from).
>>
>> The refactored code looks like this. I have tackled a couple items. 
>> First the sub-listing (which I'll wait till I can get the full sort 
>> working), then the last couple of paragraphs about refinements. 
>> Starting with the first refinement, I'm not sure how *not* to stop the 
>> inner loop?
>>
>> def s2(list_):
>>    for pos1 in xrange(len(list_)-1):
>>        item1 = list_[pos1]
>>        pos2 = pos1 + 1
>>        item2 = list_[pos2]
>>        if item1 >= item2:
>>            list_[pos1], list_[pos2] = list_[pos2], list_[pos1]
>>            return True
>>
>> def mysorter(list_):
>>    # This is the outer loop?
>>    while s2(list_) is True:
>>        # Calling s2 kicks off the inner loop?
>>        s2(list_)
>>
>> if __name__ == '__main__':
>>    from random import shuffle
>>    foo = range(10)
>>    shuffle(foo)
>>    mysorter(foo)
>>
>>
>> Thanks again.
>>
> As before, I'm not actually trying this code, so there may be typos.  
> But assuming your code here works, the next refinement would be:
> 
> In s2() function, add a flag variable, initially False.  Then instead of 
> the return True, just say flag=True
> Then at the end of the function, return flag
> 
> About the while loop.  No need to say 'is True'  just use while 
> s2(list_):  And no need to call s2() a second time.
> 
> while s2(list_):
>     pass
Okay up to here I follow. This all makes sense.

def s2(list_):
     flag = False
     for pos1 in xrange(len(list_)-1):
         item1 = list_[pos1]
         pos2 = pos1 + 1
         item2 = list_[pos2]
         if item1 >= item2:
             list_[pos1], list_[pos2] = list_[pos2], list_[pos1]
             flag = True		
             return flag

def mysorter(list_):
     while s2(list_):
         pass
> 
> Before you can refine the upper limit, you need a way to preserve it 
> between calls.  Simplest way to do that is to combine the two functions, 
> as a nested loop.  Then, instead of flag, you can have a value "limit" 
> which indicates what index was last swapped.  And the inner loop uses 
> that as an upper limit on its xrange.
Where I start to get confused is refining the 'upper limit'. What is the 
upper limit defining? I'm guessing it is the last position processed.

T
> 
> DaveA
> 
> _______________________________________________
> Tutor maillist  -  Tutor at python.org
> To unsubscribe or change subscription options:
> http://mail.python.org/mailman/listinfo/tutor
>