[Tutor] sorting algorithm

Dave Angel davea at ieee.org
Wed Mar 3 18:56:41 CET 2010

```(You forgot to do a Reply-All, so your message went to just me, rather
than to me and the list )

C.T. Matsumoto wrote:
> Dave Angel wrote:
>> C.T. Matsumoto wrote:
>>> Hello,
>>>
>>> thread it was suggested I make my own sorting algorithm.
>>>
>>> Here are my results.
>>>
>>> #!/usr/bin/python
>>>
>>> def sort_(list_):
>>>    for item1 in list_:
>>>        pos1 = list_.index(item1)
>>>        pos2 = pos1 + 1
>>>        try:
>>>            item2 = list_[pos2]
>>>        except IndexError:
>>>            pass
>>>
>>>        if item1 >= item2:
>>>            try:
>>>                list_.pop(pos2)
>>>                list_.insert(pos1, item2)
>>>                return True
>>>            except IndexError:
>>>                pass
>>>
>>> def mysorter(list_):
>>>    while sort_(list_) is True:
>>>        sort_(list_)
>>>
>>> I found this to be a great exercise. In doing the exercise, I got
>>> pretty stuck. I consulted another programmer (my dad) who described
>>> how to go about sorting. As it turned out the description he
>>> described was the Bubble sort algorithm. Since coding the solution I
>>> know the Bubble sort is inefficient because of repeated iterations
>>> over the entire list. This shed light on the quick sort algorithm
>>> which I'd like to have a go at.
>>>
>>> Something I haven't tried is sticking in really large lists. I was
>>> told that with really large list you break down the input list into
>>> smaller lists. Sort each list, then go back and use the same
>>> swapping procedure for each of the different lists. My question is,
>>> at what point to you start breaking things up? Is that based on list
>>> elements or is it based on memory(?) resources python is using?
>>>
>>> One thing I'm not pleased about is the while loop and I'd like to
>>> replace it with a for loop.
>>>
>>> Thanks,
>>>
>>> T
>>>
>>>
>> There are lots of references on the web about Quicksort, including a
>> video at:
>>
>> which I think illustrates it pretty well.  It would be a great
>> learning exercise to implement Python code directly from that
>> description, without using the sample C++ code available.
>>
>> (Incidentally, there are lots of variants of Quicksort, so I'm not
>> going to quibble about whether this is the "right" one to be called
>> that.)
>>
>> I don't know what your earlier thread was, since you don't mention
>> the subject line, but there are a number of possible reasons you
>> might not have wanted to use the built-in sort.  The best one is for
>> educational purposes.  I've done my own sort for various reasons in
>> the past, even though I had a library function, since the library
>> function had some limits.  One time I recall, the situation was that
>> the library sort was limited to 64k of total data, and I had to work
>> with much larger arrays (this was in 16bit C++, in "large" model).  I
>> solved the size problem by using the  C++ sort library on 16k subsets
>> (because a pointer was 2*2 bytes).  Then I merged the results of the
>> sorts.  At the time, and in the circumstances involved, there were
>> seldom more than a dozen or so sublists to merge, so this approach
>> worked well enough.
>>
>> Generally, it's better for both your development time and the
>> efficiency and reliabilty of the end code, to base a new sort
>> mechanism on the existing one.  In my case above, I was replacing
>> what amounted to an insertion sort, and achieved a 50* improvement
>> for a real customer.  It was fast enough that other factors
>> completely dominated his running time.
>>
>> But for learning purposes?  Great plan.  So now I'll respond to your
>> other questions, and comment on your present algorithm.
>>
>> It would be useful to understand about algorithmic complexity, the so
>> called Order Function.  In a bubble sort, if you double the size of
>> the array, you quadruple the number of comparisons and swaps.  It's
>> order N-squared or O(n*n).   So what works well for an array of size
>> 10 might take a very long time for an array of size 10000 (like a
>> million times as long).  You can do much better by sorting smaller
>> lists, and then combining them together.  Such an algorithm can  be
>> O(n*log(n)).
>>
>>
>> You ask at what point you consider sublists?  In a language like C,
>> the answer is when the list is size 3 or more.  For anything larger
>> than 2, you divide into sublists, and work on them.
>>
>> Now, if I may comment on your code.  You're modifying a list while
>> you're iterating through it in a for loop.  In the most general case,
>> that's undefined.  I think it's safe in this case, but I would avoid
>> it anyway, by just using xrange(len(list_)-1) to iterate through it.
>> You use the index function to find something you would already know
>> -- the index function is slow.  And the first try/except isn't needed
>> if you use a -1 in the xrange argument, as I do above.
>>
>> You use pop() and push() to exchange two adjacent items in the list.
>> Both operations copy the remainder of the list, so they're rather
>> slow.  Since you're exchanging two items in the list, you can simply
>> do that:
>>     list[pos1], list[pos2] = list[pos2], list[pos1]
>>
>> That also eliminates the need for the second try/except.
>>
>> You mention being bothered by the while loop.  You could replace it
>> with a simple for loop with xrange(len(list_)), since you know that N
>> passes will always be enough.  But if the list is partially sorted,
>> your present scheme will end sooner.  And if it's fully sorted, it'll
>> only take one pass over the data.
>>
>> There are many refinements you could do.  For example, you don't have
>> to stop the inner loop after the first swap.  You could finish the
>> buffer, swapping any other pairs that are out of order.  You'd then
>> be saving a flag indicating if you did any swaps.  You could keep a
>> index pointing to the last pair you swapped on the previous pass, and
>> use that for a limit next time.  Then you just terminate the outer
>> loop when that limit value is 1.  You could even keep two limit
>> values, and bubble back and forth between them, as they gradually
>> close into the median of the list.  You quit when they collide in the
>> middle.
>>
>> The resultant function should be much faster for medium-sized lists,
>> but it still will slow down quadratically as the list size
>> increases.  You still need to divide and conquer, and quicksort is
>> just one way of doing that.
>>
>> DaveA
>>
> Thanks a lot Dave,
>
> Sorry the original thread is called 'Python and algorithms'.
>
> Yes, I think it's best to use what python provides and build on top of
> that. I got to asking my original question based on trying to learn
> more about algorithms in general, through python. Of late many people
> have been asking me how well I can 'build' algorithms, and this
> prompted me to start the thread. This is for learning purposes (which
> the original thread will give you and indication where I'm coming from).
>
> The refactored code looks like this. I have tackled a couple items.
> First the sub-listing (which I'll wait till I can get the full sort
> working), then the last couple of paragraphs about refinements.
> Starting with the first refinement, I'm not sure how *not* to stop the
> inner loop?
>
> def s2(list_):
>    for pos1 in xrange(len(list_)-1):
>        item1 = list_[pos1]
>        pos2 = pos1 + 1
>        item2 = list_[pos2]
>        if item1 >= item2:
>            list_[pos1], list_[pos2] = list_[pos2], list_[pos1]
>            return True
>
> def mysorter(list_):
>    # This is the outer loop?
>    while s2(list_) is True:
>        # Calling s2 kicks off the inner loop?
>        s2(list_)
>
> if __name__ == '__main__':
>    from random import shuffle
>    foo = range(10)
>    shuffle(foo)
>    mysorter(foo)
>
>
> Thanks again.
>
As before, I'm not actually trying this code, so there may be typos.
But assuming your code here works, the next refinement would be:

In s2() function, add a flag variable, initially False.  Then instead of
the return True, just say flag=True
Then at the end of the function, return flag

About the while loop.  No need to say 'is True'  just use while
s2(list_):  And no need to call s2() a second time.

while s2(list_):
pass

Before you can refine the upper limit, you need a way to preserve it
between calls.  Simplest way to do that is to combine the two functions,
as a nested loop.  Then, instead of flag, you can have a value "limit"
which indicates what index was last swapped.  And the inner loop uses
that as an upper limit on its xrange.

DaveA

```