[Tutor] sorting algorithm
C.T. Matsumoto
c.t.matsumoto at gmail.com
Fri Mar 12 13:20:55 CET 2010
Dave Angel wrote:
> C.T. Matsumoto wrote:
>> Dave Angel wrote:
>>> (You forgot to do a Reply-All, so your message went to just me,
>>> rather than to me and the list )
>>>
>>>
>>> C.T. Matsumoto wrote:
>>>> Dave Angel wrote:
>>>>> C.T. Matsumoto wrote:
>>>>>> Hello,
>>>>>>
>>>>>> This is follow up on a question I had about algorithms. In the
>>>>>> thread it was suggested I make my own sorting algorithm.
>>>>>>
>>>>>> Here are my results.
>>>>>>
>>>>>> #!/usr/bin/python
>>>>>>
>>>>>> def sort_(list_):
>>>>>> for item1 in list_:
>>>>>> pos1 = list_.index(item1)
>>>>>> pos2 = pos1 + 1
>>>>>> try:
>>>>>> item2 = list_[pos2]
>>>>>> except IndexError:
>>>>>> pass
>>>>>>
>>>>>> if item1 >= item2:
>>>>>> try:
>>>>>> list_.pop(pos2)
>>>>>> list_.insert(pos1, item2)
>>>>>> return True
>>>>>> except IndexError:
>>>>>> pass
>>>>>>
>>>>>> def mysorter(list_):
>>>>>> while sort_(list_) is True:
>>>>>> sort_(list_)
>>>>>>
>>>>>> I found this to be a great exercise. In doing the exercise, I got
>>>>>> pretty stuck. I consulted another programmer (my dad) who
>>>>>> described how to go about sorting. As it turned out the
>>>>>> description he described was the Bubble sort algorithm. Since
>>>>>> coding the solution I know the Bubble sort is inefficient because
>>>>>> of repeated iterations over the entire list. This shed light on
>>>>>> the quick sort algorithm which I'd like to have a go at.
>>>>>>
>>>>>> Something I haven't tried is sticking in really large lists. I was
>>>>>> told that with really large list you break down the input list
>>>>>> into smaller lists. Sort each list, then go back and use the same
>>>>>> swapping procedure for each of the different lists. My question
>>>>>> is, at what point to you start breaking things up? Is that based
>>>>>> on list elements or is it based on memory(?) resources python is
>>>>>> using?
>>>>>>
>>>>>> One thing I'm not pleased about is the while loop and I'd like to
>>>>>> replace it with a for loop.
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> T
>>>>>>
>>>>>>
>>>>> There are lots of references on the web about Quicksort, including
>>>>> a video at:
>>>>> http://www.youtube.com/watch?v=y_G9BkAm6B8
>>>>>
>>>>> which I think illustrates it pretty well. It would be a great
>>>>> learning exercise to implement Python code directly from that
>>>>> description, without using the sample C++ code available.
>>>>>
>>>>> (Incidentally, there are lots of variants of Quicksort, so I'm not
>>>>> going to quibble about whether this is the "right" one to be called
>>>>> that.)
>>>>>
>>>>> I don't know what your earlier thread was, since you don't mention
>>>>> the subject line, but there are a number of possible reasons you
>>>>> might not have wanted to use the built-in sort. The best one is
>>>>> for educational purposes. I've done my own sort for various
>>>>> reasons in the past, even though I had a library function, since
>>>>> the library function had some limits. One time I recall, the
>>>>> situation was that the library sort was limited to 64k of total
>>>>> data, and I had to work with much larger arrays (this was in 16bit
>>>>> C++, in "large" model). I solved the size problem by using the
>>>>> C++ sort library on 16k subsets (because a pointer was 2*2 bytes).
>>>>> Then I merged the results of the sorts. At the time, and in the
>>>>> circumstances involved, there were seldom more than a dozen or so
>>>>> sublists to merge, so this approach worked well enough.
>>>>>
>>>>> Generally, it's better for both your development time and the
>>>>> efficiency and reliabilty of the end code, to base a new sort
>>>>> mechanism on the existing one. In my case above, I was replacing
>>>>> what amounted to an insertion sort, and achieved a 50* improvement
>>>>> for a real customer. It was fast enough that other factors
>>>>> completely dominated his running time.
>>>>>
>>>>> But for learning purposes? Great plan. So now I'll respond to
>>>>> your other questions, and comment on your present algorithm.
>>>>>
>>>>> It would be useful to understand about algorithmic complexity, the
>>>>> so called Order Function. In a bubble sort, if you double the size
>>>>> of the array, you quadruple the number of comparisons and swaps.
>>>>> It's order N-squared or O(n*n). So what works well for an array
>>>>> of size 10 might take a very long time for an array of size 10000
>>>>> (like a million times as long). You can do much better by sorting
>>>>> smaller lists, and then combining them together. Such an algorithm
>>>>> can be O(n*log(n)).
>>>>>
>>>>>
>>>>> You ask at what point you consider sublists? In a language like C,
>>>>> the answer is when the list is size 3 or more. For anything larger
>>>>> than 2, you divide into sublists, and work on them.
>>>>>
>>>>> Now, if I may comment on your code. You're modifying a list while
>>>>> you're iterating through it in a for loop. In the most general
>>>>> case, that's undefined. I think it's safe in this case, but I
>>>>> would avoid it anyway, by just using xrange(len(list_)-1) to
>>>>> iterate through it. You use the index function to find something
>>>>> you would already know -- the index function is slow. And the
>>>>> first try/except isn't needed if you use a -1 in the xrange
>>>>> argument, as I do above.
>>>>>
>>>>> You use pop() and push() to exchange two adjacent items in the
>>>>> list. Both operations copy the remainder of the list, so they're
>>>>> rather slow. Since you're exchanging two items in the list, you
>>>>> can simply do that:
>>>>> list[pos1], list[pos2] = list[pos2], list[pos1]
>>>>>
>>>>> That also eliminates the need for the second try/except.
>>>>>
>>>>> You mention being bothered by the while loop. You could replace it
>>>>> with a simple for loop with xrange(len(list_)), since you know that
>>>>> N passes will always be enough. But if the list is partially
>>>>> sorted, your present scheme will end sooner. And if it's fully
>>>>> sorted, it'll only take one pass over the data.
>>>>>
>>>>> There are many refinements you could do. For example, you don't
>>>>> have to stop the inner loop after the first swap. You could finish
>>>>> the buffer, swapping any other pairs that are out of order. You'd
>>>>> then be saving a flag indicating if you did any swaps. You could
>>>>> keep a index pointing to the last pair you swapped on the previous
>>>>> pass, and use that for a limit next time. Then you just terminate
>>>>> the outer loop when that limit value is 1. You could even keep two
>>>>> limit values, and bubble back and forth between them, as they
>>>>> gradually close into the median of the list. You quit when they
>>>>> collide in the middle.
>>>>>
>>>>> The resultant function should be much faster for medium-sized
>>>>> lists, but it still will slow down quadratically as the list size
>>>>> increases. You still need to divide and conquer, and quicksort is
>>>>> just one way of doing that.
>>>>>
>>>>> DaveA
>>>>>
>>>> Thanks a lot Dave,
>>>>
>>>> Sorry the original thread is called 'Python and algorithms'.
>>>>
>>>> Yes, I think it's best to use what python provides and build on top
>>>> of that. I got to asking my original question based on trying to
>>>> learn more about algorithms in general, through python. Of late many
>>>> people have been asking me how well I can 'build' algorithms, and
>>>> this prompted me to start the thread. This is for learning purposes
>>>> (which the original thread will give you and indication where I'm
>>>> coming from).
>>>>
>>>> The refactored code looks like this. I have tackled a couple items.
>>>> First the sub-listing (which I'll wait till I can get the full sort
>>>> working), then the last couple of paragraphs about refinements.
>>>> Starting with the first refinement, I'm not sure how *not* to stop
>>>> the inner loop?
>>>>
>>>> def s2(list_):
>>>> for pos1 in xrange(len(list_)-1):
>>>> item1 = list_[pos1]
>>>> pos2 = pos1 + 1
>>>> item2 = list_[pos2]
>>>> if item1 >= item2:
>>>> list_[pos1], list_[pos2] = list_[pos2], list_[pos1]
>>>> return True
>>>>
>>>> def mysorter(list_):
>>>> # This is the outer loop?
>>>> while s2(list_) is True:
>>>> # Calling s2 kicks off the inner loop?
>>>> s2(list_)
>>>>
>>>> if __name__ == '__main__':
>>>> from random import shuffle
>>>> foo = range(10)
>>>> shuffle(foo)
>>>> mysorter(foo)
>>>>
>>>>
>>>> Thanks again.
>>>>
>>> As before, I'm not actually trying this code, so there may be typos.
>>> But assuming your code here works, the next refinement would be:
>>>
>>> In s2() function, add a flag variable, initially False. Then instead
>>> of the return True, just say flag=True
>>> Then at the end of the function, return flag
>>>
>>> About the while loop. No need to say 'is True' just use while
>>> s2(list_): And no need to call s2() a second time.
>>>
>>> while s2(list_):
>>> pass
>> Okay up to here I follow. This all makes sense.
>>
>> def s2(list_):
>> flag = False
>> for pos1 in xrange(len(list_)-1):
>> item1 = list_[pos1]
>> pos2 = pos1 + 1
>> item2 = list_[pos2]
>> if item1 >= item2:
>> list_[pos1], list_[pos2] = list_[pos2], list_[pos1]
>> flag = True return flag
>>
>> def mysorter(list_):
>> while s2(list_):
>> pass
>>>
>>> Before you can refine the upper limit, you need a way to preserve it
>>> between calls. Simplest way to do that is to combine the two
>>> functions, as a nested loop. Then, instead of flag, you can have a
>>> value "limit" which indicates what index was last swapped. And the
>>> inner loop uses that as an upper limit on its xrange.
>> Where I start to get confused is refining the 'upper limit'. What is
>> the upper limit defining? I'm guessing it is the last position processed.
>>
>> T
>>
> The upper limit is the index of the last swap you had to make. That'd
> be either pos1 or pos2, I forget which. Anyway, the inner loop goes to
> that limit, instead of to the len(_list) The idea is that anything
> beyond that point is already sorted, and already contains the highest
> elements, so you don't need to visit it again.
>
> This is all from memory, so I think I'm right, but not positive. It's
> been many years since I had to actually code a sort loop by hand.
> Probably 20 or so.
>
> DaveA
>
>
Alright, I got the idea of the upper limit and feeding it to xrange,
which can take a start argument. The inner loop code looks like this:
while <not sure>:
for pos1 in xrange(limit, len(list_)-1):
pos2 = pos1 + 1
item1 = list_[pos1]
item2 = list_[pos2]
if item1 > item2:
list_[pos1], list_[pos2] = list_[pos2], list_[pos1]
limit = pos2
return limit
what is the initial value of limit? and what does while use as a truth
value?
BTW thanks for looking back, much appreciated.
T
More information about the Tutor
mailing list