Ordering python sets

Steven D'Aprano steve at REMOVE-THIS-cybersource.com.au
Tue Oct 28 09:40:41 EDT 2008


On Mon, 27 Oct 2008 17:03:58 -0700, Glenn Linderman wrote:

>>> A little harder question is how to create a key that corresponds to
>>> ascending string followed by descending string?
>>>
>>> To do that you can sort the data two times, relying on the stable
>>> nature of the Python sort.
>>>     
>>>     
> Ick. Costs twice as much.

But that's only a constant cost. For small amounts of data, it's trivial, 
and for large amounts, it's dominated by the cost of sorting N items, 
which is O(N log N) by memory. (Python's sort is a modified heap sort, I 
believe.)


[...]
> This solution seems to be the solution I was looking for.  I think "Neg"
> is a poor name, but all the good names are significantly longer... Given
> its usage, though, in a sorted call, "Rev" would be a better poor name!
> ReversedCompare might be a better long name.
> 
> How can I suggest adding it to the documentation for Python 3.0 ?  It is
> much preferable to the "obvious" solution of sorting twice, and doing
> that is a trap which many people might fall into.  I posted here
> instead, and thank you for the solution.


Why do you call it a trap? You're preferred solution is significantly 
slower for small amounts of data than the solution you call a trap:

class Neg(object):
    def __init__(self, value):
        self.value = value
    def __cmp__(self, other):
        return -cmp(self.value, other.value)

def sort_updown1(pairs):
    return sorted(pairs, key=lambda (s, t): (s, Neg(t)))

import operator
def sort_updown2(pairs):
    pairs = pairs[:]
    pairs.sort(key=operator.itemgetter(1), reverse=True)
    pairs.sort(key=operator.itemgetter(0))
    return pairs

pairs = [('al', 'boo'), ('zul', 'ao'), ('al', 'den'), ('zul', 'o')]
from timeit import Timer
t1 = Timer("sort_updown1(pairs)", 
    "from __main__ import sort_updown1, pairs")
t2 = Timer("sort_updown2(pairs)", 
    "from __main__ import sort_updown2, pairs")


And the results:

>>> t1.repeat(number=1000)
[0.053807973861694336, 0.053076028823852539, 0.052966117858886719]
>>> t2.repeat(number=1000)
[0.022480964660644531, 0.022369861602783203, 0.02264404296875]

The difference is even more significant for larger amounts of data:

>>> pairs = pairs*100
>>> t1.repeat(number=1000)
[5.3583409786224365, 4.6985390186309814, 4.6953370571136475]
>>> t2.repeat(number=1000)
[1.1064291000366211, 1.0017910003662109, 0.48421788215637207]

And more significant still as we sort even more data:

>>> pairs = pairs*10
>>> t1.repeat(number=1000)
[53.255411863327026, 53.792617082595825, 53.549386024475098]
>>> t2.repeat(number=1000)
[5.8788919448852539, 5.0701780319213867, 4.8528299331665039]


Or, tabulating the results:

N  = 4      400    4000
------------------------
t1 = 0.05   4.6    53.3
t2 = 0.02   0.5     4.9

As you can see, the supposed "trap" of sorting twice is significantly 
faster, by an order of magnitude, than sorting it once. It would be even 
faster if I didn't bother making a copy of the list before sorting.



-- 
Steven



More information about the Python-list mailing list