on sorting things
Peter Otten
__peter__ at web.de
Thu Dec 19 03:46:51 EST 2019
Eli the Bearded wrote:
> I recently saw a link to an old post on a blog and then started looking
> at the newer posts. This one:
>
> https://leancrew.com/all-this/2019/11/the-key-to-sorting-in-python/
>
> discusses ways to deal with useful sorting of movie / television show
> titles. Some initial words should be re-ordered for sorting purposes
> (_Adventures of Huckleberry Finn, The_), roman numbers should sort like
> regular numbers (_Rocky V_ comes before _Rocky IX_), and something needs
> to be done about sorting accented vowels.
>
> But what caught my eye most, as someone relatively new to Python but
> with long experience in C in Perl, is sorting doesn't take a
> *comparison* function, it takes a *key generator* function, and that
> function is supposed to transform the thing into something that the
> native comparison knows how to compare.
>
> This seems a strange choice, and I'm wondering if someone can explain
> the benefits of doing it that way to me.
Python 2 started with a comparison function and then grew a key function.
With a key function you still have to compare items, you are just breaking
the comparison into two steps:
- Calculate the key. This can be slow as it has to be done once per item.
- Compare the items. The key is usually a string or tuple, and for those
the comparison runs C code which is a bit faster. The gain is even greater
as the comparison is performed much more often.
The disadvantages of the key-based approach are that some comparison
functions cannot naturally be rewritten in this way and that memory usage is
a bit higher. Both are typically "expert" problems, so the developers
decided to "nudge" users to consider keys first.
Now take a look at this example written in Python 2:
import random
import time
import contextlib
@contextlib.contextmanager
def measure(message):
start = time.time()
yield
span = time.time() - start
print message.format(span)
with open("/usr/share/dict/words") as f:
words = f.read().splitlines()
random.shuffle(words)
def key_lower(a):
global key_calls
key_calls += 1
return a.lower()
def cmp_lower(a, b):
global cmp_calls
cmp_calls += 1
return cmp(a.lower(), b.lower())
key_calls = 0
cmp_calls = 0
with measure("cmp-based sort took {} secs"):
words[:].sort(cmp_lower)
with measure("key-bases sort took {} secs"):
words[:].sort(key=key_lower)
print "cmp called", cmp_calls, "times"
print "key called", key_calls, "times"
$ python2 sort_demo.py
cmp-based sort took 1.50221800804 secs
key-bases sort took 0.293763160706 secs
cmp called 1516111 times
key called 99171 times
More information about the Python-list
mailing list