on sorting things
Peter Otten
__peter__ at web.de
Fri Dec 20 13:01:34 EST 2019
Eli the Bearded wrote:
> In comp.lang.python, Peter Otten <__peter__ at web.de> wrote:
>> Eli the Bearded wrote:
>>> But what caught my eye most, as someone relatively new to Python but
>>> with long experience in C in Perl, is sorting doesn't take a
>
> s/C in /C and/
>
> Ugh.
>
>>> *comparison* function, it takes a *key generator* function, and that
>>> function is supposed to transform the thing into something that the
>>> native comparison knows how to compare.
>>>
>>> This seems a strange choice, and I'm wondering if someone can explain
>>> the benefits of doing it that way to me.
>>
>> Python 2 started with a comparison function and then grew a key function.
>> With a key function you still have to compare items, you are just
>> breaking the comparison into two steps:
>
> [snip]
>
> Thanks for that good explanation. The benchmark comparison makes it
> very thorough.
>
> In my mind I gravitate towards the complicated sorts of sort that can be
> quickly compared for some sorts of keys and not as quickly for others.
>
> Consider a sort that first compares file size and if the same number of
> bytes, then compares file checksum. Any decently scaled real world
> implementation would memoize the checksum for speed, but only work it out
> for files that do not have a unique file size. The key method requires
> it worked out in advance for everything.
Oscar already mentioned the functools.cmp_to_key decorator which makes this
a non-issue:
def mycmp(a, b): ...
files.sort(key=cmp_to_key(mycmp))
Applied to your example, with memoization:
# untested
def cmp(a, b):
return (a > b)-(a < b)
def make_file_key():
size = functools.lru_cache(None)(getsize)
checksum = functools.lru_cache(None)(getchecksum)
@functools.cmp_to_key
def file_key(a, b):
return cmp(size(a), size(b)) or cmp(checksum(a), checksum(b))
return file_key
files.sort(key=make_file_key())
> But I see the key method handles the memoization under the hood for you,
> so those simpler, more common sorts of sort get an easy to see benefit.
>
> Elijah
> ------
> even memoizing the stat() calls would help for large lists
PS: If you are sorting files by size and checksum as part of a deduplication
effort consider using dict-s instead:
def grouped(items, key):
result = defaultdict(list)
for item in items:
result[key(item)].append(item)
return result
for same_size in grouped(files, key=getsize).values():
if len(same_size) > 1:
for same_checksum in grouped(same_size, key=getchecksum).values():
if len(same_checksum) > 1:
print(same_checksum)
More information about the Python-list
mailing list