Guido rethinking removal of cmp from sort method

Antoon Pardon Antoon.Pardon at rece.vub.ac.be
Thu Mar 31 05:41:36 EDT 2011


On Thu, Mar 31, 2011 at 02:13:53AM +0000, Steven D'Aprano wrote:
> On Wed, 30 Mar 2011 11:06:20 +0200, Antoon Pardon wrote:
> 
> > As far as I can see, key will only produce significant speedups, if
> > comparing items can then be completly done internally in the python
> > engine without referencing user python code.
> 
> Incorrect. You don't even need megabytes of data to see significant 
> differences. How about a mere 1000 short strings?
> 
> 
> [steve at wow-wow ~]$ python2.6
> Python 2.6.6 (r266:84292, Dec 21 2010, 18:12:50)
> [GCC 4.1.2 20070925 (Red Hat 4.1.2-27)] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
> >>> from random import shuffle
> >>> data = ['a'*n for n in range(1000)]
> >>> shuffle(data)
> >>> from timeit import Timer
> >>>
> >>> t_key = Timer('sorted(data, key=lambda a: len(a))',
> ... 'from __main__ import data')
> >>> t_cmp = Timer('sorted(data, cmp=lambda a,b: cmp(len(a), len(b)))',
> ... 'from __main__ import data')
> >>>
> >>> min(t_key.repeat(number=1000, repeat=5))
> 0.89357517051696777
> >>> min(t_cmp.repeat(number=1000, repeat=5))
> 7.6032949066162109
> 
> That's almost ten times slower.

But how does it contradict what I wrote above? Maybe I didn't make
myself clear but in your example, the key function produces ints.
With ints the comparison is completely done within the python engine
without the need to defer to user python code (through the rich
comparison functions). So this is the kind of key I described that
would produce a significant speedup.

But once your key produces user objects that need to be compared
through user python code with __lt__ methods, getting a speedup
through the use of key instead of cmp will not be obvious and
in a number of case you will loose speed.

> Of course, the right way to do that specific sort is:
> >>> t_len = Timer('sorted(data, key=len)', 'from __main__ import data')
> >>> min(t_len.repeat(number=1000, repeat=5))
> 0.64559602737426758
> 
> which is even easier and faster. But even comparing a pure Python key 
> function to the cmp function, it's obvious that cmp is nearly always 
> slower.

I don't find that obvious at all. 

> Frankly, trying to argue that cmp is faster, or nearly as fast, is a 
> losing proposition. In my opinion, the only strategy that has even a 
> faint glimmer of hope is to find a convincing use-case where speed does 
> not matter.

I don't argue such a general statement. I just argue there are cases
where it is and I supporeted that statement with numbers. The only
way these numbers could be disputed was by focussing on specific details
that didn't generalize.

Now maybe providing a cmp_to_key function written in C, can reduce
the overhead for such cases as Raymond Hettinger suggested. If it
turns out that that works out, then I would consider this thread
usefull even if that will be the only result of this thread.

Something else the dev team can consider, is a Negation class.
This class would wrap itself around a class or object but reverse the ordering.
So that we would get Negation('a') > Negation('b'). That would make it
easy to create sort keys in which partial keys had to be sorted differently
from each other. This would have to be done by the dev team since I
guess that writing such a thing in Python would loose all the speed
of using a key-function.

> Or, an alternative approach would be for one of the cmp-supporters to 
> take the code for Python's sort routine, and implement your own sort-with-
> cmp (in C, of course, a pure Python solution will likely be unusable) and 
> offer it as a download. For anyone who knows how to do C extensions, this 
> shouldn't be hard: just grab the code in Python 2.7 and make it a stand-
> alone function that can be imported. 
> 
> If you get lots of community interest in this, that is a good sign that 
> the solution is useful and practical, and then you can push to have it 
> included in the standard library or even as a built-in.
> 
> And if not, well, at least you will be able to continue using cmp in your 
> own code.

I'll first let Raymond Hettinger rewrite cmp_to_key in C, as I understand he
suggested, and reevaluate after that.

-- 
Antoon Pardon



More information about the Python-list mailing list