[Python-ideas] a sorting protocol dunder method?

Mon Dec 4 20:06:02 EST 2017

wow! a few time zones (and a day job) really make a difference to taking
part in a discussion :-)

Thanks for all the feedback. From what I can tell, we don't have a
consensus, though It's looking pretty unlikely that this is going to be
adopted (though maybe the decorator idea). But I'm going to go through and
try to summarize what we (I?) have learned, and what does seem to be agreed
upon, or not. If I misinterpret something you said, or take a quote out of
context, please correct it, but trust that I didn't do it on purpose....

Also, it's kind of a pain to do a digest like this and properly attribute
everyone, so mostly I'm not going to attribute the quotes...

So: has this already been brought up and rejected?
>

https://bugs.python.org/issue20632

Thanks Serhiy -- I didn't think to look in the Bug DB. This does remind me
that it would be good to have a place (maybe in a mets-PEP?) to put topics
that have been discussed, but didn't get as far as anyone writing a PEP...

An existence proof: in NLTK, an __lt__ method added purely to facilitate
> consistent sorting (in doctests) of structured data objects for which
> comparison operators do not really make conceptual sense:
> https://github.com/nltk/nltk/pull/1902/files#diff-
> 454368f06fd635b1e06c9bb6d65bd19bR689
> Granted, calling min() and max() on collections of these objects would
> not make conceptual sense either. Still, __sort_key__ would have been
> cleaner than __lt__.

So nice to know I'm not the only one that wants (needs) to provide a sort
order be default -- though it doesn't mean it's not a niche issue anyway.

By default, we sort by the inherent order of the values. But if the
> values have no inherent order (they are unordered), we can sort
> unordered items in a collection by providing an appropriate key
> function. Hence why I say they are loosely related.

<snip>

>  It
> doesn't make sense to put that functionality into the complex numbers
> themselves: complex numbers are unordered, and any order we impose on
> them comes from the collection, not the individual numbers.

Sure -- and that's why we definitely still need a key option for the
sorting functions, and why not all classes should define a sort order. But
for many classes, it does make sense to define a default sort order. Take
something as simple as strings -- they have a default sort order, but a
user very well might want to sort them differently -- say case-insensitive,
or ...

So I think there is consensus here (and no one was proposing otherwise):

 - Defining a sort order is NOT required of all classes
 - The sorting methods definitely need to allow a custom key function.

 the comparison operators
> apply to the values in the collection; the key function applies to the
> collection itself

But many (most) objects DO provide a default sort order, by means of the
comparison methods.  So the idea that objects have a default sort order is
already pretty well accepted :-)

But that assumes that there's only ever one way to order a collection of
> unordered items.

no,. it doesn't -- it implies there is a default way, which is what is done
already for most types. And again, some classes  shouldn't even have a
default.

I'm not sure I understand the motivation to make elements *sortable* but
> not comparable. If an arbitrary order is still useful, I'd think you'd want
> to be able to tell how two particular elements *would* sort by asking a<b.
>

yeah, I'm not sure about this -- going back to the example of complex
numbers, I like
the idea of providing a default sort order but not make the declaration
that this is how the objects SHOULD be compared -- after all, you can
provide your own key function, but not so easily re-define the
comparisons...

> Is sorting-related functionally too special-case to deserve a protocol?
>

> Yes, it is too special-case. I don't see any advantages in comparison with
> defining the __lt__ method. It will be rather confusing if different
> methods of sorting produce inconsistent order. If the first item of the
> sorted list is not the smallest item of the list.

well, yeah, but not any different than the fact that you can inconsitently
define the comparison operators already.

What I wasn't realizing is that you only need to define __lt__ (and __eq__)
to get sortability. Maybe it would be good to document that more
prominently somewhere (if it's documented at all, other than the source).

But the idea of the class decorator looks more sane to me.

yeah, I'm liking that.

Most often when I define equality and comparison dunder methods for a
> custom class, I'm effectively just deferring the comparison to some
> field or tuple of fields of the object.

Exactly -- which may be an argument for a decorator, rather than a dunder
method -- particularly if performance isn't improved by the dunder method.

What if we added a @key_ordering decorator, like @total_ordering but
> using __key__ to generate the comparisons?

This could be a good idea -- just putting it here for the record as it's
mentioned elsewhere.

OK -- multiple posts about performance, I'm going to try to summarize:

> This will add an additional overhead. This will be even slower than
> > passing the key function, since you will need to look up the __key__
> > method in every item.
> That is a reasonable objection.  However, looking up a tp_XXX slot is
> very cheap.
> > Yes, it is too special-case. I don't see any advantages in comparison
> > with defining the __lt__ method.
> There are definitely advantages.  Sorting calls __lt__ for each
> comparison (that is, O(n log n) times) while __key__ would only be
> called once per item at the start (that is, O(n) times).

OK -- so:

if a new slot is added, then the performance hit of __key__ lookup is
minor, but every type now has a new slot, so there is another performance
(and memory) hit for that everywhere -- this is too niche to want to hit
every type.

Any chance that adding a __key__ slot would help with other built-in types
by being faster than checking __lt__ -- probably not.

Conversely, when sorting a key can provide significant performance
> improvements. A single O(n) pass to compute keys, followed by O(n
> log(n)) comparisons could be significantly faster,

snip

>  It can happen, if key(x).__lt__ is sufficiently faster than
> x.__lt__, but that would be unusual.

actually, that's quite common :-) -- but a __sort_key__ method would add n
extra attribute lookups, too. so now it's less likely that it'll be faster
than __lt__

This is "key" :-) -- it is very common for the sort key to be a very simple
type, say int or float, that are very fast to compare -- so this can be a
really performance saver when passing key function.

Also, I recall that someone was working on special-casing the sort code to
do even faster sorting on simple numeric keys -- not sure what came of that
though. But the extra method lookup for every key could overwhelm that
anyway :-(

> And there will be an overhead even in the case when the __key__ method is
not defined.

not good.

I can't think of a way to profile this easily -- we know that having a key
function can be helpful, but that doesn't take into account the extra
method lookup -- maybe a key function that involves a method lookup??

If it defines both, it isn't clear which will be used for sorting.
> Should __lt__ take priority, or __key__? Whichever we choose, somebody
> is going to be upset and confused by the choice.

__sort_key__ would take priority -- that is a no brainer, it's the sort
key, it's used for sorting. And __lt__ giving a different result is no more
surprising, and probably less surprising, than total ordering being
violated in  any other way.

> I disagree with the design of this: it is putting the decision of how to
> sort uncomparable objects in the wrong place, in the object itself
> rather than in the report that wants to sort them.

No -- it's putting the decision about a default sort order in the object
itself -- maybe a niche case, but a real one.

If we have a __key__ method on a class, then the following becomes true:
>     * We can work out which of 2 instances is bigger/smaller using max/min.
>     * We can compare two items by doing a sort.
> So while the *intent* may not be to allow comparisons, that's what you've
> done.

Sure, but Python is for consenting adults -- we allow a lot of things that
aren't encouraged...

I don't think leveraging sort to get a comparison is a problematic use case
at all -- though "min" and "max" is a different matter ... maybe they
shouldn't use __sort_key__? thought that does weaken the whole idea :-)

But why not just define __lt__ and use total_ordering, instead of
> defining two identical decorators that differ only in the name of the
> dunder method they use?

well, one reason is that total_ordering can result in even worse
performance... thought probably not a huge deal.

I never imagined that it would be a required method. Of course it is
> optional. But by adding interpreter support, we're blessing something
> which I think is a misguided design choice as the One Obvious Way.

I don't think it's interpreter support, but rather standard library support
-- it would be changes in the list.sort and sorted, and ... functions, not
any change to the interpreter. (and it would be a change to the language
definition) But your point is that same.

However, I think there is no consensus that it's a misguided design choice
-- having an explicit way to specifically say "this is the default way to
sort these objects", when there is not other reason for total ordering
feels pythonic to me. Again, maybe a niche use case though.

If people like Chris' idea, just add a sortkey() method to your class,
> and then call sorted(collection_of_Spam, key=Spam.sortkey) and it will
> Just Work.

Not a bad approach, though that as useful until/unless it becomes something
of a standard.

(and for the record, if we did add __sort_key__, then on older versions you
could still do:

sorted(collection_of_Spam, key=Spam.__sort_key__)

so too bad on the compatibility front...

The decorator idea LGTM. But it doesn't need the special __key__ method.
> Just pass the key function as a decorator argument.
> This idea was proposed more than 3.5 years ago. Is there a PyPI package
> that implements it?

yes, it was, though not in a terribly public place / way...

I would find it cleaner to express it as a method in the class's body
> ("__key__" or anything else) rather than have to pass a function object.
> Also, it's easier to unit-test if it officially exists as a method...

me too, though couldn't we do both? the decorator, if not passed a sort
function, would look for __key__.

I wondered what the performance would be and tested the following code:

<snip>

it outputs this for my with python 3.6.0
> 10000
> key 0.010628s  10000 calls
>  lt 0.053690s 119886 calls
> It seems that providing a key is ~5 times faster the depending on __lt__.
> (I even used a short circuit to se if __lt__ could beat key).

yeah, this is what I expect for a key function -- exactly why I started all
this expecting a performance gain. But as others' have pointed out, this is
a key function, not a __key__ method that would need to be called each
time. I'm going to try to benchmark (a simulation of) that. This is Barry's
code, with the addition of a "outer_key" function that call the instances
key method:

[I got very similar results as Barry with his version: about 5X faster with
the key function]

def outer_key(item):
    return item.key()

so we get a function lookup each time it's used.

However, I'm confused by the results -- essentially NO Change. That extra
method lookup is coming essentially for free. And this example is using a
tuple as the key, so not the very cheapest possible to sort key.

Did I make a mistake? is that lookup somehow cached?

In [36]: run sort_key_test.py
10000
key       0.012529s  10000 calls
outer_key 0.012139s  10000 calls
lt        0.048057s 119877 calls

each run gives different results, but the lt method is always on order of
5X slower for this size list. Sometimes out_key is faster, mostly a bit
slower, than key.

Also, I tried making a "simpler" __lt__ method:

return (self.value1, self.value2) < (other.value1, other.value2)

but it was bit slower than the previous one -- interesting.

Then I tried a simpler (but probably common) simple attribute sort:

    def __lt__(self, other):
        global lt_calls
        lt_calls += 1

        return self.value1 < other.value1

    def key(self):
        global key_calls
        key_calls += 1

        return self.value1

And that results in about a 6X speedup

In [42]: run sort_key_test.py
10000
key       0.005157s  10000 calls
outer_key 0.007000s  10000 calls
lt        0.041454s 119877 calls
time ratio: 5.922036784741144

And, interestingly (t me!) there is even a performance gain for only  a 10
item list! (1.5X or so, but still)

In fact, this seems to show that having a __key__ method would  often
provide a performance boost, even for small lists.

(still to try to figure out -- how much would the look for __key__ slow
things down when it didn't exist....)

I have got to be doing something wrong here -- I hope some of you smarter
folks will tell me what :-)

Code enclosed.

-- 

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

Chris.Barker at noaa.gov
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/python-ideas/attachments/20171204/cd7cf11f/attachment-0001.html>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: sort_key_test.py
Type: text/x-python-script
Size: 1498 bytes
Desc: not available
URL: <http://mail.python.org/pipermail/python-ideas/attachments/20171204/cd7cf11f/attachment-0001.bin>