[Python-ideas] Exploiting type-homogeneity in list.sort() (again!)

Sun Mar 5 02:19:43 EST 2017

(Summary of results: my patch at https://bugs.python.org/issue28685 makes
list.sort() 30-50% faster in common cases, and at most 1.5% slower in the
uncommon worst case.)

Hello all,

You may remember seeing some messages on here about optimizing list.sort()
by exploiting type-homogeneity: since comparing apples and oranges is
uncommon (though possible, i.e. float to int), it pays off to check if the
list is type-homogeneous (as well as homogeneous with respect to some other
properties), and, if it is, to replace calls to PyObject_RichCompareBool
with calls to ob_type->tp_richcompare (or, in common special cases, to
optimized compare functions). The entire patch is less than 250 lines of
code, most of which is pretty boilerplate (i.e. a lot of assertions in
#ifdef Py_DEBUG blocks, etc).

I originally wrote that patch back in November. I've learned a lot since
then, both about CPython and about mailing list etiquette :). Previous
discussion about this can be found at
https://mail.python.org/pipermail/python-dev/2016-October/146648.html and
https://mail.python.org/pipermail/python-ideas/2016-October/042807.html.

Anyway, I recently redid the benchmarks much more rigorously (in
preparation for presenting this project at my school's science fair),
achieving a standard deviation of less than 0.5% of the mean for all
measurements. The exact benchmark script used can be found at
https://github.com/embg/python-fastsort-benchmark (it's just sorting random
lists of/lists of tuples of [type]. While listsort.txt talks about
benchmarking different kinds of structured lists, instead of just random
lists, the results here would hold in those cases just as well, because
this makes individual comparisons cheaper, instead of reducing the number
of comparisons based on structure).

I also made a poster describing the optimization and including a pretty
graph displaying the benchmark data:
https://github.com/embg/python-fastsort-benchmark/blob/master/poster.pdf.
For those who would rather read the results here (though it is a *really*
pretty graph):

***
Percent improvement for sorting random lists of [type]
(1-patched/unpatched):
float: 48%
bounded int (magnitude smaller than 2^32): 48.4%
latin string (all characters in [0,255]): 32.7%
general int (reasonably uncommon?): 17.2%
general string (reasonably uncommon?): 9.2%
tuples of float: 63.2%
tuples of bounded int: 64.8%
tuples of latin string: 55.8%
tuples of general int: 50.3%
tuples of general string: 44.1%
tuples of heterogeneous: 41.5%
heterogeneous (lots of float with an int at the end; worst-case): -1.5%
***

Essentially, it's a gamble where the payoff is 20-30 times greater than the
cost, and the odds of losing are very small. Sorting is perhaps not a
bottleneck in most applications, but considering how much work has gone
into Python's sort (Timsort, etc; half of listobject.c is sort code), I
think it's interesting that list.sort() can be made essentially twice
faster by a relatively simple optimization. I would also add that Python
dictionaries already implement this optimization: they start out optimizing
based on the assumption that they'll only be seeing string keys, checking
to make sure that assumption holds as they go. If they see a non-string
key, they permanently switch over to the general implementation. So it's
really the same idea, except here it doesn't matter as much what type we're
dealing with, which is important, because lists are commonly used with lots
of different types, as opposed to dictionaries, which overwhelmingly
commonly use string keys, especially internally. (Correct me if I'm wrong
in any of the above).

I got a lot of great feedback/input on this patch as I was writing it, but
after submitting it, I didn't hear much from anybody. (The reason I took so
long to post was because I wanted to wait until I had the chance to do the
benchmarks *right*). What do you all think?

Thanks,
Elliot
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/python-ideas/attachments/20170305/1dea7f66/attachment.html>