[Python-Dev] Drastically improving list.sort() for lists of strings/ints

Sun Sep 11 03:45:24 EDT 2016

Hello all,

I am interested in making a non-trivial improvement to list.sort(), but
before I put in the work, I want to test the waters and see if this is
something the community would accept. Basically, I want to implement radix
sort for lists of strings. So list.sort() would detect if it is sorting a
list of strings (which is one of the more common things you sort in python)
and, if so, use in-place radix sort (see https://xlinux.nist.gov/dads/
HTML/americanFlagSort.html). In-place radix sort is significantly faster
for lexicographic sorting than Timsort (or in general any comparison-based
sort, since radix can beat the nlogn barrier). If you don't believe that
last claim, suppose for the sake of the argument that it's true (because if
I actually implemented this I could supply benchmarks to prove it).

The idea is the following: in list.sort(), if using the default comparison
operator, test the type of the first, middle, and last elements (or
something along those lines). If they are all strings, in practice this
means the list is very likely a list of strings, so it's probably worth the
investment to check and see. So we iterate through and see if they really
are all strings (again, in practice it is very unlikely this test would
fail). Luckily, this is very, very nearly free (i.e. no memory cost) since
we have to iterate through anyway as part of the in-place radix sort (first
step is to count how many elements go in each bucket, you iterate through
to count. So we would just put a test in the loop to break if it finds a
non-string).

Additionally, since often one is sorting objects with string or int fields
instead of strings or ints directly, one could check the type of the field
extracted by the key function or something.

Is this something the community would be interested inf? Again, supposing I
benchmark it and show it gives non-trivial improvements on existing
codebases. Like for example I could benchmark the top 10 packages on pypi
and show they run faster using my list.sort().

TL;DR: Should I spend time making list.sort() detect if it is sorting
lexicographically and, if so, use a much faster algorithm?

Thanks for reading,
Elliot
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/python-dev/attachments/20160911/93533d19/attachment.html>