[Python-Dev] Drastically improving list.sort() for lists of strings/ints

Sun Sep 11 12:14:06 EDT 2016

On Sun, Sep 11, 2016 at 5:45 PM, Elliot Gorokhovsky
<elgo8537 at colorado.edu> wrote:
> I am interested in making a non-trivial improvement to list.sort(), but
> before I put in the work, I want to test the waters and see if this is
> something the community would accept. Basically, I want to implement radix
> sort for lists of strings. So list.sort() would detect if it is sorting a
> list of strings (which is one of the more common things you sort in python)
> and, if so, use in-place radix sort (see
> https://xlinux.nist.gov/dads/HTML/americanFlagSort.html). In-place radix
> sort is significantly faster for lexicographic sorting than Timsort (or in
> general any comparison-based sort, since radix can beat the nlogn barrier).
> If you don't believe that last claim, suppose for the sake of the argument
> that it's true (because if I actually implemented this I could supply
> benchmarks to prove it).

I'd like to see these benchmarks, actually. Sounds interesting. How
does it fare on different distributions of data - for instance,
strings consisting exclusively of ASCII digits and punctuation eg
"01:12:35,726 --> 01:12:36,810", or strings consisting primarily of
ASCII but with occasional BMP or astral characters, or strings
primarily of Cyrillic text, etc? What if every single string begins
with a slash character (eg if you're sorting a bunch of path names)?
At what list size does it become visibly faster?

Could this be put onto PyPI and then benchmarked as lst.sort() vs
flagsort(lst) ?

ChrisA