Selecting k smallest or largest elements from a large list in python; (benchmarking)

Thu Sep 2 02:59:59 EDT 2010

Dmitry Chichkov wrote:

> Given: a large list (10,000,000) of floating point numbers;
> Task: fastest python code that finds k (small, e.g. 10) smallest
> items, preferably with item indexes;
> Limitations: in python, using only standard libraries (numpy & scipy
> is Ok);
> 
> I've tried several methods. With N = 10,000,000, K = 10 The fastest so
> far (without item indexes) was pure python implementation
> nsmallest_slott_bisect (using bisect/insert). And with indexes
> nargsmallest_numpy_argmin (argmin() in the numpy array k times).
> 
> Anyone up to the challenge beating my code with some clever selection
> algorithm?

If you don't care about stability, i. e. whether nlargest(2, [1, 2, 2.0]) 
returns [1, 2] or [1, 2.0], use

_heapq.nlargest() directly

$ python nsmallest_perf2.py
nsmallest              --> 0.142784833908
nsmallest_slott_bisect --> 0.19291305542
$ cat nsmallest_perf2.py
from random import randint, random
import time
from bisect    import insort
from itertools import islice
import _heapq

_funcs = []
def timeit(f):
    _funcs.append(f)

def time_all(*args):
    funcs = _funcs
    width = max(len(f.__name__) for f in funcs)
    prev = None
    for f in funcs:
        start = time.time()
        result = f(*args)
        end = time.time()
        print "%-*s --> %10s" % (width, f.__name__, end - start)
        if prev is None:
            prev = result
        else:
            assert prev == result

timeit(_heapq.nsmallest)

@timeit
def nsmallest_slott_bisect(n, iterable, insort=insort):
    it   = iter(iterable)
    mins = sorted(islice(it, n))
    for el in it:
        if el <= mins[-1]: #NOTE: equal sign is to preserve duplicates
            insort(mins, el)
            mins.pop()
    return mins

test_data = [randint(10, 50) + random() for i in range(10**6)]
K = 10

time_all(K, test_data)

Peter