Selecting k smallest or largest elements from a large list in python; (benchmarking)

Arnaud Delobelle arnodel at googlemail.com
Thu Sep 2 09:47:57 CEST 2010


On Sep 2, 7:59 am, Peter Otten <__pete... at web.de> wrote:
> Dmitry Chichkov wrote:
> > Given: a large list (10,000,000) of floating point numbers;
> > Task: fastest python code that finds k (small, e.g. 10) smallest
> > items, preferably with item indexes;
> > Limitations: in python, using only standard libraries (numpy & scipy
> > is Ok);
>
> > I've tried several methods. With N = 10,000,000, K = 10 The fastest so
> > far (without item indexes) was pure python implementation
> > nsmallest_slott_bisect (using bisect/insert). And with indexes
> > nargsmallest_numpy_argmin (argmin() in the numpy array k times).
>
> > Anyone up to the challenge beating my code with some clever selection
> > algorithm?
>
> If you don't care about stability, i. e. whether nlargest(2, [1, 2, 2.0])
> returns [1, 2] or [1, 2.0], use
>
> _heapq.nlargest() directly
>
> $ python nsmallest_perf2.py
> nsmallest              --> 0.142784833908
> nsmallest_slott_bisect --> 0.19291305542
> $ cat nsmallest_perf2.py
> from random import randint, random
> import time
> from bisect    import insort
> from itertools import islice
> import _heapq
>
> _funcs = []
> def timeit(f):
>     _funcs.append(f)
>
> def time_all(*args):
>     funcs = _funcs
>     width = max(len(f.__name__) for f in funcs)
>     prev = None
>     for f in funcs:
>         start = time.time()
>         result = f(*args)
>         end = time.time()
>         print "%-*s --> %10s" % (width, f.__name__, end - start)
>         if prev is None:
>             prev = result
>         else:
>             assert prev == result
>
> timeit(_heapq.nsmallest)
>
> @timeit
> def nsmallest_slott_bisect(n, iterable, insort=insort):
>     it   = iter(iterable)
>     mins = sorted(islice(it, n))
>     for el in it:
>         if el <= mins[-1]: #NOTE: equal sign is to preserve duplicates
>             insort(mins, el)
>             mins.pop()
>     return mins
>
> test_data = [randint(10, 50) + random() for i in range(10**6)]
> K = 10
>
> time_all(K, test_data)
>
> Peter

I get:

1.46s for _heapq.nsmallest
0.85s for nsmallest_slott_bisect2 (version I posted)

I am a bit surprised that mine is so slow compared with yours.  I'll
do more tests later!
--
Arnaud



More information about the Python-list mailing list