efficient partial sort in Python ?
Dan Stromberg
drsalists at gmail.com
Tue Aug 19 19:05:51 EDT 2014
On Tue, Aug 19, 2014 at 12:37 PM, Chiu Hsiang Hsu <wdv4758h at gmail.com> wrote:
> On Tuesday, August 19, 2014 5:42:27 AM UTC+8, Dan Stromberg wrote:
>> On Mon, Aug 18, 2014 at 10:18 AM, Chiu Hsiang Hsu <wdv4758h at gmail.com> wrote:
>> > I know that Python uses Timsort as its default sorting algorithm and it is efficient,
>> > but I just want a partial sort (the n largest/smallest elements).
>>
>> Perhaps heapq with PyPy? Or with nuitka? Or with numba?
> Another problem with heapq is memory usage: in CPython it costs a lot more memory than sorted (I tested it in 3.4 with 1000000 floats).
This surprises me. I believe heapq keeps its values in a plain Python
list with no extra references: with zero-based indexing, node i's left
child and right child are list elements 2*i+1 and 2*i+2, respectively.
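To illustrate the layout, here is a minimal sketch that checks the
invariant given in the heapq documentation (heap[k] <= heap[2*k+1] and
heap[k] <= heap[2*k+2], counting from zero) on an ordinary list. The
helper name satisfies_heap_invariant is made up for this example and is
not part of heapq.

```python
import heapq
import random

values = [random.random() for _ in range(1000)]

# heapify rearranges the list in place; the heap is still a plain list.
heap = list(values)
heapq.heapify(heap)

def satisfies_heap_invariant(h):
    """Hypothetical helper: True if every parent is <= each of its children."""
    n = len(h)
    for k in range(n):
        for child in (2 * k + 1, 2 * k + 2):
            if child < n and h[k] > h[child]:
                return False
    return True

print(type(heap).__name__, satisfies_heap_invariant(heap))
```

Since the heap is just the list itself, any extra memory would have to
come from how the heap is filled rather than from the structure.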
A heap of some sort is probably best algorithmically: keeping the best
n of m values costs O(m log n), against O(m log m) for a full sort.
You're probably just up against a high constant factor. On the other
hand, there are many kinds of heaps.
> Out of curiosity, there are many speed-up solutions for Python (like Cython and PyPy). I haven't used Cython before;
> I guess PyPy is a more convenient way to speed up existing Python code (?),
> so how does Cython compare to PyPy? (speed, code, flexibility, or anything else)
PyPy is really fast for CPU-intensive workloads, but CPython is better for I/O.
I tested a single CPU-intensive microbenchmark of Cython and PyPy
(also Jython and CPython). PyPy was fastest
(http://stromberg.dnsalias.org/~strombrg/backshift/documentation/performance/index.html).
I haven't yet compared numba or nuitka or Shedskin.
When you use heapq, are you putting all the values in the heap, or
just up to n at a time (evicting the worst value, one at a time as you
go)? If you're doing the former, it's basically a heapsort which
probably won't beat timsort. If you're doing the latter, that should
be pretty good.
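A minimal sketch of the second approach, assuming the goal is the n
largest values: keep a min-heap of at most n items and evict its
smallest element whenever a better value arrives. The function name
n_largest_bounded is made up for this example.

```python
import heapq
import random

def n_largest_bounded(iterable, n):
    """Return the n largest items in descending order.

    The heap never holds more than n items, so extra memory is O(n).
    """
    if n <= 0:
        return []
    heap = []  # min-heap: heap[0] is the worst of the current best n
    for x in iterable:
        if len(heap) < n:
            heapq.heappush(heap, x)
        elif x > heap[0]:
            # Pop the smallest kept value and push x in a single step.
            heapq.heapreplace(heap, x)
    return sorted(heap, reverse=True)

data = [random.random() for _ in range(100000)]
top = n_largest_bounded(data, 10)
print(top)
```

The standard library's heapq.nlargest(10, data) returns the same result
and is probably the simpler choice in practice.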