list.sort, was Re: [Python-Dev] decorate-sort-undecorate
Tim Peters
tim.one at comcast.net
Sat Nov 15 18:32:30 EST 2003
[Armin Rigo]
>>>> from heapq import *
>>>> def isorted(iterable):
>>>> heap = list(iterable)
>>>> heapify(heap)
>>>> while heap:
>>>> yield heappop(heap)
[Raymond Hettinger]
>>> In terms of memory, I think list.sort() always beats the above
>>> implementation.
[Tim]
>> That can't be -- the heap method only requires a fixed (independent
>> of N) and small amount of working storage. list.sort() may need to
>> allocate O(N) additional temp bytes under the covers (to create a
>> working area for doing merges; it can be expected to allocate 2*N
>> temp bytes for a random array of len N, which is its worst case; if
>> there's a lot of pre-existing order in the input array, it can
>> sometimes get away without allocating any temp space).
[Raymond]
> The isorted() generator shown above operates on a copy of the data
> while list.sort() works in-place. So, my take on it was the
> isorted() always used 2*N while list.sort() used 2*N only in the
> worst case.
Ah. But that's comparing apples and donkeys: Armin's example works on any
iterable, while list.sort() only works on lists. I assumed that by
"list.sort()" you meant "the obvious method *based* on list.sort() also
accepting any iterable", i.e.,
def isorted(iterable):
copy = list(iterable)
copy.sort()
for x in copy:
yield x
Then it's got all the space overhead of the list copy in Armin's version,
plus the additional hidden temp memory allocated by sort.
Something to note: most applications that only want the "first N" or "last
N" values in sorted order know N in advance, and that's highly exploitable.
David Eppstein and I had a long thread about that here a while back. The
example of implementing an "N-best queue" in the heapq test suite is a much
better use of heaps when N is known, accepting an iterable directly (without
turning it into a list first), and using storage for only N items. When N
is (as is typical) much smaller than the total number of elements, that
method can beat the pants off list.sort() even with the Python
implementation of heaps. Indeed, Guido and I used that method for
production code in Zope's full-text search subsystem (find the N best
matches to a search query over some 10-200K documents).
David presented a method that ran even faster, provided it was coded just
right, based on doing quicksort-like partitioning steps on a buffer of about
3*N values. That also uses total space proportional to N (independent of
the total number of incoming elements). A heap-based N-best queue would
probably beat that again now that heaps are implemented in C. OTOH, if we
implemented a quicksort-like partitioning routine in C too ... (it also
suffers from gobs of fiddly little integer arithmetic and simple array
indexing, which screams in C).
More information about the Python-Dev
mailing list