I brought up heapq last week, but there was only brief discussion before the issue got sidetracked into a discussion of FIFO queues. I'd like to revisit heapq. The two people who responded seemed to agree that the existing heapq interface was lacking, and this seemed to be the sentiment many months ago when heapq was added.
I'll summarize some of the heap interfaces that have been proposed:
- the heapq currently in CVS: - Provides functions to manipulate a list organized as a binary heap - Advantages: - Internal binary heap structure is transparent to user, useful for educational purposes - Low overhead - Already in CVS
- My MinHeap/MaxHeap classes: - Provides a class with heap access routines, using a list internally - Advantages: - Implementation is opaque, so it can be replaced later with Fibonacci heaps or Paired heaps without breaking user programs - Provides an adjust_key() command needed by some applications (e.g. Dijkstra's Algorithm)
- David Eppstein's priorityDictionary class: - Provides a class with a dictionary-style interface (ex: heap['cat'] = 5 would give 'cat' a priority of 5 in the heap) - Advantages: - Implementation is opaque, so it can be replaced later with Fibonacci heaps or Paired heaps without breaking user programs - A dictionary interface may be more intuitive for certain applications - Limitation: - Objects with the same value may only have a single instance in the heap.
I'd very much like to see the current heapq replaced with a different interface in time for 2.3. I believe that an opaque object is better, since it allows more flexibility later. If the current heapq is released, user program will start to use it, and then it will be much more difficult to switch to a different heap algorithm later, should that become desirable. Also, decrease-key is an important feature that many users will expect from a heap; this operation is notably missing from heapq.
I'm willing to do whatever work is necessary to get a more flexible heap interface into 2.3. If the consensus prefers my MinHeap (or something similar), I'll gladly write documentation (and have already written rather brutal tests).
Somebody with authority, just tell me where to pour my energy in this matter :)
-- Agthorr
I'd very much like to see the current heapq replaced with a different interface in time for 2.3. I believe that an opaque object is better, since it allows more flexibility later.
I'm quite pleased with the version already in CVS. It is a small masterpiece of exposition, sophistication, simplicity, and speed. A class based interface is not necessary for every algorithm.
For the other approaches, what might be useful is to define an API and leave it that. The various implementations can be maintained on Cookbook pages, the Vaults of Parnassus, or an SF project.
The min/max heap and fibonacci heaps are a great idea. Nice work.
Raymond Hettinger
In article 001901c30aab$bf31c060$b6b8958d@oemcomputer, "Raymond Hettinger" python@rcn.com wrote:
I'd very much like to see the current heapq replaced with a different interface in time for 2.3. I believe that an opaque object is better, since it allows more flexibility later.
I'm quite pleased with the version already in CVS. It is a small masterpiece of exposition, sophistication, simplicity, and speed. A class based interface is not necessary for every algorithm.
It has some elegance, but omits basic operations that are necessary for many heap-based algorithms and are not provided by this interface. Specifically, the three algorithms that use heaps in my upper-division undergraduate algorithms classes are heapsort (for which heapq works fine, but you would generally want to use L.sort() instead), Dijkstra's algorithm (and its relatives such as A* and Prim), which needs the ability to decrease keys, and event-queue-based plane sweep algorithms (e.g. for finding all crossing pairs in a set of line segments) which need the ability to delete items from other than the top.
To see how important the lack of these operations is, I decided to compare two implementations of Dijkstra's algorithm. The priority-dict implementation from http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/119466 takes as input a graph, coded as nested dicts {vertex: {neighbor: edge length}}. This is a variation of a graph coding suggested in one of Guido's essays that, as Raymond suggests, avoids using a separate class based interface.
Here's a simplification of my dictionary-based Dijkstra implementation:
def Dijkstra(G,start,end=None): D = {} # dictionary of final distances P = {} # dictionary of predecessors Q = priorityDictionary() # est.dist. of non-final vert. Q[start] = 0 for v in Q: D[v] = Q[v] for w in G[v]: vwLength = D[v] + G[v][w] if w not in D and (w not in Q or vwLength < Q[w]): Q[w] = vwLength P[w] = v return (D,P)
Here's a translation of the same implementation to heapq (untested since I'm not running 2.3). Since there is no decrease in heapq, nor any way to find and remove old keys, I changed the algorithm to add new tuples for each new key, leaving the old tuples in place until they bubble up to the top of the heap.
def Dijkstra(G,start,end=None): D = {} # dictionary of final distances P = {} # dictionary of predecessors Q = [(0,None,start)] # heap of (est.dist., pred., vert.) while Q: dist,pred,v = heappop(Q) if v in D: continue # tuple outdated by decrease-key, ignore D[v] = dist P[v] = pred for w in G[v]: heappush(Q, (D[v] + G[v][w], v, w)) return (D,P)
My analysis of the differences between the two implementations:
- The heapq version is slightly complicated (the two lines if...continue) by the need to explicitly ignore tuples with outdated priorities. This need for inserting low-level data structure maintenance code into higher-level algorithms is intrinsic to using heapq, since its data is not structured in a way that can support efficient decrease key operations.
- Since the heap version had no way to determine when a new key was smaller than an old one, the heapq implementation needed two separate data structures to maintain predecessors (middle elements of tuples for items in queue, dictionary P for items already removed from queue). In the dictionary implementation, both types of items stored their predecessors in P, so there was no need to transfer this information from one structure to another.
- The dictionary version is slightly complicated by the need to look up old heap keys and compare them with the new ones instead of just blasting new tuples onto the heap. So despite the more-flexible heap structure of the dictionary implementation, the overall code complexity of both implementations ends up being about the same.
- Heapq forced me to build tuples of keys and items, while the dictionary based heap did not have the same object-creation overhead (unless it's hidden inside the creation of dictionary entries). On the other hand, since I was already building tuples, it was convenient to also store predecessors in them instead of in some other structure.
- The heapq version uses significantly more storage than the dictionary: proportional to the number of edges instead of the number of vertices.
- The changes I made to Dijkstra's algorithm in order to use heapq might not have been obvious to a non-expert; more generally I think this lack of flexibility would make it more difficult to use heapq for cookbook-type implementation of textbook algorithms.
- In Dijkstra's algorithm, it was easy to identify and ignore outdated heap entries, sidestepping the inability to decrease keys. I'm not convinced that this would be as easy in other applications of heaps.
- One of the reasons to separate data structures from the algorithms that use them is that the data structures can be replaced by ones with equivalent behavior, without changing any of the algorithm code. The heapq Dijkstra implementation is forced to include code based on the internal details of heapq (specifically, the line initializing the heap to be a one element list), making it less flexible for some uses. The usual reason one might want to replace a data structure is for efficiency, but there are others: for instance, I teach various algorithms classes and might want to use an implementation of Dijkstra's algorithm as a testbed for learning about different priority queue data structures. I could do that with the dictionary-based implementation (since it shows nothing of the heap details) but not the heapq one.
Overall, while heapq was usable for implementing Dijkstra, I think it has significant shortcomings that could be avoided by a more well-thought-out interface that provided a little more functionality and a little clearer separation between interface and implementation.
[Raymond Hettinger]
I'm quite pleased with the version already in CVS. It is a small masterpiece of exposition, sophistication, simplicity, and speed. A class based interface is not necessary for every algorithm.
[David Eppstein]
It has some elegance, but omits basic operations that are necessary for many heap-based algorithms and are not provided by this interface.
I think Raymond was telling you it isn't intended to be "an interface", rather it's quite up-front about being a collection of functions that operate directly on a Python list, implementing a heap in a very straightforward way, and deliberately not trying to hide any of that. IOW, it's a concrete data type, not an abstract one. I asked, and it doesn't feel like apologizing for being what it is <wink>.
That's not to say Python couldn't benefit from providing an abstract heap API too, and umpteen different implementations specialized to different kinds of heap applications. It is saying that heapq isn't trying to be that, so pointing out that it isn't falls kinda flat.
Specifically, the three algorithms that use heaps in my upper-division undergraduate algorithms classes are heapsort (for which heapq works fine, but you would generally want to use L.sort() instead), Dijkstra's algorithm (and its relatives such as A* and Prim), which needs the ability to decrease keys, and event-queue-based plane sweep algorithms (e.g. for finding all crossing pairs in a set of line segments) which need the ability to delete items from other than the top.
Then some of those will want a different implementation of a heap. The algorithms in heapq are still suitable for many heap applications, such as maintaining an N-best list (like retaining only the 10 best-scoring items in a long sequence), and A* on a search tree (when there's only one path to a node, decrease-key isn't needed; A* on a graph is harder).
To see how important the lack of these operations is, I decided to compare two implementations of Dijkstra's algorithm.
I don't think anyone claimed-- or would claim --that a heapq is suitable for all heap purposes.
The priority-dict implementation from http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/119466 takes as input a graph, coded as nested dicts {vertex: {neighbor: edge length}}. This is a variation of a graph coding suggested in one of Guido's essays that, as Raymond suggests, avoids using a separate class based interface.
Here's a simplification of my dictionary-based Dijkstra implementation:
def Dijkstra(G,start,end=None): D = {} # dictionary of final distances P = {} # dictionary of predecessors Q = priorityDictionary() # est.dist. of non-final vert. Q[start] = 0 for v in Q: D[v] = Q[v] for w in G[v]: vwLength = D[v] + G[v][w] if w not in D and (w not in Q or vwLength < Q[w]): Q[w] = vwLength P[w] = v return (D,P)
Here's a translation of the same implementation to heapq (untested since I'm not running 2.3). Since there is no decrease in heapq, nor any way to find and remove old keys,
A heapq *is* a list, so you could loop over the list to find an old object. I wouldn't recommend that in general <wink>, but it's easy, and if the need is rare then the advertised fact that a heapq is a plain list can be very convenient. Deleting an object from "the interior" still isn't supported directly, of course. It's possible to do so efficiently with this implementation of a heap, but since it doesn't support an efficient way to find an old object to begin with, there seemed little point to providing an efficient delete-by-index function. Here's one such:
import heapq
def delete_obj_at_index(heap, pos): lastelt = heap.pop() if pos >= len(heap): return
# The rest is a lightly fiddled variant of heapq._siftup. endpos = len(heap) # Bubble up the smaller child until hitting a leaf. childpos = 2*pos + 1 # leftmost child position while childpos < endpos: # Set childpos to index of smaller child. rightpos = childpos + 1 if rightpos < endpos and heap[rightpos] <= heap[childpos]: childpos = rightpos # Move the smaller child up. heap[pos] = heap[childpos] pos = childpos childpos = 2*pos + 1 # The leaf at pos is empty now. Put lastelt there, and and bubble # it up to its final resting place (by sifting its parents down). heap[pos] = lastelt heapq._siftdown(heap, 0, pos)
I changed the algorithm to add new tuples for each new key, leaving the old tuples in place until they bubble up to the top of the heap.
def Dijkstra(G,start,end=None): D = {} # dictionary of final distances P = {} # dictionary of predecessors Q = [(0,None,start)] # heap of (est.dist., pred., vert.) while Q: dist,pred,v = heappop(Q) if v in D: continue # tuple outdated by decrease-key, ignore D[v] = dist P[v] = pred for w in G[v]: heappush(Q, (D[v] + G[v][w], v, w)) return (D,P)
My analysis of the differences between the two implementations:
- The heapq version is slightly complicated (the two lines
if...continue) by the need to explicitly ignore tuples with outdated priorities. This need for inserting low-level data structure maintenance code into higher-level algorithms is intrinsic to using heapq, since its data is not structured in a way that can support efficient decrease key operations.
It surprised me that you tried using heapq at all for this algorithm. I was also surprised that you succeeded <0.9 wink>.
- Since the heap version had no way to determine when a new key was
smaller than an old one, the heapq implementation needed two separate data structures to maintain predecessors (middle elements of tuples for items in queue, dictionary P for items already removed from queue). In the dictionary implementation, both types of items stored their predecessors in P, so there was no need to transfer this information from one structure to another.
- The dictionary version is slightly complicated by the need to look up
old heap keys and compare them with the new ones instead of just blasting new tuples onto the heap. So despite the more-flexible heap structure of the dictionary implementation, the overall code complexity of both implementations ends up being about the same.
- Heapq forced me to build tuples of keys and items, while the
dictionary based heap did not have the same object-creation overhead (unless it's hidden inside the creation of dictionary entries).
Rest easy, it's not.
On the other hand, since I was already building tuples, it was convenient to also store predecessors in them instead of in some other structure.
- The heapq version uses significantly more storage than the dictionary:
proportional to the number of edges instead of the number of vertices.
- The changes I made to Dijkstra's algorithm in order to use heapq might
not have been obvious to a non-expert; more generally I think this lack of flexibility would make it more difficult to use heapq for cookbook-type implementation of textbook algorithms.
Depends on the specific algorithms in question, of course. No single heap implementation is the best choice for all algorithms, and heapq would be misleading people if, e.g., it did offer a decrease_key function -- it doesn't support an efficient way to do that, and it doesn't pretend to.
- In Dijkstra's algorithm, it was easy to identify and ignore outdated
heap entries, sidestepping the inability to decrease keys. I'm not convinced that this would be as easy in other applications of heaps.
All that is explaining why this specific implementation of a heap isn't suited to the task at hand. I don't believe that was at issue, though. An implementation of a heap that is suited for this task may well be less suited for other tasks.
- One of the reasons to separate data structures from the algorithms
that use them is that the data structures can be replaced by ones with equivalent behavior, without changing any of the algorithm code. The heapq Dijkstra implementation is forced to include code based on the internal details of heapq (specifically, the line initializing the heap to be a one element list), making it less flexible for some uses. The usual reason one might want to replace a data structure is for efficiency, but there are others: for instance, I teach various algorithms classes and might want to use an implementation of Dijkstra's algorithm as a testbed for learning about different priority queue data structures. I could do that with the dictionary-based implementation (since it shows nothing of the heap details) but not the heapq one.
You can wrap any interface you like around heapq (that's very easy to do in Python), but it won't change that heapq's implementation is poorly suited to this application. priorityDictionary looks like an especially nice API for this specific algorithm, but, e.g., impossible to use directly for maintaining an N-best queue (priorityDictionary doesn't support multiple values with the same priority, right? if we're trying to find the 10,000 poorest people in America, counting only one as dead broke would be too Republican for some peoples' tastes <wink>). OTOH, heapq is easy and efficient for *that* class of heap application.
Overall, while heapq was usable for implementing Dijkstra, I think it has significant shortcomings that could be avoided by a more well-thought-out interface that provided a little more functionality and a little clearer separation between interface and implementation.
heapq isn't trying to separate them at all -- quite the contrary! It's much like the bisect module that way. They find very good uses in practice.
I should note that I objected to heapq at the start, because there are several important heap implementation techniques, and just one doesn't fit anyone all the time. My objection went away when Guido pointed out how much like bisect it is: since it doesn't pretend one whit to generality or opaqueness, it can't be taken as promising more than it actually does, nor can it interfere with someone (so inclined) defining a general heap API: it's not even a class, just a handful of functions. Useful, too, just as it is. A general heap API would be nice, but it wouldn't have much (possibly nothing) to do with heapq.
On 5/1/03 2:13 AM -0400 Tim Peters tim_one@email.msn.com wrote:
It surprised me that you tried using heapq at all for this algorithm. I was also surprised that you succeeded <0.9 wink>.
Wink noted, but it surprised me too, a little. I had thought decrease key was a necessary part of the algorithm, not something that could be finessed like that.
You can wrap any interface you like around heapq (that's very easy to do in Python), but it won't change that heapq's implementation is poorly suited to this application. priorityDictionary looks like an especially nice API for this specific algorithm, but, e.g., impossible to use directly for maintaining an N-best queue (priorityDictionary doesn't support multiple values with the same priority, right? if we're trying to find the 10,000 poorest people in America, counting only one as dead broke would be too Republican for some peoples' tastes <wink>). OTOH, heapq is easy and efficient for *that* class of heap application.
I agree with your main points (heapq's inability to handle certain priority queue applications doesn't mean it's useless, and its implementation-specific API helps avoid fooling programmers into thinking it's any more than what it is). But I am confused at this example. Surely it's just as easy to store (income,identity) tuples in either data structure.
If you mean, you want to find the 10k smallest income values (rather than the people having those incomes), then it may be that a better data structure would be a simple list L in which the value of L[i] is the count of people with income i.
[Tim]
... priorityDictionary looks like an especially nice API for this specific algorithm, but, e.g., impossible to use directly for maintaining an N- best queue (priorityDictionary doesn't support multiple values with the same priority, right?
That was wrong: the dict maps items to priorities, and I read it backwards. Sorry!
if we're trying to find the 10,000 poorest people in America, counting only one as dead broke would be too Republican for some peoples' tastes <wink>). OTOH, heapq is easy and efficient for *that* class of heap application.
[David Eppstein]
I agree with your main points (heapq's inability to handle certain priority queue applications doesn't mean it's useless, and its implementation-specific API helps avoid fooling programmers into thinking it's any more than what it is). But I am confused at this example. Surely it's just as easy to store (income,identity) tuples in either data structure.
As above, I was inside out.
"Just as easy" can't be answered without trying to write actual code, though. Given that heapq and priorityDictionary are both min-heaps, to avoid artificial pain let's look for the people with the N highest incomes instead.
For an N-best queue using heapq, "the natural" thing is to define people like so:
class Person: def __init__(self, income): self.income = income
def __cmp__(self, other): return cmp(self.income, other.income)
and then the N-best calculation is as follows; it's normal in N-best applications that N is much smaller than the number of items being ranked, and you don't want to consume more than O(N) memory (for example, google wants to show you the best-scoring 25 documents of the 6 million matches it found):
""" # N-best queue for people with the N largest incomes. import heapq
dummy = Person(-1) # effectively an income of -Inf q = [dummy] * N # it's fine to use the same object N times
for person in people: if person > q[0]: heapq.heapreplace(q, person)
# The result list isn't sorted. result = [person for person in q if q is not dummy] """
I'm not as experienced with priorityDictionary. For heapq, the natural __cmp__ is the one that compares objects' priorities. For priorityDictionary, we can't use that, because Person instances will be used as dict keys, and then two Persons with the same income couldn't be in the queue at the same time. So Person.__cmp__ will have to change in such a way that distinct Persons never compare equal. I also have to make sure that a Person is hashable. I see there's another subtlety, apparent only from reading the implementation code: in the heap maintained alongside the dict, it's actually
(priority, object)
tuples that get compared. Since I expect to see Persons with equal income, when two such tuples get compared, they'll tie on the priority, and go on to compare the Persons. So I have to be sure too that comparing two Persons is cheap.
Pondering all that for a while, it seems best to make sure Person doesn't define __cmp__ or __hash__ at all. Then instances will get compared by memory address, distinct Persons will never compare equal, comparing Persons is cheap, and hashing by memory address is cheap too:
class Person: def __init__(self, income): self.income = income
The N-best code is then:
""" q = priorityDictionary()
for dummy in xrange(N): q[Person(-1)] = -1 # have to ensure these are distinct Persons
for person in people: if person.income > q.smallest().income: del q[q.smallest()] q[person] = person.income
# The result list is sorted. result = [person for person in q if person.income != -1] """
Perhaps paradoxically, I had to know essentially everything about how priorityDictionary is implemented to write a correct and efficient algorithm here. That was true of heapq too, of course, but there were fewer subtleties to trip over there, and heapq isn't trying to hide its implementation.
BTW, there's a good use of heapq for you: you could use it to maintain the under-the-covers heap inside priorityDictionary! It would save much of the code, and possibly speed it too (heapq's implementation of popping usually requires substantially fewer comparisons than priorityDictionary.smallest uses; this is explained briefly in the comments before _siftup, deferring to Knuth for the gory details).
If you mean, you want to find the 10k smallest income values (rather than the people having those incomes), then it may be that a better data structure would be a simple list L in which the value of L[i] is the count of people with income i.
Well, leaving pennies out of it, incomes in the USA span 9 digits, so something taking O(N) memory would still be most attractive.
On 5/4/03 1:26 AM -0400 Tim Peters tim.one@comcast.net wrote:
it's normal in N-best applications that N is much smaller than the number of items being ranked, and you don't want to consume more than O(N) memory (for example, google wants to show you the best-scoring 25 documents of the 6 million matches it found):
Ok, I think you're right, for this sort of thing heapq is better. One could extend my priorityDictionary code to limit memory like this but it would be unnecessary work when the extra features it has over heapq are not used for this sort of algorithm.
On the other hand, if you really want to find the n best items in a data stream large enough that you care about using only space O(n), it might also be preferable to take constant amortized time per item rather than the O(log n) that heapq would use, and it's not very difficult nor does it require any fancy data structures. Some time back I needed some Java code for this, haven't had an excuse to port it to Python. In case anyone's interested, it's online at http://www.ics.uci.edu/~eppstein/161/KBest.java. Looking at it now, it seems more complicated than it needs to be, but maybe that's just the effect of writing in Java instead of Python (I've seen an example of a three-page Java implementation of an algorithm in a textbook that could easily be done in a dozen Python lines).
In article <17342817.1052002018@[10.0.1.2]>, David Eppstein eppstein@ics.uci.edu wrote:
On the other hand, if you really want to find the n best items in a data stream large enough that you care about using only space O(n), it might also be preferable to take constant amortized time per item rather than the O(log n) that heapq would use, and it's not very difficult nor does it require any fancy data structures. Some time back I needed some Java code for this, haven't had an excuse to port it to Python. In case anyone's interested, it's online at http://www.ics.uci.edu/~eppstein/161/KBest.java.
BTW, the central idea here is to use a random quicksort pivot to shrink the list, when it grows too large.
In python, this could be done without randomization as simply as
def addToNBest(L,x,N): L.append(x) if len(L) > 2*N: L.sort() del L[N:]
It's not constant amortized time due to the sort, but that's probably more than made up for due to the speed of compiled sort versus interpreted randomized pivot.
[David Eppstein]
Ok, I think you're right, for this sort of thing heapq is better. One could extend my priorityDictionary code to limit memory like this but it would be unnecessary work when the extra features it has over heapq are not used for this sort of algorithm.
I don't believe memory usage was an issue here. Take a look at the code again (comments removed):
""" q = priorityDictionary()
for dummy in xrange(N): q[Person(-1)] = -1
for person in people: if person.income > q.smallest().income: del q[q.smallest()] q[person] = person.income """
q starts with N entries. Each trip around the loop either leaves the q contents alone, or both removes and adds an entry. So the size of the dict is a loop invariant, len(q) == N. In the cases where it does remove an entry, it always removes the smallest entry, and the entry being added is strictly larger than that, so calling q.smallest() at the start of the next loop trip finds the just-deleted smallest entry still in self.__heap[0], and removes it. So the internal list may grow to N+1 entries immediately following
del q[q.smallest()]
but by the time we get to that line again it should be back to N entries again.
The reasons I found heapq easier to live with in this specific app had more to do with the subtleties involved in sidestepping potential problems with __hash__, __cmp__, and the speed of tuple comparison when the first tuple elements tie. heapq also supplies a "remove current smallest and replace with a new value" primitive, which happens to be just right for this app (that isn't an accident <wink>):
""" dummy = Person(-1) q = [dummy] * N
for person in people: if person > q[0]: heapq.heapreplace(q, person) """
On the other hand, if you really want to find the n best items in a data stream large enough that you care about using only space O(n), it might also be preferable to take constant amortized time per item rather than the O(log n) that heapq would use,
In practice, it's usually much faster than that. Over time, it gets rarer and rarer for
person > q[0]
to be true (the new person has to be larger than the N-th largest seen so far, and that bar gets raised whenever a new person manages to hurdle it), and the vast majority of sequence elements are disposed with via that single Python statement (the ">" test fails, and we move on to the next element with no heap operations). In the simplest case, if N==1, the incoming data is randomly ordered, and the incoming sequence has M elements, the if-test is true (on average) only ln(M) times (the expected number of left-to-right maxima). The order statistics get more complicated as N increases, of course, but in practice it remains very fast, and doing a heapreplace() on every incoming item is the worst case (achieved if the items come in sorted order; the best case is when they come in reverse-sorted order, in which case min(M, N) heapreplace() operations are done).
and it's not very difficult nor does it require any fancy data structures. Some time back I needed some Java code for this, haven't had an excuse to port it to Python. In case anyone's interested, it's online at http://www.ics.uci.edu/~eppstein/161/KBest.java. Looking at it now, it seems more complicated than it needs to be, but maybe that's just the effect of writing in Java instead of Python (I've seen an example of a three-page Java implementation of an algorithm in a textbook that could easily be done in a dozen Python lines).
Cool! I understood the thrust but not the details -- and I agree Java must be making it harder than it should be <wink>.
In python, this could be done without randomization as simply as
def addToNBest(L,x,N): L.append(x) if len(L) > 2*N: L.sort() del L[N:]
It's not constant amortized time due to the sort, but that's probably more than made up for due to the speed of compiled sort versus interpreted randomized pivot.
I'll attach a little timing script. addToNBest is done inline there, some low-level tricks were played to speed it, and it was changed to be a max N-best instead of a min N-best. Note that the list sort in 2.3 has a real advantage over Pythons before 2.3 here, because it recognizes (in linear time) that the first half of the list is already in sorted order (on the second & subsequent sorts), and leaves it alone until a final merge step with the other half of the array.
The relative speed (compared to the heapq code) varies under 2.3, seeming to depend mostly on M/N. The test case is set up to find the 1000 largest of a million random floats. In that case the sorting method takes about 3.4x longer than the heapq approach. As N gets closer to M, the sorting method eventually wins; when M and N are both a million, the sorting method is 10x faster. For most N-best apps, M is much smaller than N, and the heapq code should be quicker unless the data is already in order.
The relative speed (compared to the heapq code) varies under 2.3, seeming to depend mostly on M/N. The test case is set up to find the 1000 largest of a million random floats. In that case the sorting method takes about 3.4x longer than the heapq approach. As N gets closer to M, the sorting method eventually wins; when M and N are both a million, the sorting method is 10x faster. For most N-best apps, M is much smaller than N, and the heapq code should be quicker unless the data is already in order.
FWIW, there is C implementation of heapq at: http://zhar.net/projects/python/
Raymond Hettinger
On 5/4/03 10:20 PM -0400 Tim Peters tim.one@comcast.net wrote:
In practice, it's usually much faster than that. Over time, it gets rarer and rarer for
person > q[0]
to be true (the new person has to be larger than the N-th largest seen so far, and that bar gets raised whenever a new person manages to hurdle it),
Good point. If any permutation of the input sequence is equally likely, and you're selecting the best k out of n items, the expected number of times you have to hit the data structure in your heapq solution is roughly k ln n, so the total expected time is O(n + k log k log n), with a really small constant factor on the O(n) term. The sorting solution I suggested has total time O(n log k), and even though sorting is built-in and fast it can't compete when k is small. Random pivoting is O(n + k), but with a larger constant factor, so your heapq solution looks like a winner.
For fairness, it might be interesting to try another run of your test in which the input sequence is sorted in increasing order rather than random. I.e., replace the random generation of seq by seq = range(M) I'd try it myself, but I'm still running python 2.2 and haven't installed heapq. I'd have to know more about your application to have an idea whether the sorted or randomly-permuted case is more representative.
[Raymond Hettinger]
FWIW, there is C implementation of heapq at: http://zhar.net/projects/python/
Cool! I thought the code was remarkably clear, until I realized it never checked for errors (e.g., PyList_Append() can run out of memory, and PyObject_RichCompareBool() can raise any exception). Those would have to be repaired, and doing so would slow it some.
If the heapq module survives with the same API for a release or two, it would be a fine candidate to move into C, or maybe Pyrex (fiddly little integer arithmetic interspersed via if/then/else with trivial array indexing aren't Python's strong suits).
[David Eppstein, on the bar-raising behavior of
person > q[0] ]
Good point. If any permutation of the input sequence is equally likely, and you're selecting the best k out of n items, the expected number of times you have to hit the data structure in your heapq solution is roughly k ln n, so the total expected time is O(n + k log k log n), with a really small constant factor on the O(n) term. The sorting solution I suggested has total time O(n log k), and even though sorting is built-in and fast it can't compete when k is small. Random pivoting is O(n + k), but with a larger constant factor, so your heapq solution looks like a winner.
In real Python Life, it's the fastest way I know (depending ...).
For fairness, it might be interesting to try another run of your test in which the input sequence is sorted in increasing order rather than random.
Comparing the worst case of one against the best case of the other isn't my idea of fairness <wink>, but sure. On the best-1000 of a million floats test, and sorting the floats first, the heap method ran about 30x slower than on random data, and the sort method ran significantly faster than on random data (a factor of 1.3x faster). OTOH, if I undo my speed tricks and call a function in the sort method (instead of doing it all micro-optimized inline), that slows the sort method by a bit over a factor of 2.
I.e., replace the random generation of seq by seq = range(M) I'd try it myself, but I'm still running python 2.2 and haven't installed heapq. I'd have to know more about your application to have an idea whether the sorted or randomly-permuted case is more representative.
Of course -- so would I <wink>.
Here's a surprise: I coded a variant of the quicksort-like partitioning method, at the bottom of this mail. On the largest-1000 of a million random-float case, times were remarkably steady across trials (i.e., using a different set of a million random floats each time):
heapq 0.96 seconds sort (micro-optimized) 3.4 seconds KBest (below) 2.6 seconds
The KBest code creates new lists with wild abandon. I expect it does better than the sort method anyway because it gets to exploit its own form of "raise the bar" behavior as more elements come in. For example, on the first run, len(buf) exceeded 3000 only 14 times, and the final pivot value each time is used by put() as an "ignore the input unless it's bigger than that" cutoff:
pivoted w/ 0.247497558554 pivoted w/ 0.611006884768 pivoted w/ 0.633565558936 pivoted w/ 0.80516673256 pivoted w/ 0.814304890889 pivoted w/ 0.884660572175 pivoted w/ 0.89986744075 pivoted w/ 0.946575251872 pivoted w/ 0.980386533221 pivoted w/ 0.983743795382 pivoted w/ 0.992381911217 pivoted w/ 0.994243625292 pivoted w/ 0.99481443021 pivoted w/ 0.997044443344
The already-sorted case is also a bad case for this method, because then the pivot is never big enough to trigger the early exit in put().
def split(seq, pivot): lt, eq, gt = [], [], [] lta, eqa, gta = lt.append, eq.append, gt.append for x in seq: c = cmp(x, pivot) if c < 0: lta(x) elif c: gta(x) else: eqa(x) return lt, eq, gt
# KBest(k, minusinf) remembers the largest k objects # from a sequence of objects passed one at a time to # put(). minusinf must be smaller than any object # passed to put(). After feeding in all the objects, # call get() to retrieve a list of the k largest (or # as many as were passed to put(), if put() was called # fewer than k times).
class KBest(object): __slots__ = 'k', 'buflim', 'buf', 'cutoff'
def __init__(self, k, minusinf): self.k = k self.buflim = 3*k self.buf = [] self.cutoff = minusinf
def put(self, obj): if obj <= self.cutoff: return
buf = self.buf buf.append(obj) if len(buf) <= self.buflim: return
# Reduce len(buf) by at least one, by retaining # at least k, and at most len(buf)-1, of the # largest objects in buf. from random import choice sofar = [] k = self.k while len(sofar) < k: pivot = choice(buf) buf, eq, gt = split(buf, pivot) sofar.extend(gt) if len(sofar) < k: sofar.extend(eq[:k - len(sofar)])
self.buf = sofar self.cutoff = pivot
def get(self): from random import choice buf = self.buf k = self.k if len(buf) <= k: return buf
# Retain only the k largest. sofar = [] needed = k while needed: pivot = choice(buf) lt, eq, gt = split(buf, pivot) if len(gt) <= needed: sofar.extend(gt) needed -= len(gt) if needed: takefromeq = min(len(eq), needed) sofar.extend(eq[:takefromeq]) needed -= takefromeq # If we still need more, they have to # come out of things < pivot. buf = lt else: # gt alone is too large. buf = gt
assert len(sofar) == k self.buf = sofar return sofar
In article LNBBLJKPBEHFEDALKOLCIEAFEFAB.tim.one@comcast.net, Tim Peters tim.one@comcast.net wrote:
For fairness, it might be interesting to try another run of your test in which the input sequence is sorted in increasing order rather than random.
Comparing the worst case of one against the best case of the other isn't my idea of fairness <wink>, but sure.
Well, it doesn't seem any fairer to use random data to compare an algorithm with an average time bound that depends on an assumption of randomness in the data...anyway, the point was more to understand the limiting cases. If one algorithm is usually 3x faster than the other, and is never more than 10x slower, that's better than being usually 3x faster but sometimes 1000x slower, for instance.
I'd have to know more about your application to have an idea whether the sorted or randomly-permuted case is more representative.
Of course -- so would I <wink>.
My Java KBest code was written to make data subsets for a half-dozen web pages (same data selected according to different criteria). Of these six instances, one is presented the data in roughly ascending order, one in descending order, and the other four are less clear but probably not random.
Robustness in the face of this sort of variation is why I prefer any average-case assumptions in my code's performance to depend only on randomness from a random number generator, and not arbitrariness in the actual input. But I'm not sure I'd usually be willing to pay a 3x penalty for that robustness.
Here's a surprise: I coded a variant of the quicksort-like partitioning method, at the bottom of this mail. On the largest-1000 of a million random-float case, times were remarkably steady across trials (i.e., using a different set of a million random floats each time):
heapq 0.96 seconds sort (micro-optimized) 3.4 seconds KBest (below) 2.6 seconds
Huh. You're almost convincing me that asymptotic analysis works even in the presence of Python's compiled-vs-interpreted anomalies. The other surprise is that (unlike, say, the sort or heapq versions) your KBest doesn't look significantly more concise than my earlier Java implementation.
[David Eppstein]
For fairness, it might be interesting to try another run of your test in which the input sequence is sorted in increasing order rather than random.
[Tim]
Comparing the worst case of one against the best case of the other isn't my idea of fairness <wink>, but sure.
[David]
Well, it doesn't seem any fairer to use random data to compare an algorithm with an average time bound that depends on an assumption of randomness in the data...anyway, the point was more to understand the limiting cases. If one algorithm is usually 3x faster than the other, and is never more than 10x slower, that's better than being usually 3x faster but sometimes 1000x slower, for instance.
Sure. In practice you need to know time distributions when using an algorithm -- best, expected, worse, and how likely each are under a variety of expected conditions.
My Java KBest code was written to make data subsets for a half-dozen web pages (same data selected according to different criteria). Of these six instances, one is presented the data in roughly ascending order, one in descending order, and the other four are less clear but probably not random.
Robustness in the face of this sort of variation is why I prefer any average-case assumptions in my code's performance to depend only on randomness from a random number generator, and not arbitrariness in the actual input. But I'm not sure I'd usually be willing to pay a 3x penalty for that robustness.
Most people aren't, until they hit a bad case <0.5 wink>. So "pure" algorithms rarely survive in the face of a large variety of large problem instances. The monumental complications Python's list.sort() endures to work well under many conditions (both friendly and hostile) is a good example of that. In industrial real life, I expect an all-purpose N-Best queue would need to take a hybrid approach, monitoring its fast-path gimmick in some cheap way in order to fall back to a more defensive algorithm when the fast-path gimmick isn't paying.
Here's a surprise: I coded a variant of the quicksort-like partitioning method, at the bottom of this mail. On the largest-1000 of a million random-float case, times were remarkably steady across trials (i.e., using a different set of a million random floats each time):
heapq 0.96 seconds sort (micro-optimized) 3.4 seconds KBest (below) 2.6 seconds
Huh. You're almost convincing me that asymptotic analysis works even in the presence of Python's compiled-vs-interpreted anomalies.
Indeed, you can't fight the math! It often takes a large problem for better O() behavior to overcome a smaller constant in a worse O() approach, and especially in Python. For example, I once wrote and tuned and timed an O(N) worst-case rank algorithm in Python ("find the k'th smallest item in a sequence"), using the median-of-medians-of-5 business. I didn't have enough RAM at the time to create a list big enough for it to beat "seq.sort(); return seq[k]". By playing lots of tricks, and boosting it to median-of-medians-of-11, IIRC I eventually got it to run faster than sorting on lists with "just" a few hundred thousand elements.
But in *this* case I'm not sure that the only thing we're really measuring isn't:
1. Whether an algorithm has an early-out gimmick. 2. How effective that early-out gimmick is. and 3. How expensive it is to *try* the early-out gimmick.
The heapq method Rulz on random data because its answers then are "yes, very, dirt cheap". I wrote the KBest test like so:
def three(seq, N): NBest = KBest(N, -1e200) for x in seq: NBest.put(x) L = NBest.get() L.sort() return L
(the sort at the end is just so the results can be compared against the other methods, to ensure they all get the same answer). If I break into the abstraction and change the test like so:
def three(seq, N): NBest = KBest(N, -1e200) cutoff = -1e200 for x in seq: if x > cutoff: NBest.put(x) cutoff = NBest.cutoff L = NBest.get() L.sort() return L
then KBest is about 20% *faster* than heapq on random data. Doing the comparison inline avoids a method call when early-out pays, early-out pays more and more as the sequence nears its end, and simply avoiding the method call then makes the overall algorithm 3X faster. So O() analysis may triumph when equivalent low-level speed tricks are played (the heapq method did its early-out test inline too), but get swamped before doing so.
The other surprise is that (unlike, say, the sort or heapq versions) your KBest doesn't look significantly more concise than my earlier Java implementation.
The only thing I was trying to minimize was my time in whipping up something correct to measure. Still, I count 107 non-blank, non-comment lines of Java, and 59 of Python. Java gets unduly penalized for curly braces, Python for tedious tricks like
buf = self.buf k = self.k
to make locals for speed, and that I never put dependent code on the same line as an "if" or "while" test (while you do). Note that it's not quite the same algorithm: the Python version isn't restricted to ints, and in particular doesn't assume it can do arithmetic on a key to get "the next larger" key. Instead it does 3-way partitioning to find the items equal to the pivot. The greater generality may make the Python a little windier.
BTW, the heapq code isn't really more compact than C, if you count the implementation code in heapq.py too: it's all low-level small-int arithmetic and array indexing. The only real advantage Python has over assembler for code like that is that we can grow the list/heap dynamically without any coding effort.