Priority queue (binary heap) python code
I often find myself needing priority queues in python, and I've finally broken down and written a simple implementation. Previously I've used sorted lists (via bisect) to get the job done, but the heap code significantly improves performance. There are C based implementations, but the effort of compiling in an extension often isn't worth it. I'm including the code here for everyone's amusement.

Any chance something like this could make it into the standard python library? It would save a lot of time for lazy people like myself. :-)

Cheers,
-Kevin

def heappush(heap, item):
    pos = len(heap)
    heap.append(None)
    while pos:
        parentpos = (pos - 1) / 2
        parent = heap[parentpos]
        if item <= parent:
            break
        heap[pos] = parent
        pos = parentpos
    heap[pos] = item

def heappop(heap):
    endpos = len(heap) - 1
    if endpos <= 0:
        return heap.pop()
    returnitem = heap[0]
    item = heap.pop()
    pos = 0
    while 1:
        child2pos = (pos + 1) * 2
        child1pos = child2pos - 1
        if child2pos < endpos:
            child1 = heap[child1pos]
            child2 = heap[child2pos]
            if item >= child1 and item >= child2:
                break
            if child1 > child2:
                heap[pos] = child1
                pos = child1pos
                continue
            heap[pos] = child2
            pos = child2pos
            continue
        if child1pos < endpos:
            child1 = heap[child1pos]
            if child1 > item:
                heap[pos] = child1
                pos = child1pos
        break
    heap[pos] = item
    return returnitem

--
------------------------------------------------------------------------
| Kevin O'Connor                     "BTW, IMHO we need a FAQ for       |
| kevin@koconnor.net                  'IMHO', 'FAQ', 'BTW', etc. !"     |
------------------------------------------------------------------------
Kevin> I often find myself needing priority queues in python, and I've
Kevin> finally broken down and written a simple implementation.

Hmmm... I don't see a priority associated with items when you push them onto the queue in heappush(). This seems somewhat different than my notion of a priority queue.

Seems to me that you could implement the type of priority queue I'm thinking of rather easily using a class that wraps a list of Queue.Queue objects. Am I missing something obvious?

--
Skip Montanaro
skip@pobox.com
consulting: http://manatee.mojam.com/~skip/resume.html
On Mon, Jun 24, 2002 at 09:00:49PM -0500, Skip Montanaro wrote:
Kevin> I often find myself needing priority queues in python, and I've
Kevin> finally broken down and written a simple implementation.
Hmmm... I don't see a priority associated with items when you push them onto the queue in heappush(). This seems somewhat different than my notion of a priority queue.
Hi Skip, I should have included a basic usage in my original email:
t = []; heappush(t, 10); heappush(t, 20); heappush(t, 15); heappush(t, 5)
print heappop(t), heappop(t), heappop(t), heappop(t)
20 15 10 5
The binary heap has the property that pushing takes O(log n) time and popping takes O(log n) time. One may push in any order and a pop() always returns the greatest item in the list. I don't explicitly associate a priority with every item in the queue - instead I rely on the user having a __cmp__ operator defined on the items (if the default does not suffice). The same behavior can be obtained using sorted lists:
from bisect import insort
t = []; insort(t, 10); insort(t, 20); insort(t, 15); insort(t, 5)
print t.pop(), t.pop(), t.pop(), t.pop()
20 15 10 5
But insort takes a lot more overhead on large lists.
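A rough sketch of the kind of measurement behind that claim (not from the original thread), using timeit and the heapq module that was added to the standard library after this discussion as a stand-in for the posted functions; heapq keeps a min-heap rather than a max-heap, which doesn't change the timing story, and the size here is arbitrary:

# Illustrative timing sketch: build a container of N random numbers with
# bisect.insort versus heapq.heappush.
import bisect
import heapq
import random
import timeit

def fill_sorted(items):
    seq = []
    for x in items:
        bisect.insort(seq, x)    # O(log n) search, but O(n) shift on insert
    return seq

def fill_heap(items):
    heap = []
    for x in items:
        heapq.heappush(heap, x)  # O(log n) sift-up, no wholesale shifting
    return heap

items = [random.random() for _ in range(100000)]
for name, fill in [("insort", fill_sorted), ("heappush", fill_heap)]:
    seconds = timeit.timeit(lambda: fill(items), number=1)
    print("%-9s %.3f seconds" % (name, seconds))

On a large list each insort call still pays to shift the tail of the list, while each heappush only touches O(log n) entries, which is where the difference Kevin saw comes from.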
Seems to me that you could implement the type of priority queue I'm thinking of rather easily using a class that wraps a list of Queue.Queue objects. Am I missing something obvious?
Perhaps I am, because I do not see how one would use Queue.Queue efficiently for this task.

Cheers,
-Kevin

--
------------------------------------------------------------------------
| Kevin O'Connor                     "BTW, IMHO we need a FAQ for       |
| kevin@koconnor.net                  'IMHO', 'FAQ', 'BTW', etc. !"     |
------------------------------------------------------------------------
Kevin> I don't explicitly associate a priority with every item in the
Kevin> queue - instead I rely on the user having a __cmp__ operator
Kevin> defined on the items (if the default does not suffice).

That's what I missed.

>> Seems to me that you could implement the type of priority queue I'm
>> thinking of rather easily using a class that wraps a list of Queue.Queue
>> objects. Am I missing something obvious?

Kevin> Perhaps I am, because I do not see how one would use Queue.Queue
Kevin> efficiently for this task.

I don't know how efficient it would be, but I usually think that most applications have a small, fixed set of possible priorities, like ("low", "medium", "high") or ("info", "warning", "error", "fatal"). In this sort of situation my initial inclination would be to implement a dict of Queue instances which corresponds to the fixed set of priorities, something like:

import Queue

class PriorityQueue:
    def __init__(self, priorities):
        self.queues = {}
        self.marker = Queue.Queue()
        self.priorities = priorities
        for p in priorities:
            self.queues[p] = Queue.Queue()

    def put(self, obj, priority):
        self.queues[priority].put(obj)
        self.marker.put(None)

    def get(self):
        dummy = self.marker.get()
        # at this point we know one of the queues has an entry for us
        for p in self.priorities:
            try:
                return self.queues[p].get_nowait()
            except Queue.Empty:
                pass

if __name__ == "__main__":
    q = PriorityQueue(("low", "medium", "high"))
    q.put(12, "low")
    q.put(13, "high")
    q.put(14, "medium")
    print q.get()
    print q.get()
    print q.get()

Obviously this won't work if your set of priorities isn't fixed at the outset, but I think it's pretty straightforward, and it should work in multithreaded applications. It will also work if for some reason you want to queue up objects for which __cmp__ doesn't make sense.

Skip
Skip Montanaro <skip@pobox.com>:
I don't know how efficient it would be, but I usually think that most applications have a small, fixed set of possible priorities
Some applications of priority queues are like that, but others aren't -- e.g. an event queue in a discrete event simulation, where events are ordered by time. I expect that's the sort of application Kevin had in mind.

Greg Ewing, Computer Science Dept,   +--------------------------------------+
University of Canterbury,            | A citizen of NewZealandCorp, a       |
Christchurch, New Zealand            | wholly-owned subsidiary of USA Inc.  |
greg@cosc.canterbury.ac.nz           +--------------------------------------+
On Tuesday 25 June 2002 06:09 am, Skip Montanaro wrote: ...
I don't know how efficient it would be, but I usually think that most applications have a small, fixed set of possible priorities, like ("low", "medium", "high") or ("info", "warning", "error", "fatal"). In this sort
Then you do "bin sorting", of course -- always worth considering when you know the sort key can only take a small number of different values (as is the more general "radix sorting" when you have a few such keys, or a key that easily breaks down that way). But it IS rather a special case, albeit an important one (and quite possibly frequently occurring in some application areas). Alex
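A minimal sketch of that bin-sorting idea for a small, fixed set of priority levels; the class and level names here are made up for illustration:

from collections import deque

class BinPriorityQueue:
    """Bucket-per-level queue for a small, fixed set of priorities."""

    def __init__(self, levels):
        # levels are listed from highest priority to lowest
        self.levels = list(levels)
        self.bins = dict((level, deque()) for level in self.levels)

    def put(self, item, level):
        self.bins[level].append(item)      # O(1)

    def get(self):
        for level in self.levels:          # scan a handful of buckets
            if self.bins[level]:
                return self.bins[level].popleft()
        raise IndexError("queue is empty")

q = BinPriorityQueue(["high", "medium", "low"])
q.put("write log entry", "low")
q.put("page the admin", "high")
print(q.get())   # "page the admin" comes out first

put and get are O(1) apart from the scan over the handful of levels, which is exactly the special case Alex describes.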
I don't know how efficient it would be, but I usually think that most applications have a small, fixed set of possible priorities, like ("low", "medium", "high") or ("info", "warning", "error", "fatal"). In this sort of situation my initial inclination would be to implement a dict of Queue instances which corresponds to the fixed set of priorities, something like:
If priority queues were to be included, I'd rather add the necessary support in Queue to easily attach priority handling, if that's not already possible. Maybe adding a generic **kw parameter, and passing it to _put(), could help a bit.

The applications of a priority Queue I've used until now weren't able to use your approach. OTOH, there are many cases where you're right, and we could benefit from this. If there's general agreement that priority queues are that useful, we should probably add one or two subclasses of Queue in the Queue module (one with your approach and one with the more generic one). Otherwise, subclassing Queue is already easy enough, IMO (adding the **kw suggestion would avoid overloading put(), and seems reasonable to me).

Thanks!

--
Gustavo Niemeyer
[ 2AAC 7928 0FBF 0299 5EB5 60E2 2253 B29A 6664 3A0C ]
Gustavo Niemeyer wrote:
If priority queues were to be included, I'd rather add the necessary support in Queue to easily attach priority handling, if that's not already possible.
it takes a whopping four lines of code, if you're a pragmatic programmer:

#
# implementation

import Queue, bisect

class PriorityQueue(Queue.Queue):
    def _put(self, item):
        bisect.insort(self.queue, item)

#
# usage

queue = PriorityQueue(0)
queue.put((2, "second"))
queue.put((1, "first"))
queue.put((3, "third"))

priority, value = queue.get()

</F>
it takes a whopping four lines of code, if you're a pragmatic programmer:
Indeed. Using a tuple directly was a nice idea! I was thinking about a priority parameter (maybe I'm not that pragmatic? ;-), which is not hard either, but one will have to overload the put method to pass the priority parameter.

import Queue, bisect

class PriorityQueue(Queue.Queue):

    def __init__(self, maxsize=0, defaultpriority=0):
        self.defaultpriority = defaultpriority
        Queue.Queue.__init__(self, maxsize)

    def put(self, item, block=1, **kw):
        if block:
            self.fsema.acquire()
        elif not self.fsema.acquire(0):
            raise Queue.Full
        self.mutex.acquire()
        was_empty = self._empty()
        # <- Priority could be handled here as well.
        self._put(item, **kw)
        if was_empty:
            self.esema.release()
        if not self._full():
            self.fsema.release()
        self.mutex.release()

    def _put(self, item, **kw):   # <- But here seems better
        priority = kw.get("priority", self.defaultpriority)
        bisect.insort(self.queue, (priority, item))

    def _get(self):
        return self.queue.pop(0)[1]

--
Gustavo Niemeyer
[ 2AAC 7928 0FBF 0299 5EB5 60E2 2253 B29A 6664 3A0C ]
Gustavo wrote:
def put(self, item, block=1, **kw):
    if block:
        self.fsema.acquire()
    elif not self.fsema.acquire(0):
        raise Queue.Full
    self.mutex.acquire()
    was_empty = self._empty()
    # <- Priority could be handled here as well.
    self._put(item, **kw)
    if was_empty:
        self.esema.release()
    if not self._full():
        self.fsema.release()
    self.mutex.release()
def _put(self, item, **kw):   # <- But here seems better
    priority = kw.get("priority", self.defaultpriority)
    bisect.insort(self.queue, (priority, item))
or better:

def put(self, item, block=1, priority=None):
    if priority is None:
        priority = self.defaultpriority
    Queue.Queue.put(self, (priority, item), block)

</F>
On Mon, Jun 24, 2002 at 11:09:32PM -0500, Skip Montanaro wrote:
I don't know how efficient it would be, but I usually think that most applications have a small, fixed set of possible priorities, like ("low", "medium", "high") or ("info", "warning", "error", "fatal"). In this sort of situation my initial inclination would be to implement a dict of Queue instances which corresponds to the fixed set of priorities, something like:
Hi Skip,

The application I had in mind stored between 100,000 and 1,000,000 objects with priorities between 0 and 150. I found that moving from bisect to a heap improved performance of the entire program by about 25%.
It will also work if for some reason you want to queue up objects for which __cmp__ doesn't make sense.
I just assumed the user would use the (priority, data) tuple trick at the start (it does make the algorithm simpler). In a way, the code is very similar to the way the bisect module is implemented.

-Kevin

--
------------------------------------------------------------------------
| Kevin O'Connor                     "BTW, IMHO we need a FAQ for       |
| kevin@koconnor.net                  'IMHO', 'FAQ', 'BTW', etc. !"     |
------------------------------------------------------------------------
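A minimal sketch of that (priority, data) tuple trick, shown with the heapq module that later entered the standard library (a min-heap, so the smallest priority number pops first, unlike Kevin's max-heap functions); the task names are invented:

import heapq

queue = []
# Tuples compare field by field, so the priority decides the order and
# the payload just rides along.  heapq keeps a min-heap: the smallest
# priority value pops first (negate the numbers for max-first order).
heapq.heappush(queue, (20, "refresh screen"))
heapq.heappush(queue, (5, "write log"))
heapq.heappush(queue, (150, "handle interrupt"))

priority, task = heapq.heappop(queue)
# priority == 5, task == "write log"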
On Mon, Jun 24, 2002 at 09:33:18PM -0400, Kevin O'Connor wrote:
I often find myself needing priority queues in python, and I've finally broken down and written a simple implementation. Previously I've used sorted lists (via bisect) to get the job done, but the heap code significantly improves performance. There are C based implementations, but the effort of compiling in an extension often isn't worth it. I'm including the code here for everyone's amusement.
Any chance something like this could make it into the standard python library? It would save a lot of time for lazy people like myself. :-)
A sorted list is a much more general-purpose data structure than a priority queue and can be used to implement a priority queue. It offers almost the same asymptotic performance:

sorted list using splay tree (amortized):
  insert: O(log n)
  pop:    O(log n)
  peek:   O(log n)

priority queue using binary heap:
  insert: O(log n)
  pop:    O(log n)
  peek:   O(1)

The only advantage of a heap is O(1) peek which doesn't seem so critical. It may also have somewhat better performance by a constant factor because it uses an array rather than allocating node structures. But the internal order of a heap-based priority queue is very non-intuitive and quite useless for other purposes while a sorted list is, umm..., sorted!

Oren
Oren Tirosh <oren-py-d@hishome.net> writes:
The only advantage of a heap is O(1) peek which doesn't seem so critical. It may also have somewhat better performance by a constant factor because it uses an array rather than allocating node structures. But the internal order of a heap-based priority queue is very non-intuitive and quite useless for other purposes while a sorted list is, umm..., sorted!
I think the fact that heaps don't allocate additional memory is a valuable property, more valuable than the asymptotic complexity (which is also quite good). If you don't want to build priority queues, you can still use heaps to sort a list.

IMO, heaps are so standard as an algorithm that they belong in the Python library, in some form. It is then the user's choice to use that algorithm or not.

Regards,
Martin
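A minimal sketch of that last point -- with heap push/pop primitives in hand (the standard heapq module is used here), sorting is only a few lines:

import heapq

def heapsorted(iterable):
    """Return a new ascending list, built with a binary heap in O(n log n)."""
    heap = []
    for item in iterable:
        heapq.heappush(heap, item)
    return [heapq.heappop(heap) for _ in range(len(heap))]

print(heapsorted([20, 5, 15, 10]))   # [5, 10, 15, 20]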
On Tue, Jun 25, 2002 at 09:23:43AM +0200, Martin v. Loewis wrote:
Oren Tirosh <oren-py-d@hishome.net> writes:
The only advantage of a heap is O(1) peek which doesn't seem so critical. It may also have somewhat better performance by a constant factor because it uses an array rather than allocating node structures. But the internal order of a heap-based priority queue is very non-intuitive and quite useless for other purposes while a sorted list is, umm..., sorted!
I think the fact that heaps don't allocate additional memory is a valuable property, more valuable than the asymptotic complexity (which is also quite good). If you don't want to build priority queues, you can still use heaps to sort a list.
When I want to sort a list I just use .sort(). I don't care which algorithm is used. I don't care whether dictionaries are implemented using hash tables, some kind of tree structure or magic smoke. I just trust Python to use a reasonably efficient implementation. I always find it funny when C++ or Perl programmers refer to an associative array as a "hash".
IMO, heaps are so standard as an algorithm that they belong in the Python library, in some form. It is then the user's choice to use that algorithm or not.
Heaps are a "standard algorithm" only from a CS point of view. It doesn't have much to do with everyday programming. Let's put it this way: If Python has an extension module in the standard library implementing a sorted list, would you care enough about the specific binary heap implementation to go and write one or would you just use what you had in the library for a priority queue? ;-) Oren
Oren Tirosh <oren-py-d@hishome.net> writes:
When I want to sort a list I just use .sort(). I don't care which algorithm is used. I don't care whether dictionaries are implemented using hash tables, some kind of tree structure or magic smoke. I just trust Python to use a reasonably efficient implementation.
And nobody says you should think differently.
I always find it funny when C++ or Perl programmers refer to an associative array as a "hash".
I agree.
Heaps are a "standard algorithm" only from a CS point of view. It doesn't have much to do with everyday programming.
There are several reasons for this: In the case of Python, the standard .sort is indeed good for most applications. In general (including Python), usage of heapsort is rare since it is difficult to implement and not part of the standard library. Likewise, the naive priority queue implementation is good in most cases. If it were easier to use, I assume it would be used more often.
Let's put it this way: If Python has an extension module in the standard library implementing a sorted list, would you care enough about the specific binary heap implementation to go and write one or would you just use what you had in the library for a priority queue? ;-)
I don't understand this question: Why do I have to implement anything? Having heapsort in the library precisely means that I do not have to write an implementation. Regards, Martin
On Tue, Jun 25, 2002, Martin v. Loewis wrote:
IMO, heaps are so standard as an algorithm that they belong in the Python library, in some form. It is then the user's choice to use that algorithm or not.
Should this PEP be split in two, then? One for a new "AbstractData" package (that would include the heap algorithm) and one for an update to Queue that would use some algorithm from AbstractData. The latter might not even need a PEP. -- Aahz (aahz@pythoncraft.com) <*> http://www.pythoncraft.com/ Project Vote Smart: http://www.vote-smart.org/
Aahz <aahz@pythoncraft.com> writes:
Should this PEP be split in two, then? One for a new "AbstractData" package (that would include the heap algorithm) and one for an update to Queue that would use some algorithm from AbstractData. The latter might not even need a PEP.
I don't know. The author of the PEP would have the freedom to propose anything initially. Depending on the proposal, people will comment, then reorganizations might be necessary. Regards, Martin
[Aahz]
Should this PEP be split in two, then? One for a new "AbstractData" package (that would include the heap algorithm) and one for an update to Queue that would use some algorithm from AbstractData. The latter might not even need a PEP.
I'm chuckling, but to myself <wink>. By the time you add all the bells and whistles everyone may want out of "a priority queue", the interface gets so frickin' complicated that almost everyone will ignore the library and call bisect.insort() themself.

/F gives me a thread-safe Queue when I don't want to pay overheads for enforcing mutual exclusion in 99% of my priority-queue apps. Schemes that store (priority, object) tuples to exploit lexicographic comparison are convenient to code but a nightmare if priorities can ever be equal, and object comparison can raise exceptions, or object comparison can be expensive. Sometimes I want a min-queue, other times a max-queue. Sometimes I need efficient access to both ends. About a month ago I needed to write a priority queue that was especially efficient at adding thousands of new entries in one gulp. And so on.

It's easier to write appropriate code from scratch in Python than to figure out how to *use* a package profligate enough to contain canned solutions for all common and reasonable use cases. People have been known to gripe at the number of methods Python's simple little lists and dicts have sprouted -- heh heh.

BTW, the Zope BTree may be a good candidate to fold into Python. I'm not sure. It's a mountain of fairly sophisticated code with an interface so rich that it's genuinely hard to learn how to use it as intended -- the latter especially should appeal to just about everyone <wink>.
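For what it's worth, one common way around the equal-priority troubles Tim lists is to store (priority, sequence-number, item) triples so the payload objects are never compared at all; a minimal sketch, with the class and names invented for illustration and the standard heapq module doing the heap work:

import heapq
import itertools

class TaskQueue:
    """Priority queue that never compares the queued objects themselves."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()

    def put(self, item, priority):
        # The running counter breaks ties, so equal priorities fall back
        # to FIFO order and the items' own comparison is never consulted.
        heapq.heappush(self._heap, (priority, next(self._counter), item))

    def get(self):
        priority, _, item = heapq.heappop(self._heap)
        return item

q = TaskQueue()
q.put({"job": "first"}, 1)    # dicts have no useful ordering; fine here
q.put({"job": "second"}, 1)
print(q.get())                # {'job': 'first'}

This sidesteps the tuple-comparison traps, but unlike Queue.Queue it does nothing about thread safety, and the other gripes (min vs. max, bulk loading) still need their own answers -- which is rather Tim's point.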
On Wed, Jun 26, 2002, Tim Peters wrote:
[Aahz]
Should this PEP be split in two, then? One for a new "AbstractData" package (that would include the heap algorithm) and one for an update to Queue that would use some algorithm from AbstractData. The latter might not even need a PEP.
I'm chuckling, but to myself <wink>. By the time you add all the bells and whistles everyone may want out of "a priority queue", the interface gets so frickin' complicated that almost everyone will ignore the library and call bisect.insort() themself.
Fair enough -- but I didn't really know about bisect myself. Looking at the docs for bisect, it says that the code might be best used as a source code example. I think that having a package to dump similar kinds of code might be a good idea. It's not a substitute for a CS course, but...
And so on. It's easier to write appropriate code from scratch in Python than to figure out how to *use* a package profligate enough to contain canned solutions for all common and reasonable use cases. People have been known to gripe at the number of methods Python's simple little lists and dicts have sprouted -- heh heh.
Actually, I was expecting that the Queue PEP would be dropped once the AbstractData package got some momentum behind. I was just trying to be a tiny bit subtle. <wink> -- Aahz (aahz@pythoncraft.com) <*> http://www.pythoncraft.com/ Project Vote Smart: http://www.vote-smart.org/
aahz> Fair enough -- but I didn't really know about bisect myself.
aahz> Looking at the docs for bisect, it says that the code might be
aahz> best used as a source code example.

I always forget about it as well. I just added /F's four-line PriorityQueue class as an example in the bisect docs and a "seealso" pointing at the bisect doc to the Queue module doc.

Skip
Also, in case nobody has said so, worst-case performance for insertion into a large heap (log N) is much better than for insertion into a sorted list (N). Of course, in practice, it takes a really large heap to notice these effects.

-Dave
On Tue, Jun 25, 2002 at 02:52:03AM -0400, Oren Tirosh wrote:
Any chance something like this could make it into the standard python library? It would save a lot of time for lazy people like myself. :-)
A sorted list is a much more general-purpose data structure than a priority queue and can be used to implement a priority queue. It offers almost the same asymptotic performance:
Hi Oren,

I agree that some form of a balanced tree object would be more useful, but unfortunately it doesn't exist natively. A pure python implementation of heaps is a pretty straightforward addition. If, however, one were to consider adding C code then I would agree a tree object would be more valuable. As you surmised later, I wouldn't have bothered with a heap if trees were available.

In fact, I've always wondered why Python dictionaries use the hash algorithm instead of the more general binary tree algorithm. :-}

-Kevin

--
------------------------------------------------------------------------
| Kevin O'Connor                     "BTW, IMHO we need a FAQ for       |
| kevin@koconnor.net                  'IMHO', 'FAQ', 'BTW', etc. !"     |
------------------------------------------------------------------------
[Kevin O'Connor]
... In fact, I've always wondered why Python dictionaries use the hash algorithm instead of the more general binary tree algorithm. :-}
Speed. Zope and StandaloneZODB have a BTree package, which I've recently spent a good amount of time optimizing. Here's a timing driver:

"""
from time import clock as now

N = 1000000
indices = range(N)

def doit(constructor):
    d = constructor()
    t1 = now()
    for i in indices:
        d[i] = i
    t2 = now()
    for i in indices:
        assert d[i] == i
    t3 = now()
    for i in indices:
        del d[i]
    t4 = now()
    return t2-t1, t3-t2, t4-t3

def drive(constructor, n):
    print "Using", constructor.__name__, "on", N, "entries"
    for i in range(n):
        d1, d2, d3 = doit(constructor)
        print "construct %6.3f" % d1
        print "query %6.3f" % d2
        print "remove %6.3f" % d3

def dict():
    return {}

from BTrees.OOBTree import OOBTree

drive(OOBTree, 3)
drive(dict, 3)
"""

This is a little strained because I'm running it under Python 2.1.3. This favors the BTrees, because I also spent a lot of time optimizing Python's dicts for the Python 2.2 release; 2.1 doesn't have that stuff. OOBTrees are most similar to Python's dicts, mapping objects to objects. Here's a run:

Using OOBTree on 1000000 entries
construct  5.376
query      5.571
remove     4.065
construct  5.349
query      5.610
remove     4.211
construct  5.363
query      5.585
remove     4.374
Using dict on 1000000 entries
construct  1.411
query      1.336
remove     0.780
construct  1.382
query      1.335
remove     0.781
construct  1.376
query      1.334
remove     0.778

There's just no contest here. BTrees have many other virtues, like supporting range searches, and automatically playing nice with ZODB persistence, but they're plain sluggish compared to dicts.

To be completely fair and unfair at the same time <wink>, there are also 4 other flavors of Zope BTree, purely for optimization reasons. In particular, the IIBTree maps Python ints to Python ints, and does so by avoiding Python int objects altogether, storing C longs directly and comparing them at native "compare a long to a long" C speed. That's *almost* as fast as Python 2.1 int->int dicts (which endure all-purpose Python object comparison), except for deletion (the BTree spends a lot of time tearing apart all the tree pointers again).

Now that's a perfectly height-balanced search tree that "chunks up" blocks of keys for storage and speed efficiency, and rarely needs more than a simple local adjustment to maintain balance. I expect that puts it at the fast end of what can be achieved with a balanced tree scheme.

The ABC language (which Guido worked on before Python) used AVL trees for just about everything under the covers. It's not a coincidence that Python doesn't use balanced trees for anything <wink>.
From: "Tim Peters" <tim.one@comcast.net>

There's just no contest here. BTrees have many other virtues, like supporting range searches, and automatically playing nice with ZODB persistence, but they're plain sluggish compared to dicts. To be completely fair and unfair at the same time <wink>, there are also 4 other flavors of Zope BTree, purely for optimization reasons. In particular, the IIBTree maps Python ints to Python ints, and does so by avoiding Python int objects altogether, storing C longs directly and comparing them at native "compare a long to a long" C speed. That's *almost* as fast as Python 2.1 int->int dicts (which endure all-purpose Python object comparison), except for deletion (the BTree spends a lot of time tearing apart all the tree pointers again).

Now that's a perfectly height-balanced search tree that "chunks up" blocks of keys for storage and speed efficiency, and rarely needs more than a simple local adjustment to maintain balance. I expect that puts it at the fast end of what can be achieved with a balanced tree scheme.

The ABC language (which Guido worked on before Python) used AVL trees for just about everything under the covers. It's not a coincidence that Python doesn't use balanced trees for anything <wink>.

This is really interesting. When I was at Dragon (well, actually, after Tim left and it became L&H), I ported my natural language parsing/understanding system from Python to C++ so it could run quickly enough for embedded devices. The core of this system was an associative container, so I knew that its performance would be crucial. I used C++ generics which made it really easy to swap in different associative container implementations, and I tried lots, including the red-black tree containers built into most C++ implementations, and hash tables. My experience was that trying to come up with a hash function that would give a significant speed increase over the tree containers was extremely difficult, because it was really hard to come up with a good hash function. Furthermore, even if I succeeded, it was like black magic: it was inconsistent across my test cases and there was no way to understand why it worked well, and to get a feeling for how it would scale to problems outside those cases.

I ended up hand-coding a two-level scheme based on binary searches in contiguous arrays which blew away anything I'd been able to do with a hash table. My conclusion was that for general-purpose use, the red-black tree was pretty good, despite its relatively high memory overhead of 3 pointers per node: it places easy requirements on the user (supply a strict weak ordering) and provides predictable and smooth performance even asymptotically. On the other hand, hashing requires that the user supply both a hash function and an equality detector which must agree with one-another, requires hand-tuning of the hash function for performance, and is rather more unpredictable. We've been talking about adding hash-based containers to the C++ standard library but I'm reluctant on these grounds. It seems to me that when you really care about speed, some kind of hand-coded solution might be a better investment than trying to come up with a good hash function.

I'm ready to believe that hashing is the most appropriate choice for Python, but I wonder what makes the difference?

-Dave
[David Abrahams]
This is really interesting. When I was at Dragon (well, actually, after Tim left and it became L&H), I ported my natural language parsing/understanding system from Python to C++ so it could run quickly enough for embedded devices. The core of this system was an associative container, so I knew that its performance would be crucial. I used C++ generics which made it really easy to swap in different associative container implementations, and I tried lots, including the red-black tree containers built into most C++ implementations, and hash tables. My experience was that trying to come up with a hash function that would give a significant speed increase over the tree containers was extremely difficult, because it was really hard to come up with a good hash function.
There's more to a speedy hash implementation than just the hash function, of course.
Furthermore, even if I succeeded, it was like black magic: it was inconsistent across my test cases and there was no way to understand why it worked well, and to get a feeling for how it would scale to problems outside those cases.
Python's dictobject.c and .h have extensive comments about how Python's dicts work. Their behavior isn't mysterious, at least not after 10 years of thinking about it <0.9 wink>. Python's dicts also use tricks that I've never seen in print -- many people have contributed clever tricks.
I ended up hand-coding a two-level scheme based on binary searches in contiguous arrays which blew away anything I'd been able to do with a hash table. My conclusion was that for general-purpose use, the red- black tree was pretty good, despite its relatively high memory overhead of 3 pointers per node:
The example I posted built a mapping with a million entries. A red-black tree of that size needs to chase between 20 and 40 pointers to determine membership. By sheer instruction count alone, that's way more instructions than the Python dict usually has to do, although it's comparable to the number the BTree had to do. The BTree probably has better cache behavior than a red-black tree; for example, all the keys in the million-element example were on the third level, and a query required exactly two pointer-chases to get to the right leaf bucket. All the rest is binary search on 60-120 element contiguous vectors (in fact, sounds a lot like your custom "two-level scheme")
it places easy requirements on the user (supply a strict weak ordering) and provides predictable and smooth performance even asymptotically.
OTOH, it can be very hard to write an efficient, correct "<" ordering, while testing just "equal or not?" can be easier and run quicker than that. Dict comparison is a good example from the Python core: computing "<" for dicts is a nightmare, but computing "==" for dicts is easy (contrast the straightforward dict_equal() with the brain-busting dict_compare() + characterize() pair). This was one of the motivations for introducing "rich comparisons".
On the other hand, hashing requires that the user supply both a hash function and an equality detector which must agree with one-another,
I've rarely found this to be a challenge. For example, for sets that contain sets as elements, a suitable hash function can simply xor the hash codes of the set elements. Since equal sets have equal elements, such a scheme delivers equal hash codes for equal sets, and independent of the order in which set elements get enumerated. In contrast, defining a sensible *total* ordering on sets is a delicate undertaking (yes, I know that "strict weak ordering" is weaker than "total", but in real life you couldn't buy a hot dog with the difference <wink>).
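A minimal sketch of that xor idea, using a small invented wrapper class purely for illustration (equal contents give equal hashes regardless of element order):

from collections import Counter

class FrozenBag(object):
    """Immutable collection whose hash ignores element order."""

    def __init__(self, elements):
        self._elements = tuple(elements)

    def __eq__(self, other):
        return (isinstance(other, FrozenBag) and
                Counter(self._elements) == Counter(other._elements))

    def __ne__(self, other):
        return not self.__eq__(other)

    def __hash__(self):
        # xor is commutative and associative, so equal contents give the
        # same hash no matter what order the elements arrived in.
        h = 0
        for element in self._elements:
            h ^= hash(element)
        return h

a = FrozenBag([1, 2, 3])
b = FrozenBag([3, 2, 1])
print(a == b, hash(a) == hash(b))   # both True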
requires hand-tuning of the hash function for performance, and is rather more unpredictable. We've been talking about adding hash-based containers to the C++ standard library but I'm reluctant on these grounds. It seems to me that when you really care about speed, some kind of hand-coded solution might be a better investment than trying to come up with a good hash function.
I'm ready to believe that hashing is the most appropriate choice for Python, but I wonder what makes the difference?
Well, I'm intimately familiar with the details of how Python dicts and Zope BTrees are implemented, down to staring at the machine instructions generated, and there's no mystery here to me. I'm not familiar with any of the details of what you tried. Understanding speed differences at this level isn't a "general principles" kind of discussion.

I should note that Zope's BTrees pay a lot for playing nice with persistence, about a factor of two: upon visiting and leaving each BTree node, there are messy test+branch sequences ensuring that the object isn't a ghost, notifying the persistence machinery that fields have been accessed and/or changed, and telling the persistence machinery when the object is no longer in active use. Most of these bookkeeping operations can fail too, so there's another layer of "and did that fail?" test+branches around all that.

The saving grace for BTrees (why this doesn't cost a factor of, say, 10) is that each BTree node contains a fair amount of "stuff", so that the guts of each function can do a reasonable amount of useful work. The persistence overhead could be a disaster if visiting an object only moved one bit closer to the result.

But Python's dicts aren't aware of persistence at all, and that did give dicts an ~= factor-of-2 advantage in the example. While they're still not as zippy as dicts after factoring that out, B-Trees certainly aren't pigs.

BTW, note that Python's dicts originally catered only to string keys, as they were the implementation of Python's namespaces, and dicts remain highly optimized for that specific purpose. Indeed, there's a distinct dict lookup routine dedicated to dicts with string keys. Namespaces have no compelling use for range search or lexicographic traversal, just association, and peak lookup speed was the design imperative.
From: "Tim Peters" <tim@zope.com>
There's more to a speedy hash implementation than just the hash function, of course.
'course.
Furthermore, even if I succeeded, it was like black magic: it was inconsistent across my test cases and there was no way to understand why it worked well, and to get a feeling for how it would scale to problems outside those cases.
Python's dictobject.c and .h have extensive comments about how Python's dicts work. Their behavior isn't mysterious, at least not after 10 years of thinking about it <0.9 wink>. Python's dicts also use tricks that I've never seen in print -- many people have contributed clever tricks.
I noticed that, and I think the next time I try hashing I'm going to steal as much as possible from Python's implementation to get a head start. Noticing that also left me with a question: how come everybody in the world hasn't stolen as much as possible from the Python hashing implementation? Are there a billion such 10-years'-tweaked implementations lying around which all perform comparably well?
The example I posted built a mapping with a million entries. A red-black tree of that size needs to chase between 20 and 40 pointers to determine membership. By sheer instruction count alone, that's way more instructions than the Python dict usually has to do, although it's comparable to the number the BTree had to do. The BTree probably has better cache behavior than a red-black tree; for example, all the keys in the million-element example were on the third level, and a query required exactly two pointer-chases to get to the right leaf bucket. All the rest is binary search on 60-120 element contiguous vectors (in fact, sounds a lot like your custom "two-level scheme")
Yeah, I think it ended up being something like that. Of course, the container I ended up with used domain-specific knowledge which would have been inappropriate for general-purpose use.
it places easy requirements on the user (supply a strict weak ordering) and provides predictable and smooth performance even asymptotically.
OTOH, it can be very hard to write an efficient, correct "<" ordering, while testing just "equal or not?" can be easier and run quicker than that. Dict comparison is a good example from the Python core: computing "<" for dicts is a nightmare, but computing "==" for dicts is easy (contrast the straightforward dict_equal() with the brain-busting dict_compare() + characterize() pair).
Well, OK, ordering hash tables is hard, unless the bucket count is a deterministic function of the element count. If they were sorted containers, of course, < would be a simple matter. And I assume that testing equality still involves a lot of hashing... Hmm, looking at the 3 C++ implementations of hashed containers that I have available to me, only one provides operator<(), which is rather strange since the other two implement operator== by first comparing sizes, then iterating through consecutive elements of each set looking for a difference. The implementation supplying operator<() uses a (IMO misguided) design that rehashes incrementally, but it seems to me that if the more straightforward approaches can implement operator==() as described, operator<() shouldn't have to be a big challenge for an everyday hash table. I'm obviously missing something, but what...?
This was one of the motivations for introducing "rich comparisons".
I don't see how that helps. Got a link? Or a clue?
On the other hand, hashing requires that the user supply both a hash function and an equality detector which must agree with one-another,
I've rarely found this to be a challenge. For example, for sets that contain sets as elements, a suitable hash function can simply xor the hash codes of the set elements. Since equal sets have equal elements, such a scheme delivers equal hash codes for equal sets, and independent of the order in which set elements get enumerated. In contrast, defining a sensible *total* ordering on sets is a delicate undertaking (yes, I know that "strict weak ordering" is weaker than "total", but in real life you couldn't buy a hot dog with the difference <wink>).
I don't know what that means. If you represent your sets as sorted containers, getting a strict weak ordering on sets is trivial; you just do it with a lexicographical comparison of the two sequences.
I'm ready to believe that hashing is the most appropriate choice for Python, but I wonder what makes the difference?
Well, I'm intimately familar with the details of how Python dicts and Zope BTrees are implemented, down to staring at the machine instructions generated, and there's no mystery here to me. I'm not familiar with any of the details of what you tried. Understanding speed differences at this level isn't a "general principles" kind of discussion.
I should note that Zope's BTrees pay a lot for playing nice with persistence, about a factor of two: upon visiting and leaving each BTree node, there are messy test+branch sequences ensuring that the object isn't a ghost, notifying the persistence machinery that fields have been accessed and/or changed, and telling the persistence machinery when the object is no longer in active use. Most of these bookkeeping operations can fail too, so there's another layer of "and did that fail?" test+branches around all that.

Aww, heck, you just need a good C++ exception-handling implementation to get rid of the error-checking overheads ;-)

The saving grace for BTrees (why this doesn't cost a factor of, say, 10) is that each BTree node contains a fair amount of "stuff", so that the guts of each function can do a reasonable amount of useful work. The persistence overhead could be a disaster if visiting an object only moved one bit closer to the result.

But Python's dicts aren't aware of persistence at all, and that did give dicts an ~= factor-of-2 advantage in the example. While they're still not as zippy as dicts after factoring that out, B-Trees certainly aren't pigs.

No, I suppose not. But python's dicts are general-purpose containers, and you can put any key you like in there. It's still surprising to me given my (much less than 10 years') experience with hash implementations that you can design something that performs well over all those different cases.

BTW, note that Python's dicts originally catered only to string keys, as they were the implementation of Python's namespaces, and dicts remain highly optimized for that specific purpose. Indeed, there's a distinct dict lookup routine dedicated to dicts with string keys. Namespaces have no compelling use for range search or lexicographic traversal, just association, and peak lookup speed was the design imperative.
Thanks for the perspective!

still-learning-ly y'rs,
dave
On 26 Jun 2002 at 19:20, David Abrahams wrote: [Python's hashing]
Noticing that also left me with a question: how come everybody in the world hasn't stolen as much as possible from the Python hashing implementation? Are there a billion such 10-years'-tweaked implementations lying around which all perform comparably well?
Jean-Claude Wippler and Christian Tismer did some benchmarks against other implementations. IIRC, the only one in the same ballpark was Lua's (which, IIRC, was faster at least under some conditions). -- Gordon http://www.mcmillan-inc.com/
[David Abrahams]
Noticing that also left me with a question: how come everybody in the world hasn't stolen as much as possible from the Python hashing implementation? Are there a billion such 10-years'-tweaked implementations lying around which all perform comparably well?
[Gordon McMillan]
Jean-Claude Wippler and Christian Tismer did some benchmarks against other implementations. IIRC, the only one in the same ballpark was Lua's (which, IIRC, was faster at least under some conditions).
I'd like to see the benchmark. Like Python, Lua uses a power-of-2 table size, but unlike Python uses linked lists for collisions instead of open addressing. This appears to leave it very vulnerable to bad cases (like using [i << 16 for i in range(20000)] as a set of keys -- Python and Lua both grab the last 15 bits of the ints as their hash codes, which means every key maps to the same hash bucket. Looks like Lua would chain them all together. Python breaks the ties quickly via its collision resolution scrambling.).

The Lua string hash appears systematically vulnerable:

static unsigned long hash_s (const char *s, size_t l) {
  unsigned long h = l;  /* seed */
  size_t step = (l>>5)|1;  /* if string is too long, don't hash all its chars */
  for (; l>=step; l-=step)
    h = h ^ ((h<<5)+(h>>2)+(unsigned char)*(s++));
  return h;
}

That hash function would be weak even if it didn't ignore up to 97% of the input characters. OTOH, if it happens not to collide, ignoring up to 97% of the characters eliminates up to 97% of the expense of computing a hash. Etc.

Lua's hashes do appear to get a major benefit from lacking a Python feature: user-defined comparisons can (a) raise exceptions, and (b) mutate the hash table *while* you're looking for a key in it. Those cause the Python implementation lots of expensive pain (indeed, the main reason Python has a distinct lookup function for string-keyed dicts is that it doesn't have to choke itself worrying about #a or #b for builtin strings).

There's a lovely irony here. Python's dicts are fast because they've been optimized to death. When Lua's dicts are fast, it seems more the case it's because they don't worry much about bad cases. That's *supposed* to be Python's trick <wink>.
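The bad case Tim describes is easy to reproduce; a quick sketch, assuming CPython's hash(i) == i for small ints and a power-of-two table mask:

# Every key below has zeroes in its low 16 bits, so masking the hash
# (hash(i) == i for small ints in CPython) leaves exactly one bucket.
keys = [i << 16 for i in range(20000)]
mask = (1 << 15) - 1                 # a 2**15-slot table looks at 15 bits
buckets = set(hash(k) & mask for k in keys)
print(len(buckets))                  # 1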
[David Abrahams]
... Noticing that also left me with a question: how come everybody in the world hasn't stolen as much as possible from the Python hashing implementation? Are there a billion such 10-years'-tweaked implementations lying around which all perform comparably well?
It's a Mystery, and in all directions. Python has virtually no code from, say, Tcl or Perl either, and the latter certainly do some things better than Python does them. I've studied all their hash implementations, but didn't find anything worth stealing <wink>; OTOH, multiple attempts by multiple groups to steal Perl's regexp engine years ago fizzled out in a tarpit of frustration.

Curious: Python independently developed a string hash *very* similar to what later became "the standard" Fowler-Noll-Vo string hash:

http://www.isthe.com/chongo/tech/comp/fnv/

The multiplier is different, and the initial value, but that's it. I'm sure there was no communication in either direction. So ya, given enough time, a billion other projects will rediscover it too.
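For the curious, here is a rough Python rendition of the two multiply-and-xor loops being compared; the FNV constants come from the page above, while the Python-2-style loop is reconstructed from memory, so treat its details as approximate:

def fnv1_32(data):
    # 32-bit FNV-1: multiply by the FNV prime, then xor in each byte
    # (constants from the FNV page linked above).
    h = 2166136261
    for byte in bytearray(data):
        h = (h * 16777619) & 0xFFFFFFFF
        h ^= byte
    return h

def py2ish_string_hash(data):
    # Same multiply-then-xor shape with a different multiplier and seed;
    # an approximate, from-memory rendition of CPython 2.x's string hash.
    data = bytearray(data)
    if not data:
        return 0
    h = data[0] << 7
    for byte in data:
        h = ((1000003 * h) ^ byte) & 0xFFFFFFFF
    return h ^ len(data)

print(hex(fnv1_32(b"python-dev")))
print(hex(py2ish_string_hash(b"python-dev")))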
OTOH, it can be very hard to write an efficient, correct "<" ordering, while testing just "equal or not?" can be easier and run quicker than that. Dict comparison is a good example from the Python core: computing "<" for dicts is a nightmare, but computing "==" for dicts is easy (contrast the straightforward dict_equal() with the brain-busting dict_compare() + characterize() pair).
Well, OK, ordering hash tables is hard, unless the bucket count is a deterministic function of the element count.
I don't know how the latter could help; for that matter, I'm not even sure what it means.
If they were sorted containers, of course, < would be a simple matter.
Yes.
And I assume that testing equality still involves a lot of hashing...
No more times than the common length of the two dicts. It's just:

def dict_equal(dict1, dict2):
    if len(dict1) != len(dict2):
        return False
    for key, value in dict1.iteritems():
        if key not in dict2 or not value == dict2[key]:
            return False
    return True

Searching dict2 for key *may* involve hashing key again (or it may not; for example, Python string objects are immutable and cache their 32-bit hash in the string object the first time it's computed).

There's a world of pain involved in the "==" there, though, as a dict can very well have itself as a value in itself, and the time required for completion appears to be exponential in some pathological cases of that kind (Python does detect the unbounded recursion in such cases -- eventually).
Hmm, looking at the 3 C++ implementations of hashed containers that I have available to me, only one provides operator<(), which is rather strange since the other two implement operator == by first comparing sizes, then iterating through consecutive elements of each set looking for a difference. The implementation supplying operator<() uses a (IMO misguided) design that rehashes incrementally, but it seems to me that if the more straightforward approaches can implement operator==() as described, operator<() shouldn't have to be a big challenge for an everyday hash table.
I'm obviously missing something, but what...?
I don't know, but I didn't follow what you were saying (like, "rehashes incrementally" doesn't mean anything to me). If there's a simpler way to get "the right" answer, I'd love to see it. I once spent two hours trying to prove that the dict_compare() + characterize() pair in Python was correct, but gave up in a mushrooming forest of end cases. In The Beginning, Python implemented dict comparison by materializing the .items(), sorting both, and then doing list comparison. The correctness of that was easy to show. But it turned out that in real life all anyone ever used was == comparison on dicts, and sorting was enormously expensive compared to what was possible. characterize() is a very clever hack Guido dreamt up to get the same result in no more than two passes -- but I've never been sure it's a thoroughly correct hack. OTOH, since nobody appears to care about "<" for dicts, if it's wrong we may never know that.
This was one of the motivations for introducing "rich comparisons".
I don't see how that helps. Got a link? Or a clue?
Sorry, I don't understand the question. When Python funneled all comparisons through cmp(), it wasn't possible for a type implementation to do anything faster for, say, "==", because it had no idea why cmp() was being called. Allowing people to ask for the specific comparison they wanted is part of what "rich comparisons" was about, and speed was one of the reasons for adopting it. Comparing strings for equality/inequality alone is also done faster than needing to resolve string ordering. And complex numbers have no accepted "<" relation at all. So comparing dicts isn't the only place it's easier and quicker to restrict the burden on the type to implementing equality testing. For user-defined types, I've often found it *much* easier. For example, I can easily tell whether two chessboards are equal (do they have the same pieces on the same squares?), but a concept of "<" for chessboards is strained.
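A minimal sketch of that kind of equality-only type; the Board class here is invented for illustration:

class Board(object):
    """Chess position compared by piece placement; deliberately unordered."""

    def __init__(self, placement):
        # placement: mapping of square -> piece, e.g. {"e1": "K", "e8": "k"}
        self._placement = frozenset(placement.items())

    def __eq__(self, other):
        return isinstance(other, Board) and self._placement == other._placement

    def __ne__(self, other):
        return not self.__eq__(other)

    def __hash__(self):
        return hash(self._placement)

a = Board({"e1": "K", "e8": "k"})
b = Board({"e8": "k", "e1": "K"})
print(a == b)   # True: same pieces on the same squares; no "<" is defined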
I don't know what that means.
There's too much of that on both sides here, so I declare this mercifully ended now <0.9 wink>.
If you represent your sets as sorted containers, getting a strict weak ordering on sets is trivial; you just do it with a lexicographical comparison of the two sequences.
And if you don't, that conclusion doesn't follow.
.,, No, I suppose not. But python's dicts are general-purpose containers, and you can put any key you like in there. It's still surprising to me given my (much less than 10 years') experience with hash implementations that you can design something that performs well over all those different cases.
You probably can't a priori, but after a decade people stumble into all the cases that don't work well, and you eventually fiddle the type-specific hash functions and the general implementation until surprises appear to stop. It remains a probabilistic method, though, and there are no guarantees. BTW, I believe that of all Python's builtin types, only the hash function for integers remains in its original form (hash(i) == i). So even if I don't want to, I'm forced to agree that finding a good hash function isn't trivial. [on Zope's B-Trees]
Aww, heck, you just need a good C++ exception-handling implementation to get rid of the error-checking overheads ;-)
I'd love to use C++ for this. This is one of those things that defines 5 families of 4 related data structures each via a pile of .c and .h files that get #include'd and recompiled 5 times after #define'ing a pile of macros. It would be *so* much more pleasant using templates.
... Thanks for the perspective!
still-learning-ly y'rs,
You're too old for that now -- start making money instead <wink>. the-psf-will-put-it-to-good-use-ly y'rs - tim
----- Original Message -----
From: "Tim Peters" <tim.one@comcast.net>
To: "David Abrahams" <david.abrahams@rcn.com>
Cc: <python-dev@python.org>
Sent: Friday, June 28, 2002 1:01 AM
Subject: RE: [Python-Dev] Priority queue (binary heap) python code

[David Abrahams]

... Noticing that also left me with a question: how come everybody in the world hasn't stolen as much as possible from the Python hashing implementation? Are there a billion such 10-years'-tweaked implementations lying around which all perform comparably well?

It's a Mystery, and in all directions. Python has virtually no code from, say, Tcl or Perl either, and the latter certainly do some things better than Python does them. I've studied all their hash implementations, but didn't find anything worth stealing <wink>;
Well of course not!
OTOH, multiple attempts by multiple groups to steal Perl's regexp engine years ago fizzled out in a tarpit of frustration.
Oh, I had the impression that Python's re *was* pilfered Perl.
Curious: Python independently developed a string hash *very* similar to what later became "the standard" Fowler-Noll-Vo string hash:
http://www.isthe.com/chongo/tech/comp/fnv/
The multiplier is different, and the initial value, but that's it. I'm sure there was no communication in either direction. So ya, given enough time, a billion other projects will rediscover it too.
Nifty.
Well, OK, ordering hash tables is hard, unless the bucket count is a deterministic function of the element count.
I don't know how the latter could help; for that matter, I'm not even sure what it means.
I know what I meant, but I was wrong. My brain cell musta jammed. Ordering hash tables is hard if collisions are possible.
And I assume that testing equality still involves a lot of hashing...
No more times than the common length of the two dicts.
Of course.
It's just:
def dict_equal(dict1, dict2):
    if len(dict1) != len(dict2):
        return False
    for key, value in dict1.iteritems():
        if key not in dict2 or not value == dict2[key]:
            return False
    return True
Searching dict2 for key *may* involve hashing key again (or it may not; for example, Python string objects are immutable and cache their 32-bit hash in the string object the first time it's computed).
Tricky. I guess a C++ object could be designed to cooperate with hash tables in that way also.
There's a world of pain involved in the "==" there, though, as a dict can very well have itself as a value in itself, and the time required for completion appears to be exponential in some pathological cases of that kind (Python does detect the unbounded recursion in such cases -- eventually).
Yuck. I wouldn't expect any C++ implementation to handle that issue.

Hmm, looking at the 3 C++ implementations of hashed containers that I have available to me, only one provides operator<(), which is rather strange since the other two implement operator== by first comparing sizes, then iterating through consecutive elements of each set looking for a difference. The implementation supplying operator<() uses a (IMO misguided) design that rehashes incrementally, but it seems to me that if the more straightforward approaches can implement operator==() as described, operator<() shouldn't have to be a big challenge for an everyday hash table.

I'm obviously missing something, but what...?

I don't know, but I didn't follow what you were saying (like, "rehashes incrementally" doesn't mean anything to me).

Get ahold of MSVC7 and look at the hash_set implementation. IIRC how Plauger described it, it is constantly maintaining the load factor across insertions, so there's never a big cost to grow the table. It also keeps the items in each bucket sorted, so hash table comparisons are a lot easier. My gut tells me that this isn't worth what you pay for it, but so far my gut hasn't had very much of any value to say about hashing...

The other implementations seem to implement equality as something like:

template <class T, class Hash, class Compare, class Allocator>
inline bool operator==(const hash_set<T, Hash, Compare, Allocator>& x,
                       const hash_set<T, Hash, Compare, Allocator>& y)
{
    return x.size() == y.size()
        && std::equal(x.begin(), x.end(), y.begin());
}

Which has to be a bug unless they've got a very strange way of defining equality, or some kind of ordering built into the iterators.

If there's a simpler way to get "the right" answer, I'd love to see it. I once spent two hours trying to prove that the dict_compare() + characterize() pair in Python was correct, but gave up in a mushrooming forest of end cases.
I think it's a tougher problem in Python than in languages with value semantics, where an object can't actually contain itself.
In The Beginning, Python implemented dict comparison by materializing the .items(), sorting both, and then doing list comparison. The correctness of that was easy to show. But it turned out that in real life all anyone ever used was == comparison on dicts, and sorting was enormously expensive compared to what was possible. characterize() is a very clever hack Guido dreamt up to get the same result in no more than two passes -- but I've never been sure it's a thoroughly correct hack.
??? I can't find characterize() described anywhere, nor can I find it on my trusty dict objects:
help({}.characterize)
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
AttributeError: 'dict' object has no attribute 'characterize'
OTOH, since nobody appears to care about "<" for dicts, if it's wrong we may never know that.
As long as the Python associative world is built around hash + ==, you're probably OK.
This was one of the motivations for introducing "rich comparisons".
I don't see how that helps. Got a link? Or a clue?
Sorry, I don't understand the question.
Well, you answered it pretty damn well anyway...
Comparing strings for equality/inequality alone is also faster than resolving their full ordering. And complex numbers have no accepted "<" relation at all.
Yeah, good point. C++ has a less<T>/operator< dichotomy mostly to accommodate pointer types in segmented memory models, but there's no such accommodation for complex<T>.
So comparing dicts isn't the only place it's easier and quicker to restrict the burden on the type to implementing equality testing. For user-defined types, I've often found it *much* easier. For example, I can easily tell whether two chessboards are equal (do they have the same pieces on the same squares?), but a concept of "<" for chessboards is strained.
Strained, maybe, but easy. You can do a lexicographic comparison of the square contents.
I don't know what that means.
There's too much of that on both sides here, so I declare this mercifully ended now <0.9 wink>.
I, of course, will drag it on to the bitter end.
[on Zope's B-Trees]
Aww, heck, you just need a good C++ exception-handling implementation to get rid of the error-checking overheads ;-)
I'd love to use C++ for this. This is one of those things that defines 5 families of 4 related data structures each via a pile of .c and .h files that get #include'd and recompiled 5 times after #define'ing a pile of macros. It would be *so* much more pleasant using templates.
I have *just* the library for you. Works with 'C', too! http://www.boost.org/libs/preprocessor/doc/ Believe it or not, people are still pushing this technology to improve compilation times and debuggability of the result.
... Thanks for the perspective!
still-learning-ly y'rs,
You're too old for that now -- start making money instead <wink>.
Sorry, I'll try hard to grow up now. -Dave
On Fri, Jun 28, 2002, David Abrahams wrote:
From: "Tim Peters" <tim.one@comcast.net>
OTOH, multiple attempts by multiple groups to steal Perl's regexp engine years ago fizzled out in a tarpit of frustration.
Oh, I had the impression that Python's re *was* pilfered Perl.
Thank Fredrik for a brilliant job of re-implementing Perl's regex syntax into something that I assume is maintainable (haven't looked at the code myself) *and* Unicode compliant. -- Aahz (aahz@pythoncraft.com) <*> http://www.pythoncraft.com/ Project Vote Smart: http://www.vote-smart.org/
From: "Aahz" <aahz@pythoncraft.com>
Oh, I had the impression that Python's re *was* pilfered Perl.
Thank Fredrik for a brilliant job of re-implementing Perl's regex syntax into something that I assume is maintainable (haven't looked at the code myself) *and* Unicode compliant.
Thanks, Fredrik!
[Aahz]
Thank Fredrik for a brilliant job of re-implementing Perl's regex syntax into something that I assume is maintainable (haven't looked at the code myself) *and* Unicode compliant.
Yes, this seems to be a very good thing for Python. Speedy regexp engines are notoriously hard to maintain cleanly -- at least, so a few successive maintainers of GNU regexp have told me. Difficult points are deterministic matching (avoiding backtracking) and POSIX compliance, the longest-match criterion in particular. For one, I'm pretty happy with the Python regexp implementation, even if it avoids the above points. It has other virtues that are well worth the trade, at least from the experience I have of it so far! So, in a word, thanks too! :-) -- François Pinard http://www.iro.umontreal.ca/~pinard
david wrote:
OTOH, multiple attempts by multiple groups to steal Perl's regexp engine years ago fizzled out in a tarpit of frustration.
Oh, I had the impression that Python's re *was* pilfered Perl.
Tim warned me that the mere attempt to read sources for existing RE implementations was a sure way to destroy my brain, so I avoided that. SRE is a clean-room implementation, using the Python 1.5.2 docs and the regression test suite as the only reference. I've never written a Perl regexp in my life. </F>
Fredrik Lundh <fredrik@pythonware.com>:
Tim warned me that the mere attempt to read sources for existing RE implementations was a sure way to destroy my brain, so I avoided that.
Oh, no! Does that mean I should attach a health warning to the source of Plex??? Greg Ewing, Computer Science Dept, +--------------------------------------+ University of Canterbury, | A citizen of NewZealandCorp, a | Christchurch, New Zealand | wholly-owned subsidiary of USA Inc. | greg@cosc.canterbury.ac.nz +--------------------------------------+
[Oren Tirosh]
A sorted list is a much more general-purpose data structure than a priority queue and can be used to implement a priority queue. [...] The only advantage of a heap is O(1) peek which doesn't seem so critical. [...] the internal order of a heap-based priority queue is very non-intuitive and quite useless for other purposes while a sorted list is, umm..., sorted!
It surely occurred to many of us to sort a file (or any set of data) from the most interesting entry to the least interesting entry, look at the first 5% to 10%, and drop all the rest. A heap is a good way to retain the first few percents of items, without going through the lengths of fully sorting all the rest. By comparison, it would not be efficient to use `.sort()' then truncate. Within a simulation, future events are scheduled while current events are being processed, so we do not have all the events to `.sort()' first. It is likely that heaps would beat insertion after binary search, given of course that both are implemented with the same care, speed-wise. -- François Pinard http://www.iro.umontreal.ca/~pinard
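To make the "keep the best few percent, drop the rest" idea concrete, here is a rough sketch in terms of the heapq module this thread eventually produced; the function name and the "larger is better" convention are just illustrative choices:

import heapq

def n_largest(iterable, n):
    # Maintain a bounded min-heap of the n largest items seen so far;
    # the root is the weakest keeper, so it is the one to evict.
    keep = []
    for item in iterable:
        if len(keep) < n:
            heapq.heappush(keep, item)
        elif item > keep[0]:
            heapq.heapreplace(keep, item)   # pop the weakest, push the newcomer
    keep.sort()
    keep.reverse()
    return keep

# e.g. n_largest([7, 2, 9, 4, 8, 1], 3) -> [9, 8, 7]
# This is O(N log n) overall, versus O(N log N) for sorting everything.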
"Kevin O'Connor" <kevin@koconnor.net> writes:
Any chance something like this could make it into the standard python library? It would save a lot of time for lazy people like myself. :-)
I think this deserves a library PEP. I would also recommend to have a separate heap and priority queue API, to avoid the kind of confusion that Skip ran into. Something like the C++ STL API might be appropriate: the heap functions take a comparator function, on top of which you offer both heapsort and priority queues. The technical issues set aside, the main purpose of a library PEP is to record a commitment from the author to maintain the module, with the option of removing the module if the author runs away, and nobody takes over. Regards, Martin
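For concreteness, a minimal sketch of that layering, written against the heapq module that eventually shipped; the `key` argument stands in for the STL-style comparator and is an assumption of the sketch, not part of any proposed API:

from heapq import heappush, heappop

def heapsort(iterable, key=None):
    # Build a heap, then pop everything off in ascending order.  The
    # optional key function plays the role of the comparator.
    heap = []
    for item in iterable:
        if key is not None:
            heappush(heap, (key(item), item))
        else:
            heappush(heap, item)
    result = []
    while heap:
        entry = heappop(heap)
        if key is not None:
            entry = entry[1]
        result.append(entry)
    return result

# e.g. heapsort([3, 1, 2])               -> [1, 2, 3]
#      heapsort(['bb', 'a', 'ccc'], len) -> ['a', 'bb', 'ccc']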
[Kevin O'Connor]
I often find myself needing priority queues in python, and I've finally broken down and written a simple implementation. [...] Any chance something like this could make it into the standard python library?
Two years ago, I (too!) wrote one (appended below) and I offered it to Guido. He replied he was not feeling like adding into the Python standard library each and every interesting algorithm on this earth. So I did not insist :-). -- François Pinard http://www.iro.umontreal.ca/~pinard
Any chance something like this could make it into the standard python library? It would save a lot of time for lazy people like myself. :-)
def heappush(heap, item):
    pos = len(heap)
    heap.append(None)
    while pos:
        parentpos = (pos - 1) / 2
        parent = heap[parentpos]
        if item <= parent:
            break
        heap[pos] = parent
        pos = parentpos
    heap[pos] = item
def heappop(heap):
    endpos = len(heap) - 1
    if endpos <= 0:
        return heap.pop()
    returnitem = heap[0]
    item = heap.pop()
    pos = 0
    while 1:
        child2pos = (pos + 1) * 2
        child1pos = child2pos - 1
        if child2pos < endpos:
            child1 = heap[child1pos]
            child2 = heap[child2pos]
            if item >= child1 and item >= child2:
                break
            if child1 > child2:
                heap[pos] = child1
                pos = child1pos
                continue
            heap[pos] = child2
            pos = child2pos
            continue
        if child1pos < endpos:
            child1 = heap[child1pos]
            if child1 > item:
                heap[pos] = child1
                pos = child1pos
        break
    heap[pos] = item
    return returnitem
I have read (or at least skimmed) this entire thread now. After I reconstructed the algorithm in my head, I went back to Kevin's code; I admire the compactness of his code. I believe that this would make a good addition to the standard library, as a friend of the bisect module. The only change I would make would be to make heap[0] the lowest value rather than the highest. (That's one thing that I liked better about François Pinard's version, but a class seems too heavy for this, just like it is overkill for bisect [*]. Oh, and maybe we can borrow a few lines of François's description of the algorithm. :-)

I propose to call it heapq.py. (Got a better name? Now or never.)

[*] Afterthought: this could be made into a new-style class by adding something like this to the end of the module:

class heapq(list):
    __slots__ = []
    heappush = heappush
    heappop = heappop

A similar addition could easily be made to the bisect module. But this is very different from François' class, which hides the other list methods.

--Guido van Rossum (home page: http://www.python.org/~guido/)
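A tiny usage sketch of that flipped convention, as it ended up in the shipped heapq module (heap[0] is always the smallest item):

import heapq

heap = []
for x in (10, 20, 15, 5):
    heapq.heappush(heap, x)

assert heap[0] == 5                      # the smallest item sits at the root
assert [heapq.heappop(heap) for _ in range(4)] == [5, 10, 15, 20]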
[Guido van Rossum]
Oh, and maybe we can borrow a few lines of François's description of the algorithm. :-)
Borrow liberally! I would prefer that nothing worth remains un-borrowed from mine, so I can happily get rid of my copy when the time comes! :-)
I propose to call it heapq.py. (Got a better name? Now or never.)
I like `heapq' as it is not a common English word the way `heap' would be, so it is less likely to clash with user-chosen variable names! This principle should be good in general. Sub-classing `heapq' from `list' is a good idea!

P.S. - In other languages, I have been using `string' a lot, and it has been one of the minor irritations when I came to Python that it forced me away from that identifier; so I'm now using `text' everywhere instead. Another example is the name `socket', which is kind of reserved by the module name; I never really know how to name variables holding sockets :-). -- François Pinard http://www.iro.umontreal.ca/~pinard
On Sat, Jul 20, 2002 at 02:06:29AM -0400, Guido van Rossum wrote:
Any chance something like this could make it into the standard python library? It would save a lot of time for lazy people like myself. :-)
I have read (or at least skimmed) this entire thread now. After I reconstructed the algorithm in my head, I went back to Kevin's code; I admire the compactness of his code. I believe that this would make a good addition to the standard library, as a friend of the bisect module.
Thanks!
The only change I would make would be to make heap[0] the lowest value rather than the highest.
I agree this appears more natural, but a priority queue that pops the lowest priority item is a bit odd.
I propose to call it heapq.py. (Got a better name? Now or never.)
[*] Afterthought: this could be made into an new-style class by adding something like this to the end of module:
Looks good to me. Thanks again, -Kevin -- ------------------------------------------------------------------------ | Kevin O'Connor "BTW, IMHO we need a FAQ for | | kevin@koconnor.net 'IMHO', 'FAQ', 'BTW', etc. !" | ------------------------------------------------------------------------
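For what it's worth, a common workaround for the min-versus-max question Kevin raises above is to store negated priorities (or (-priority, item) pairs), so the highest-priority entry pops first from a min-ordered heap; a tiny sketch with made-up priorities:

import heapq

queue = []
for priority, task in [(10, 'low'), (30, 'high'), (20, 'medium')]:
    heapq.heappush(queue, (-priority, task))   # negate: largest priority pops first

negpriority, task = heapq.heappop(queue)
assert (-negpriority, task) == (30, 'high')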
[Guido van Rossum]
[...] I admire the compactness of his code. I believe that this would make a good addition to the standard library, as a friend of the bisect module. [...] The only change I would make would be to make heap[0] the lowest value rather than the highest. I propose to call it heapq.py.
[Kevin O'Connor]
Looks good to me.
In case you are going forward with `heapq', and glancing through my notes, I see that "Courageous" implemented a priority queue algorithm as a C extension, and discussed it on python-list on 2000-05-29. I'm not really expecting you to aim at anything other than a pure Python version, and I'm not pushing nor pulling for it, as I do not have an opinion. In any case, I'll keep these messages a few more days: just ask, and I'll send you a copy of what I saved at the time.

P.S. - I'm quickly losing interest in these bits of C code meant for speed: if I ever need C speed, the wonderful Pyrex tool (from Greg Ewing) gives it to me while allowing the algorithm to be expressed in a language close to Python. I even wonder if Pyrex could not be a proper avenue for the development of some parts of the Python distribution itself. -- François Pinard http://www.iro.umontreal.ca/~pinard
Kevin O'Connor <kevin@koconnor.net>:
I often find myself needing priority queues in python, and I've finally broken down and written a simple implementation.
[snip]

I see that heapq is now in the libraries -- great!

Just one thought: if I want to use this library in an algorithm such as, say, Dijkstra's single-source shortest path algorithm, I would need an additional operation, the decrease-key operation (as far as I can see, heapreplace only works with index 0 -- why is that?) E.g.:

def heapdecrease(heap, index, item):
    """Replace an item at a given index with a smaller one.

    May be used to implement the standard priority queue method
    heap-decrease-key, useful, for instance, in several graph
    algorithms.
    """
    assert item <= heap[index]
    heap[index] = item
    _siftdown(heap, 0, index)

Something might perhaps be useful to include in the library... Or, perhaps, the _siftup and _siftdown methods don't have to be private?

In addition, I guess one would have to implement a sequence class that maintained a map from values to heap indices to be able to use heapdecrease in any useful way (otherwise, how would you know which index to use?). That, however, I guess is not something that would be 'at home' in the heapq module. (Perhaps that is argument enough to avoid including heapdecrease as well? Oh, well...)

-- Magnus Lie Hetland The Anygui Project http://hetland.org http://anygui.org
[Magnus Lie Hetland]
I see that heapq is now in the libraries -- great!
Just one thought: if I want to use this library in an algorithm such as, say, Dijkstra's single-source shortest path algorithm, I would need an additional operation, the decrease-key operation
You'd need more than just that <wink>.
(as far as I can see, heapreplace only works with index 0 -- why is that?)
heapreplace() is emphatically not a decrease-key operation. It's equivalent to a pop-min *followed by* a push, which combination is often called a "hold" operation. The value added may very well be larger than the value popped, and, e.g., the example of an efficient N-Best queue in the test file relies on this. Hold is an extremely common operation in some kinds of event simulations too, where it's also most common to push a value larger than the one popped (e.g., when the queue is ordered by scheduled time, and the item pushed is a follow-up event to the one getting popped).
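For illustration, a minimal sketch of that "hold" pattern against the current heapq API; the event tuples are made up:

import heapq

# An event queue ordered by scheduled time: (time, what) pairs.
events = [(4.0, 'stop'), (1.0, 'start')]
heapq.heapify(events)

# "Hold": pop the earliest event and schedule a follow-up in one step.
# heapq.heapreplace() is exactly pop-min followed by push, done with a
# single sift, and the pushed item may well sort later than the one popped.
now, what = heapq.heapreplace(events, (2.5, 'follow-up to start'))
assert (now, what) == (1.0, 'start')
assert events[0] == (2.5, 'follow-up to start')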
E.g.:
def heapdecrease(heap, index, item):
    """Replace an item at a given index with a smaller one.

    May be used to implement the standard priority queue method
    heap-decrease-key, useful, for instance, in several graph
    algorithms.
    """
    assert item <= heap[index]
That's the opposite of what heapreplace() usually sees.
    heap[index] = item
    _siftdown(heap, 0, index)
Something might perhaps be useful to include in the library...
I really don't see how -- you generally have no way to know the correct index short of an O(N) search. This representation of a priority queue is well-known to be sub-optimal for applications requiring frequent decrease-key (Fibonacci heaps were designed for it, though, and pairing heaps are reported to run faster than Fibonacci heaps in practice, even though one of the PH inventors eventually proved that decrease-key can destroy PH's otherwise good worst-case behavior).
Or, perhaps, the _siftup and _siftdown methods don't have to be private?
You really have to know what you're doing to use them correctly, and it's dubious that _siftup calls _siftdown now (it's most convenient *given* all the uses made of them right now, but, e.g., if a "delete at arbitrary index" function were to be added, _siftdown and _siftup could stand to be refactored -- exposing them would inhibit future improvements).
In addition, I guess one would have to implement a sequence class that maintained a map from values to heap indices to be able to use heapdecrease in any useful way (otherwise, how would you know which index to use?).
Bingo. All the internal heap manipulations would have to know about this mapping too, in order to keep the indices up to date as they move items up and down in the queue. If what you want is frequent decrease-key, you don't want this implementation of priority queues at all.
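To make the bookkeeping concrete, here is a rough sketch of a min-heap that drags such a map along with it, so decrease-key can find an entry in O(1); the class and method names are invented, items are assumed unique and hashable, and none of this is part of heapq:

class IndexedMinHeap:
    """Min-heap of (key, item) pairs plus an item -> position map,
    so decrease_key() needs no O(N) search.  Illustrative sketch only;
    items must be unique and hashable."""

    def __init__(self):
        self.heap = []   # list of (key, item) pairs
        self.pos = {}    # item -> index into self.heap

    def push(self, key, item):
        self.heap.append((key, item))
        self.pos[item] = len(self.heap) - 1
        self._siftdown(len(self.heap) - 1)

    def pop(self):
        last = self.heap.pop()
        if not self.heap:
            del self.pos[last[1]]
            return last
        smallest = self.heap[0]
        del self.pos[smallest[1]]
        self.heap[0] = last
        self.pos[last[1]] = 0
        self._siftup(0)
        return smallest

    def decrease_key(self, item, newkey):
        i = self.pos[item]                     # O(1), thanks to the map
        assert newkey <= self.heap[i][0]
        self.heap[i] = (newkey, item)
        self._siftdown(i)

    def _swap(self, i, j):
        self.heap[i], self.heap[j] = self.heap[j], self.heap[i]
        self.pos[self.heap[i][1]] = i          # every move updates the map
        self.pos[self.heap[j][1]] = j

    def _siftdown(self, i):
        # Move the entry at i toward the root while it beats its parent.
        while i and self.heap[i][0] < self.heap[(i - 1) // 2][0]:
            self._swap(i, (i - 1) // 2)
            i = (i - 1) // 2

    def _siftup(self, i):
        # Move the entry at i toward the leaves while a child beats it.
        n = len(self.heap)
        while 1:
            child = 2 * i + 1
            if child + 1 < n and self.heap[child + 1][0] < self.heap[child][0]:
                child = child + 1
            if child >= n or self.heap[i][0] <= self.heap[child][0]:
                break
            self._swap(i, child)
            i = child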
participants (16)

- Aahz
- Alex Martelli
- David Abrahams
- Fredrik Lundh
- Gordon McMillan
- Greg Ewing
- Guido van Rossum
- Gustavo Niemeyer
- Kevin O'Connor
- Magnus Lie Hetland
- martin@v.loewis.de
- Oren Tirosh
- pinard@iro.umontreal.ca
- Skip Montanaro
- Tim Peters
- Tim Peters