[Python-Dev] Re: heaps

Thu, 24 Apr 2003 16:20:50 -0700

In article <001901c30aab$bf31c060$b6b8958d@oemcomputer>,
 "Raymond Hettinger" <python@rcn.com> wrote:

> > I'd very much like to see the current heapq replaced with a different
> > interface in time for 2.3.  I believe that an opaque object is better,
> > since it allows more flexibility later.  
> 
> I'm quite pleased with the version already in CVS.  It is a small
> masterpiece of exposition, sophistication, simplicity, and speed.
> A class based interface is not necessary for every algorithm.

It has some elegance, but omits basic operations that are necessary for 
many heap-based algorithms and are not provided by this interface.  
Specifically, the three algorithms that use heaps in my upper-division 
undergraduate algorithms classes are heapsort (for which heapq works 
fine, but you would generally want to use L.sort() instead), Dijkstra's 
algorithm (and its relatives such as A* and Prim), which needs the 
ability to decrease keys, and event-queue-based plane sweep algorithms 
(e.g. for finding all crossing pairs in a set of line segments) which 
need the ability to delete items from other than the top.

To see how important the lack of these operations is, I decided to 
compare two implementations of Dijkstra's algorithm.  The priority-dict 
implementation from 
http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/119466 takes as 
input a graph, coded as nested dicts {vertex: {neighbor: edge length}}.
This is a variation of a graph coding suggested in one of Guido's essays 
that, as Raymond suggests, avoids using a separate class based interface.

Here's a simplification of my dictionary-based Dijkstra implementation:

def Dijkstra(G,start,end=None):
    D = {}   # dictionary of final distances
    P = {}   # dictionary of predecessors
    Q = priorityDictionary()   # est.dist. of non-final vert.
    Q[start] = 0
    for v in Q:
        D[v] = Q[v]
        for w in G[v]:
            vwLength = D[v] + G[v][w]
            if w not in D and (w not in Q or vwLength < Q[w]):
                Q[w] = vwLength
                P[w] = v
   return (D,P)

Here's a translation of the same implementation to heapq (untested 
since I'm not running 2.3).  Since there is no decrease in heapq, nor 
any way to find and remove old keys, I changed the algorithm to add new 
tuples for each new key, leaving the old tuples in place until they 
bubble up to the top of the heap.

def Dijkstra(G,start,end=None):
    D = {}   # dictionary of final distances
    P = {}   # dictionary of predecessors
    Q = [(0,None,start)]  # heap of (est.dist., pred., vert.)
    while Q:
        dist,pred,v = heappop(Q)
        if v in D:
            continue  # tuple outdated by decrease-key, ignore
        D[v] = dist
        P[v] = pred
        for w in G[v]:
            heappush(Q, (D[v] + G[v][w], v, w))
    return (D,P)

My analysis of the differences between the two implementations:

- The heapq version is slightly complicated (the two lines 
if...continue) by the need to explicitly ignore tuples with outdated 
priorities.  This need for inserting low-level data structure 
maintenance code into higher-level algorithms is intrinsic to using 
heapq, since its data is not structured in a way that can support 
efficient decrease key operations.

- Since the heap version had no way to determine when a new key was 
smaller than an old one, the heapq implementation needed two separate 
data structures to maintain predecessors (middle elements of tuples for 
items in queue, dictionary P for items already removed from queue).  In 
the dictionary implementation, both types of items stored their 
predecessors in P, so there was no need to transfer this information 
from one structure to another.

- The dictionary version is slightly complicated by the need to look up 
old heap keys and compare them with the new ones instead of just 
blasting new tuples onto the heap.  So despite the more-flexible heap 
structure of the dictionary implementation, the overall code complexity 
of both implementations ends up being about the same.

- Heapq forced me to build tuples of keys and items, while the 
dictionary based heap did not have the same object-creation overhead 
(unless it's hidden inside the creation of dictionary entries).  On the 
other hand, since I was already building tuples, it was convenient to 
also store predecessors in them instead of in some other structure.

- The heapq version uses significantly more storage than the dictionary: 
proportional to the number of edges instead of the number of vertices.

- The changes I made to Dijkstra's algorithm in order to use heapq might 
not have been obvious to a non-expert; more generally I think this lack 
of flexibility would make it more difficult to use heapq for 
cookbook-type implementation of textbook algorithms.

- In Dijkstra's algorithm, it was easy to identify and ignore outdated 
heap entries, sidestepping the inability to decrease keys.  I'm not 
convinced that this would be as easy in other applications of heaps.

- One of the reasons to separate data structures from the algorithms 
that use them is that the data structures can be replaced by ones with 
equivalent behavior, without changing any of the algorithm code.  The 
heapq Dijkstra implementation is forced to include code based on the 
internal details of heapq (specifically, the line initializing the heap 
to be a one element list), making it less flexible for some uses.  The 
usual reason one might want to replace a data structure is for 
efficiency, but there are others: for instance, I teach various 
algorithms classes and might want to use an implementation of Dijkstra's 
algorithm as a testbed for learning about different priority queue data 
structures.  I could do that with the dictionary-based implementation 
(since it shows nothing of the heap details) but not the heapq one.

Overall, while heapq was usable for implementing Dijkstra, I think it 
has significant shortcomings that could be avoided by a more 
well-thought-out interface that provided a little more functionality and 
a little clearer separation between interface and implementation.

-- 
David Eppstein                      http://www.ics.uci.edu/~eppstein/
Univ. of California, Irvine, School of Information & Computer Science