Spreadsheet-style dependency tracking

Sun Oct 17 02:39:33 EDT 2010

In article <87y69xbz6h.fsf at mid.deneb.enyo.de>
Florian Weimer  <fw at deneb.enyo.de> wrote:
>Are there libraries which implement some form of spreadsheet-style
>dependency tracking?  The idea is to enable incremental updates to
>some fairly convoluted computation.  I hope that a general dependency
>tracking framework would avoid making the computation even more
>convoluted and difficult to change.

Don't know of any libraries myself, but I wrote some code to do
topological sort for dependencies, which I can paste here.  It
is also worth noting that a lot of spreadsheets cheat: they just
repeat a sheet-wide computation until values stop changing (or a
cycle count limit runs out).

The code below uses a datatype that has a get_deps() that gives
you a list of "cells I depend on for my value", i.e., for a
spreadsheet, if a cell had the rule "my value = (A1 + B2) / C3"
then get_deps() would have to return the list [cell_holding_A1,
cell_holding_B2, cell_holding_C3] (in any order).  It assumes that
it can write on cell.visited.  It returns (as a tuple) a list giving
"some evaluation order", and a list of cycles found.  If the cycle
list is empty, evaluating the cells in the given evaluation order,
once, is sufficient to update the sheet.  If the cycle list is
nonempty, these are cells containing some kind of circular
dependencies, and each element of the cycle list is itself a list
of cells.

(This code was part of an exercise, as it were, and I never finished
the whole thing, just enough to test the topo-sorting, and some
other exercise-y bits.  I added cycle-collecting as a later item.
My testing was not terribly thorough either, but I think the code
is still functional and correct.)

      --------

# Topological sort due to Tarjan (1974, 1976)
# <http://en.wikipedia.org/wiki/Topological_sorting>
#
# Return a list of cells in some suitable order, plus
# a list (not necessarily complete) that says "these loops
# were found" (in which case the list is not truly sorted).
def tsort(sheet):
    class Locals(object):
        def __init__(self):
            self.S = {}
            self.L = []
            self.cycles = []
    class FakeCell(object):
        def __init__(self):
            self.deps = []
        def __str__(self):
            return 'mark'

    # initial result list (r.L) is empty
    r = Locals()

    # Do a depth first search of "cell" and its dependents.
    #
    # The recursion is flattened here to avoid blowing out
    # Python's function-call stack, regardless of how deep
    # the code gets.  See alternative recursive version below.
    def visit(cell):
        mark = FakeCell()   # marks "recursion points" w/in worklist
        worklist = [cell]   # list of cells we still have to visit
        nstack = []         # cell stack during (flattened) recursion
        while worklist:
            cell = worklist.pop()
            # If we hit the "mark", we want to pop out of a flattened
            # recursion.  We could just store the cell itself, but:
            #  - we need a separate stack of just the "pushed" cells
            #    that are being recursed-on, and
            #  - we need to know when we have reached that point
            # so worklist has "mark"s wherever we should pop a cells
            # off the cells-stack "nstack" instead.  (At which point
            # the only thing left to do is add it to the result-list.)
            if cell is mark:
                cell = nstack.pop()
                #print 'popped out of recursion, cell:', cell
                r.L.append(cell)
                continue
            # If we have seen this cell before, don't re-work it.
            # If the way we saw this cell before was "it's on our
            # cells stack", we have a cycle (otherwise this a
            # harmless revisit due to being somewhere else in the
            # overall graph).
            if cell.visited:
                #print 'revisit:', cell, 'stack:', nstack
                if cell in nstack:
                    # The cycle runs from nstack[nstack.index(cell)]
                    # through the end of nstack.
                    r.cycles.append(nstack[nstack.index(cell):])
                continue
            #print 'first visit for:', cell, 'deps:', cell.get_deps()
            #print 'nstack:', nstack
            cell.visited = True
            # TODO: optimize if cell.get_deps() is empty (we'll add a mark
            # and then just pop it)
            #print 'do deps:'
            nstack.append(cell)
            #print 'worklist before:', worklist
            worklist = (worklist + [mark] +
                [r.S[ref.get()] for ref in cell.get_deps()])
            #print 'worklist after:', worklist

#    def visit(node, nstack = None):
#        if nstack is None:
#            nstack = []
#        if node.visited:
#            if node in nstack:
#                r.cycles = True
#            return
#        node.visited = True
#        nstack.append(node)
#        for c in node.deps:
#            visit(r.S[c], nstack)
#        nstack.pop()
#        r.L.append(node)

    # Build "set S of all cells" (r.S) which gives their dependencies.
    # By indexing by cell, we can find cells from dependencies in visit().
    for row in sheet:
        for cell in row:
            if cell:
                r.S[cell] = cell
                cell.visited = False

    # Now simply (initial-)visit all the cells.
    for cell in r.S.itervalues():
        visit(cell)

    # Now r.L defines an evaluation order; it has at least one cycle in it
    # if r.cycles is nonempty.
    return (r.L, r.cycles)
-- 
In-Real-Life: Chris Torek, Wind River Systems
Salt Lake City, UT, USA (40°39.22'N, 111°50.29'W)  +1 801 277 2603
email: gmail (figure it out)      http://web.torek.net/torek/index.html