Database specialized in storing directed graphs?

Tue Oct 28 17:17:10 EDT 2008

On Oct 27, 7:32 pm, Carl Banks <pavlovevide... at gmail.com> wrote:
> I was wondering if anyone had any advice on this.
>
> This is not to study graph theory; I'm using the graph to represent a
> problem domain.  The graphs could be arbitrarily large, and could
> easily have millions of nodes, and most nodes have a substantial
> amount of data associated with them.  Obviously I don't want a whole
> such graph in memory at once, so libraries the just deal with in-
> memory graphs are out.
>
> I know I could implement this with a relational DB, and I'd be fine
> with a library that was built on top of one.  But I'm hoping for
> specialzed behavior useful for graphs.
>
> For example, suppose I have a set of nodes called A.  It would be
> useful to know if any of these nodes are connected by paths that
> include no nodes in A.  I could, of course, do that by reading from
> the database and following the paths, but I clearly don't want to do
> that.  I would want instead to somehow cache the outside connectedness
> relationship between different nodes of A.  But then what happens if
> something changes elsewhere.  How is the cache for A notified, and can
> it be updated efficiently using graph theorems, rather than
> regenerated?
>
> It's very tricky; that's why I hope someone else has done it.
>
> I'm guessing no.
>
> Carl Banks

Are you just looking for a persistent graph?  The usual options are
'shelve', 'sqllite', 'mmap', and 'ctypes.Structure'.  Or did you need
a special structure for the 'outside connectedness' properties?

Storing that is the usual time-space tradeoff.  (Actually, I think
that term refers to something slightly distinct.  This would be more
properly termed read-time/write-time trade off, roughly.)

Assuming you pick an RDB, you'd have a table of vertices and a table
of edges.  Did you want 'set A' to be a table and persist between runs
of the program?  If not, just keep a set of 2-tuples of nodes that
satisfy that property.  I think the set S you're after is: all x,y
such that x is a vertex in A, y is a vertex in A, and there exists a P
such that P is a path, P starts at x, P ends at y, and vertex v in P
implies that v is x, v is y, or v is not in A.  I don't know if you
can cache any information about A that gives you any shortcuts in
calculating S, or what the running time or running space of the
fastest/smallest algorithm is.  At worst, it's O( |V| * |E| ) on each
change to V or E.  Unverified.

You might be able to store the shortest path in a mapping from 2-
tuples to paths.  Or better yet, map each node in A to the set of all
nodes reachable from it only entering A once.  Then, you might save
time on either adding a node/vertex to A or G, or removing one from A
or G, or both.  Maybe mapping each node in G to that relative set.
You didn't say whether the graph was directed and/or cyclic.

A good place to start would be, if you add a node to G, can S
decrease?  No, because the original path P still exists.  If you
remove one from A, still in G, S can stay the same size, might
decrease.  If you remove a node from G, not in A, same, etc.