Database specialized in storing directed graphs?
castironpi at gmail.com
Tue Oct 28 22:17:10 CET 2008
On Oct 27, 7:32 pm, Carl Banks <pavlovevide... at gmail.com> wrote:
> I was wondering if anyone had any advice on this.
> This is not to study graph theory; I'm using the graph to represent a
> problem domain. The graphs could be arbitrarily large, and could
> easily have millions of nodes, and most nodes have a substantial
> amount of data associated with them. Obviously I don't want a whole
> such graph in memory at once, so libraries the just deal with in-
> memory graphs are out.
> I know I could implement this with a relational DB, and I'd be fine
> with a library that was built on top of one. But I'm hoping for
> specialzed behavior useful for graphs.
> For example, suppose I have a set of nodes called A. It would be
> useful to know if any of these nodes are connected by paths that
> include no nodes in A. I could, of course, do that by reading from
> the database and following the paths, but I clearly don't want to do
> that. I would want instead to somehow cache the outside connectedness
> relationship between different nodes of A. But then what happens if
> something changes elsewhere. How is the cache for A notified, and can
> it be updated efficiently using graph theorems, rather than
> It's very tricky; that's why I hope someone else has done it.
> I'm guessing no.
> Carl Banks
Are you just looking for a persistent graph? The usual options are
'shelve', 'sqllite', 'mmap', and 'ctypes.Structure'. Or did you need
a special structure for the 'outside connectedness' properties?
Storing that is the usual time-space tradeoff. (Actually, I think
that term refers to something slightly distinct. This would be more
properly termed read-time/write-time trade off, roughly.)
Assuming you pick an RDB, you'd have a table of vertices and a table
of edges. Did you want 'set A' to be a table and persist between runs
of the program? If not, just keep a set of 2-tuples of nodes that
satisfy that property. I think the set S you're after is: all x,y
such that x is a vertex in A, y is a vertex in A, and there exists a P
such that P is a path, P starts at x, P ends at y, and vertex v in P
implies that v is x, v is y, or v is not in A. I don't know if you
can cache any information about A that gives you any shortcuts in
calculating S, or what the running time or running space of the
fastest/smallest algorithm is. At worst, it's O( |V| * |E| ) on each
change to V or E. Unverified.
You might be able to store the shortest path in a mapping from 2-
tuples to paths. Or better yet, map each node in A to the set of all
nodes reachable from it only entering A once. Then, you might save
time on either adding a node/vertex to A or G, or removing one from A
or G, or both. Maybe mapping each node in G to that relative set.
You didn't say whether the graph was directed and/or cyclic.
A good place to start would be, if you add a node to G, can S
decrease? No, because the original path P still exists. If you
remove one from A, still in G, S can stay the same size, might
decrease. If you remove a node from G, not in A, same, etc.
More information about the Python-list