Database specialized in storing directed graphs?

bearophileHUGS at lycos.com bearophileHUGS at lycos.com
Tue Oct 28 12:48:06 EDT 2008


Sorry for the delayed answer, Carl Banks; there are problems with
Google Groups.

> This is not to study graph theory; I'm using the graph to represent a
> problem domain.  The graphs could be arbitrarily large, and could
> easily have millions of nodes, and most nodes have a substantial
> amount of data associated with them.  Obviously I don't want a whole
> such graph in memory at once, so libraries that just deal with in-
> memory graphs are out.

That doesn't sound like a problem for modern PCs.
I think you can keep the whole graph topology in RAM and the node
data on disk (for example in a file or in a DB). The topology is
represented by the arcs (you have said nothing regarding arc data, so
I assume there is none) and the nodes. In RAM a 32-bit unsigned
integer suffices to represent the index of the node data stored on
disk. If memory becomes tight you can use just 3 bytes per index
(2^24 = about 16 million different nodes), but generally the memory
required for the nodes is very small compared to the memory needed to
store the arcs.
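
A minimal sketch of that split, assuming the node payloads go in the
standard shelve module (the file name and the two helpers here are
just placeholders):

from array import array
import shelve

node_data = shelve.open("nodes.db")  # str(index) -> node payload
adjacency = {}                       # node index -> array of successors

def add_node(idx, payload):
    node_data[str(idx)] = payload
    adjacency[idx] = array("I")      # "I" is usually 32-bit unsigned

def add_arc(src, dst):
    adjacency[src].append(dst)

add_node(0, {"label": "start"})
add_node(1, {"label": "end"})
add_arc(0, 1)
print node_data[str(adjacency[0][0])]["label"]  # payload read from disk
node_data.close()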

You haven't said how many arcs there are, in total or on average per
node, nor whether such arcs are directed or undirected.

Anyway, using my Graph class (which stores each arc twice), the
following takes about 1 minute and 1.3 GB of RAM (1 million nodes, 10
arcs per node):

from graph import Graph          # the Graph class mentioned above
from random import randrange
g = Graph()
N = 1000000                      # 1 million nodes
g.addNodes(xrange(N))
# About 10 random arcs per node (each arc is stored twice).
for i in xrange(N * 10):
    g.addArc(randrange(N), randrange(N))


You have said the graph "could easily have millions of nodes", and
every node may have tens of arcs or more. ("Arbitrarily large" is an
unsolvable requirement, because every program has limits; no program
can work on an "arbitrarily large" dataset.) At that scale Python data
structures become too costly in RAM. With a lower-level language like
D/C/C++ you can manage a bigger graph. You can use the Boost Graph
Library, but a homemade graph structure may also suffice for your
purposes.

For the problem you have explained I think a very simple graph
representation may suffice: an array of integer pairs (starting_index,
len), where starting_index (which can also be a pointer) points into a
second array of indexes/pointers listing the outbound arcs, and len is
the number of outbound arcs of node n. If memory gets tight you can
split the first array in two and use an array of bytes for the lengths
(if a node can have more than 255 outbound arcs you may need a short).
Note that if you use indexes then a Python array.array (or NumPy)
suffices.
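
A tiny sketch of that layout built from an arc list; the example
graph and the successors() helper are just for illustration:

from array import array

# Example graph: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0
arcs = [(0, 1), (0, 2), (1, 2), (2, 0)]
nnodes = 3

# Count the outbound arcs per node, then compute starting indexes.
lengths = array("I", [0] * nnodes)
for src, dst in arcs:
    lengths[src] += 1

starts = array("I", [0] * nnodes)
for n in xrange(1, nnodes):
    starts[n] = starts[n - 1] + lengths[n - 1]

# Fill the flat array of outbound arc targets.
targets = array("I", [0] * len(arcs))
pos = list(starts)               # next free slot for each node
for src, dst in arcs:
    targets[pos[src]] = dst
    pos[src] += 1

def successors(n):
    # Outbound arcs of node n, a slice of the flat array.
    return targets[starts[n] : starts[n] + lengths[n]]

print list(successors(0))        # [1, 2]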

In this situation, if nnodes = 10_000_000 and narcs/node = 40 (total
arcs = 40 * 10_000_000), with 4-byte indexes everywhere:
nnodes * narcs * 4 + nnodes * (4 + 4) = 1_680_000_000 bytes, an amount
of RAM that is often available on modern PCs.

On a 64-bit machine the indexes take the same memory, but pointers
take twice as much.

In "memory saving" mode:
nnodes * narcs * 3 + nnodes * (3 + 1) = 1_240_000_000 bytes.

A handier compromise is:
nnodes * narcs * 3 + nnodes * (4 + 4) = 1_280_000_000 bytes.
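
All three totals come from the same formula (arc array plus per-node
(starting_index, len) records); a quick sanity check:

def graph_bytes(nnodes, narcs, arc_index, start, length):
    # Sizes of the index/start/length fields are given in bytes.
    return nnodes * narcs * arc_index + nnodes * (start + length)

print graph_bytes(10**7, 40, 4, 4, 4)   # 1680000000
print graph_bytes(10**7, 40, 3, 3, 1)   # 1240000000
print graph_bytes(10**7, 40, 3, 4, 4)   # 1280000000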

> But then what happens if something changes elsewhere.

Regarding the data structure: if you use arrays as I have explained
and your updates aren't very frequent, you can allocate small extra
arrays to store the new arcs coming out of a node (though to do this
you may prefer pointers over indexes). When you have accumulated a lot
of updates you can save everything to disk and then regenerate the
whole data structure.
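
A rough sketch of that update scheme, building on successors() from
the earlier example (the threshold and helper names are hypothetical):

overflow = {}                # node index -> arc targets added later
pending = [0]                # updates since the last rebuild
REBUILD_THRESHOLD = 100000   # arbitrary; tune for your workload

def rebuild():
    # Save all arcs to disk and regenerate starts/lengths/targets
    # from the merged arc list (details omitted), then reset.
    overflow.clear()
    pending[0] = 0

def add_arc_later(src, dst):
    overflow.setdefault(src, []).append(dst)
    pending[0] += 1
    if pending[0] > REBUILD_THRESHOLD:
        rebuild()

def all_successors(n):
    # The compact part plus any arcs added since the last rebuild.
    return list(successors(n)) + overflow.get(n, [])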

Bye,
bearophile
