very large graph
castironpi at gmail.com
Tue Jun 24 18:23:29 CEST 2008
On Jun 24, 5:58 am, "A.T.Hofkamp" <h... at se-162.se.wtb.tue.nl> wrote:
> On 2008-06-24, MRAB <goo... at mrabarnett.plus.com> wrote:
> > On Jun 24, 1:26 am, chrispoliq... at gmail.com wrote:
> >> I need to represent the hyperlinks between a large number of HTML
> >> files as a graph. My non-directed graph will have about 63,000 nodes
> >> and and probably close to 500,000 edges.
> >> I have looked into igraph (http://cneurocvs.rmki.kfki.hu/igraph/doc/
> >> python/index.html) and networkX (https://networkx.lanl.gov/wiki) for
> >> generating a file to store the graph, and I have also looked into
> >> Graphviz for visualization. I'm just not sure which modules are
> >> best. I need to be able to do the following:
> Afaik Graphviz is not good at abstracting the graph, which you may need here.
> A page with 63,000 circles on it, and 500,000 edges will probably come out of
> the printer as a black sheet of paper.
> (8"x11" paper, 1 circle is 1/5", then you have only 2200 circles at one sheet.
> You need a factor 28 more circles which leads to a circle of about 0.007".)
> Even when actual paper format would not be a problem, you will need some
> abstraction and/or tools, as you will not find anything manually in an ocean of
> 63,000 elements.
> One area you may want to start looking for tools is in state graphs, where the
> set of possible states of an entire system is unfolded. These things go up to
> over a million states, so you only have a 'small' problem there...
> >> 1) The names of my nodes are not known ahead of time, so I will
> >> extract the title from all the HTML files to name the nodes prior to
> >> parsing the files for hyperlinks (edges).
> >> 2) Every file will be parsed for links and nondirectional connections
> >> will be drawn between the two nodes.
> >> 3) The files might link to each other so the graph package needs to
> >> be able to check to see if an edge between two nodes already exists,
> >> or at least not double draw connections between the two nodes when
> >> adding edges.
> >> I'm relatively new to graph theory so I would greatly appreciate any
> >> suggestions for filetypes. I imagine doing this as a python
> >> dictionary with a list for the edges and a node:list paring is out of
> >> the question for such a large graph?
> > Perhaps a dictionary where the key is a node and the value is a set of
> > destination nodes?
> For undirected edges, you could make an Edge class and have a set of Edge's
> (where two Edge objects are equal when they connect the same nodes).
> I don't expect 500,000 elements in a set to be a problem.
> Albert- Hide quoted text -
> - Show quoted text -
Readability counts: Do you need to be able to examine the datafile by
hand? If not, investigate the 'shelve' module. You can go two ways.
One is to map each node to a list of nodes. It's probably more
intuitive, but it needs repeated code:
shelf['pageA'].add( pageB )
shelf['pageB'].add( pageA )
Which is incorrect as is anyway-- updates to shelved objects aren't
a.add( pageB )
b.add( pageA )
is closer. Otherwise, two is to store a 2-tuple of node keys in a
shelfVerts[ 'pageA' ], shelfVerts[ 'pageB' ]
) ]= True
More information about the Python-list